Thursday, July 24, 2025

Using Python To Access archive.today, July 2025

It seems like a lot of the previous software wrappers for interacting with archive.today (and archive.is, archive.ph, etc.) from the command line are either outdated or broken. So, here's a Python script to automatically submit links from the command line to archive.today and retrieve their archived URLs.

From testing, it seems like it's best to keep the delay around 8 to 10 seconds. If you go too fast, Cloudflare will begin to yell at you and start throwing 429 errors.
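
One way to act on that is to back off automatically whenever a 429 comes back. Below is a rough sketch that is not part of the script: next_delay, its cap, and its jitter parameter are names made up for illustration, and the doubling schedule is an assumption rather than anything archive.today or Cloudflare documents.

```python
import random

def next_delay(base_delay, consecutive_429s, cap=300.0, jitter=0.25):
    """Return how many seconds to wait before the next submission.

    The wait doubles for every consecutive 429 response, is capped at
    `cap` seconds, and is then nudged by up to +/- `jitter` (a fraction
    of the delay) so requests don't land on a fixed rhythm.
    """
    delay = min(cap, base_delay * (2 ** consecutive_429s))
    spread = delay * jitter
    return delay + random.uniform(-spread, spread)

if __name__ == "__main__":
    for misses in range(4):
        print(f"after {misses} consecutive 429s: wait ~{next_delay(10, misses):.1f}s")
```

The counter would be reset to zero after any successful submission, so a single 429 only slows things down temporarily.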

As long as you've received a "WIP" URL from archive.today, it should be archived shortly after, though it may not appear immediately.

Add your own random user-agent. :)
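
For the user-agent, a minimal sketch of picking one at random per run might look like the following. The strings in USER_AGENTS are placeholders for illustration only, and pick_user_agent and build_headers are hypothetical helper names, not part of the script below.

```python
import random

# Placeholder strings for illustration; swap in the user-agents you actually want to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0",
]

def pick_user_agent(pool=USER_AGENTS):
    """Return one user-agent string chosen at random from the pool."""
    return random.choice(pool)

def build_headers(pool=USER_AGENTS):
    """Build a headers dict in the shape that session.headers.update() accepts."""
    return {"User-Agent": pick_user_agent(pool)}

if __name__ == "__main__":
    print(build_headers())
```

In the script, the result of build_headers() could be passed to session.headers.update() in place of the dict with the empty "User-Agent" string.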

'''

% python3 archiveToday.py --help
usage: archiveToday.py [-h] --urls URLS [--delay DELAY] [--output OUTPUT]
Batch archive URLs with archive.today
options:
  -h, --help       show this help message and exit
  --urls URLS      Path to file containing URLs (one per line)
  --delay DELAY    Delay between submissions in seconds
  --output OUTPUT  CSV file to save results
  
'''


import requests
import time
import os
import argparse
import csv
import re

def archive_url(session, url):
    try:
        print(f"Archiving: {url}")
        resp = session.get("https://archive.ph/submit/", params={"url": url}, allow_redirects=False)

        # If already archived, follow 302 redirect
        if resp.status_code == 302:
            archived_url = resp.headers.get("Location")
            # Match both 4 and 5 character archive slugs
            match = re.match(r"(https://archive\.ph/\w{4,5})", archived_url)
            if match:
                archived_url = match.group(1)
            print(f"Already archived: {archived_url}")
            return url, archived_url

        # If needs archiving, follow refresh to /wip/
        if resp.status_code == 200:
            refresh_header = resp.headers.get("refresh", "")
            match = re.search(r'url=(https?://[^\s]+)', refresh_header)
            if not match:
                print("WIP URL not found in refresh header.")
                return url, None

            wip_url = match.group(1)
            print(f"Archiving in progress (WIP): {wip_url}")

            final_resp = session.get(wip_url, allow_redirects=True)
            if final_resp.status_code == 200:
                archived_url = final_resp.url.replace("/wip/", "/")
                print(f"Archived: {archived_url}")
                return url, archived_url
            else:
                print(f"Failed to retrieve from WIP URL. Status: {final_resp.status_code}")
                return url, None

        print(f"Unexpected status code: {resp.status_code}")
        return url, None

    except Exception as e:
        print(f"Exception archiving {url}: {e}")
        return url, None

def read_urls_from_file(filename):
    if not os.path.exists(filename):
        print(f"File not found: {filename}")
        return []

    with open(filename, 'r') as f:
        urls = [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]
    return urls


def main():
    parser = argparse.ArgumentParser(description="Batch archive URLs with archive.today")
    parser.add_argument("--urls", required=True, help="Path to file containing URLs (one per line)")
    parser.add_argument("--delay", type=int, default=30, help="Delay between submissions in seconds")
    parser.add_argument("--output", default="archived_results.csv", help="CSV file to save results")
    args = parser.parse_args()
    urls = read_urls_from_file(args.urls)

    if not urls:
        print("No URLs to archive.")
        return

    session = requests.Session()
    session.headers.update({
        "User-Agent": ""
    })

    with open(args.output, "w", newline='', encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Original URL", "Archived URL"])

        for idx, url in enumerate(urls, 1):
            print(f"\n[{idx}/{len(urls)}]")
            original, archived = archive_url(session, url)
            writer.writerow([original, archived or ""])
            if idx < len(urls):
                print(f"Waiting {args.delay} seconds before next...")
                time.sleep(args.delay)

    print(f"\nFinished. Results saved to {args.output}")

if __name__ == "__main__":
    main()
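
To sanity-check the two regular expressions used in archive_url without touching the network, here is a small offline sketch. The helper names and the sample header values are made up for illustration and may not match what archive.today actually returns.

```python
import re

# Same patterns as in archive_url above.
SLUG_RE = re.compile(r"(https://archive\.ph/\w{4,5})")
REFRESH_RE = re.compile(r"url=(https?://[^\s]+)")

def extract_slug_url(location):
    """Trim a redirect target down to its short archive URL, if it has one."""
    match = SLUG_RE.match(location or "")
    return match.group(1) if match else location

def extract_wip_url(refresh_header):
    """Pull the target URL out of a Refresh header value, or return None."""
    match = REFRESH_RE.search(refresh_header or "")
    return match.group(1) if match else None

if __name__ == "__main__":
    # Hypothetical header values, for illustration only.
    print(extract_slug_url("https://archive.ph/AbCdE"))
    print(extract_wip_url("0;url=https://archive.ph/wip/AbCdE"))
    print(extract_wip_url(""))
```

Unlike the inline version, extract_slug_url tolerates a missing Location header instead of raising.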

Thursday, June 12, 2025

Metaphors, Models, and Theories

I've written about metaphors in the past and how they are useful because they allow us to transfer information across unrelated domains. Currently I am reading Julian Jaynes' The Origin of Consciousness in the Breakdown of the Bicameral Mind, and the topic is brought up and expanded upon very nicely there:

We are trying to understand consciousness, but what are we really trying to do when we try to understand anything? Like children trying to describe nonsense objects, so in trying to understand a thing we are trying to find a metaphor for that thing. Not just any metaphor, but one with something more familiar and easy to our attention. Understanding a thing is to arrive at a metaphor for that thing by substituting something more familiar to us. And the feeling of familiarity is the feeling of understanding.

Generations ago we would understand thunderstorms perhaps as the roaring and rumbling about in battle of superhuman gods. We would have reduced the racket that follows the streak of lightning to familiar battle sounds, for example. Similarly today, we reduce the storm to various supposed experiences with friction, sparks, vacuums, and the imagination of bulgeous banks of burly air smashing together to make the noise. None of these really exist as we picture them. Our images of these events of physics are as far from the actuality as fighting gods. Yet they act as the metaphor and they feel familiar and so we say we understand the thunderstorm.

But metaphors are not necessarily theories. Nor are they empirical explanations of how stuff works. If we were to ask "what is a sunrise?" we might say "Well, a sunrise is the dawn of new day!" And while that may be a useful metaphor, it isn't a proof or explanation of what, exactly, a sunrise is.

A sunrise is when the Sun comes back into the reference frame on our part of Earth, a consequence of the fact that the Earth rotates on its axis roughly every 24 hours while hurtling through space in orbit around the Sun. But let us ask another question: why are sunrises and sunsets often red? Or "why is the sky blue?" Of course, the answer is not as simple as merely "being the dawn of a new day."

As I wrote on my other blog about obvious and non-obvious things, and how explanations can be load-bearing or layered:

It seems like there are countless examples where a concept is assumed to be obvious. But when you sit down and really think about it for more than five seconds, you notice that one of the defining characteristics of the concept is that it isn't actually obvious at all.

"Why is the sky blue?"   Well, before you can answer that question, you have to understand light, colors, and their wavelengths (from longest to shortest): red, orange, yellow, green, and blue.

Blue light has the shortest wavelength. Additionally, it’s important to understand the behavior of light. Light can be reflected, bent, or scattered. It can also interact with the atmosphere.

The light we see on Earth is partly an optical phenomenon and partly a molecular one. Gases and particles in the atmosphere cause the light hitting Earth to scatter, which is why we see blue light most of the time.

“But what about sunsets?”  Sunsets are the result of the sun passing at a low angle, causing even more blue light to be scattered while debris and dust particles in the sky reflect more direct sunlight, making the sky appear more red than blue.

And that is the theory of why sunsets and sunrises are red, and why the sky is blue. I've written this out to point at how a theory is distinctly different from a metaphor, and also different from a model.

Julian Jaynes goes on to illustrate this, distinguishing and drawing relations between the concepts of metaphors, models, and theories:

So, in other areas of science, we say we understand an aspect of nature when we can say it is similar to some familiar theoretical model. The terms theory and model, incidentally, are sometimes used interchangeably. But really they should not be. A theory is a relationship of the model to the things the model is supposed to represent. The Bohr model of the atom is that of a proton surrounded by orbiting electrons. It is something like the pattern of the solar system, and that is indeed one of its metaphoric sources. Bohr’s theory was that all atoms were similar to his model. The theory, with the more recent discovery of new particles and complicated interatomic relationships, has turned out not to be true. But the model remains. A model is neither true nor false; only the theory of its similarity to what it represents.

A theory is thus a metaphor between a model and data. And understanding in science is the feeling of similarity between complicated data and a familiar model.

Thursday, May 29, 2025

Toast Notifications from PowerShell or C++

I’m currently working on a project that involves sending alerts and notifications to users on Windows 11 systems.

During development, I learned that--for local testing purposes--it’s possible to generate toast notifications using built-in PowerShell functionality. Specifically, the ToastNotificationManager and CreateToastNotifier APIs make it straightforward to display dead simple, native notifications without any external dependencies.

$body = 'Hello from PowerShell! Behold, a toast notification.'

$toastXml = [Windows.UI.Notifications.ToastNotificationManager, Windows.UI.Notifications, ContentType = WindowsRuntime]::GetTemplateContent([Windows.UI.Notifications.ToastTemplateType]::ToastText01)

$toastXml.SelectSingleNode('//text[@id="1"]').InnerText = $body

$appId = 'App'

$toast = [Windows.UI.Notifications.ToastNotification]::new($toastXml)
[Windows.UI.Notifications.ToastNotificationManager]::CreateToastNotifier($appId).Show($toast)

Of course, you can also set up toast notifications with C++ in a Win32 environment. But Windows will only show toast notifications for apps that have both a shortcut in the Start menu and an AppUserModelID property set on that shortcut!

To do this, we can also use a PowerShell script to:

1: Create a Windows shortcut .lnk file
2: Set the AppUserModelID property on that shortcut
3: Save it to disk

First, we set up our shortcut path, the target binary path, and define an AppUserModelID, then use PowerShell's Add-Type to compile a small C# helper that pulls in .NET's interop services and COM types.

So, we instantiate a new COM object using the correct class and interface GUIDs (which you can find on Pinvoke.net), and get a reference to it with var link = (IShellLinkW)new ShellLink();. Next, we cast it to IPropertyStore so we can set properties: var store = (IPropertyStore)link;. We then set up the required COM structs, PROPERTYKEY to identify the property and PROPVARIANT to hold the value, and pass them to store.SetValue(ref key, ref pv);. Once all the properties are set, we save the shortcut to disk by casting with (IPersistFile)link; and calling Save.


$ShortcutPath = "$env:APPDATA\Microsoft\Windows\Start Menu\Programs\ToastyApp.lnk"
$TargetPath = "C:\Path\To\App.exe"
$AppUserModelID = "App.ID"

Add-Type -TypeDefinition @"
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

[ComImport]
[Guid("00021401-0000-0000-C000-000000000046")]
class ShellLink {}

[ComImport]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
[Guid("000214F9-0000-0000-C000-000000000046")]
interface IShellLinkW {
    void GetPath([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszFile, int cchMaxPath, out IntPtr pfd, int fFlags);
    void GetIDList(out IntPtr ppidl);
    void SetIDList(IntPtr pidl);
    void GetDescription([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszName, int cchMaxName);
    void SetDescription([MarshalAs(UnmanagedType.LPWStr)] string pszName);
    void GetWorkingDirectory([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszDir, int cchMaxPath);
    void SetWorkingDirectory([MarshalAs(UnmanagedType.LPWStr)] string pszDir);
    void GetArguments([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszArgs, int cchMaxPath);
    void SetArguments([MarshalAs(UnmanagedType.LPWStr)] string pszArgs);
    void GetHotkey(out short pwHotkey);
    void SetHotkey(short wHotkey);
    void GetShowCmd(out int piShowCmd);
    void SetShowCmd(int iShowCmd);
    void GetIconLocation([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszIconPath, int cchIconPath, out int piIcon);
    void SetIconLocation([MarshalAs(UnmanagedType.LPWStr)] string pszIconPath, int iIcon);
    void SetRelativePath([MarshalAs(UnmanagedType.LPWStr)] string pszPathRel, int dwReserved);
    void Resolve(IntPtr hwnd, int fFlags);
    void SetPath([MarshalAs(UnmanagedType.LPWStr)] string pszFile);
}

[ComImport]
[Guid("886D8EEB-8CF2-4446-8D02-CDBA1DBDCF99")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IPropertyStore {
    void GetCount(out uint cProps);
    void GetAt(uint iProp, out PROPERTYKEY pkey);
    void GetValue(ref PROPERTYKEY key, out PROPVARIANT pv);
    void SetValue(ref PROPERTYKEY key, ref PROPVARIANT pv);
    void Commit();
}

[StructLayout(LayoutKind.Sequential, Pack = 4)]
struct PROPERTYKEY {
    public Guid fmtid;
    public uint pid;
}

[StructLayout(LayoutKind.Explicit)]
struct PROPVARIANT {
    [FieldOffset(0)]
    public ushort vt;
    [FieldOffset(8)]
    public IntPtr pszVal;

    public static PROPVARIANT FromString(string value) {
        var pv = new PROPVARIANT();
        pv.vt = 31; // VT_LPWSTR
        pv.pszVal = Marshal.StringToCoTaskMemUni(value);
        return pv;
    }
}

public static class ShellLinkHelper {
    static readonly Guid PKEY_AppUserModel_ID_fmtid = new Guid("9F4C2855-9F79-4B39-A8D0-E1D42DE1D5F3");
    const uint PKEY_AppUserModel_ID_pid = 5;

    public static void CreateShortcut(string shortcutPath, string exePath, string appId) {
        var link = (IShellLinkW)new ShellLink();
        link.SetPath(exePath);

        var store = (IPropertyStore)link;
        var key = new PROPERTYKEY() { fmtid = PKEY_AppUserModel_ID_fmtid, pid = PKEY_AppUserModel_ID_pid };
        var pv = PROPVARIANT.FromString(appId);

        store.SetValue(ref key, ref pv);
        store.Commit();

        var file = (IPersistFile)link;
        file.Save(shortcutPath, false);
    }
}
"@ -Language CSharp

# Call helper from PowerShell
[ShellLinkHelper]::CreateShortcut($ShortcutPath, $TargetPath, $AppUserModelID)
Write-Host "Shortcut created at $ShortcutPath with AppUserModelID = $AppUserModelID"
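
The FieldOffset(8) on pszVal in the PROPVARIANT struct above follows from the native layout: a 2-byte type tag followed by three reserved 2-byte words, then the value. As a quick cross-check, here is a sketch using Python's ctypes; PropVariantHeader is a name made up for this example and it only models the members the helper actually touches.

```python
import ctypes

class PropVariantHeader(ctypes.Structure):
    """Type tag, three reserved words, then the string pointer."""
    _fields_ = [
        ("vt", ctypes.c_ushort),
        ("wReserved1", ctypes.c_ushort),
        ("wReserved2", ctypes.c_ushort),
        ("wReserved3", ctypes.c_ushort),
        ("pszVal", ctypes.c_void_p),
    ]

VT_LPWSTR = 31  # same constant the C# helper assigns to vt

if __name__ == "__main__":
    print("vt offset:", PropVariantHeader.vt.offset)          # 0
    print("pszVal offset:", PropVariantHeader.pszVal.offset)  # 8
```

The four 16-bit fields occupy exactly 8 bytes, so the pointer lands at offset 8 on both 32-bit and 64-bit builds.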

With our shortcut and AppID properly set up, we can use the following C++ for a bare-bones toast notification test. Note that the AppUserModelID stored in the shortcut (App.ID in the script above) has to match the one the program passes to SetCurrentProcessExplicitAppUserModelID and CreateToastNotifier (Your.App.ID below), so pick one value and use it in both places. After compiling the C++ program, return to the .lnk shortcut we created with PowerShell at %APPDATA%\Microsoft\Windows\Start Menu\Programs\ToastyApp.lnk and make one small change: configure its properties to point to wherever the compiled C++ binary is.

To ensure the following C++ code compiles, open Visual Studio Community, click Project -> Properties -> Linker -> Input, and manually add "runtimeobject.lib" to Additional Dependencies.

Additionally, the code uses C++/WinRT, which needs the ISO C++17 standard. You can select it under C/C++ -> Language in the same Project Properties dialog.


#include <windows.h>
#include <wrl/client.h>
#include <wrl/wrappers/corewrappers.h>
#include <windows.ui.notifications.h>
#include <winrt/base.h>
#include <winrt/Windows.Data.Xml.Dom.h>
#include <winrt/Windows.UI.Notifications.h>
#include <string>
#include <iostream>
#include <shobjidl.h>
#pragma comment(lib, "Shell32.lib")
using namespace Microsoft::WRL;
using namespace Microsoft::WRL::Wrappers;
using namespace winrt;
using namespace winrt::Windows::Data::Xml::Dom;
using namespace winrt::Windows::UI::Notifications;

int main() {
    RoInitialize(RO_INIT_SINGLETHREADED);

    // Set AppUserModelID
    SetCurrentProcessExplicitAppUserModelID(L"Your.App.ID");

    // Create Toast Notifier
    auto toastNotifier = ToastNotificationManager::CreateToastNotifier(L"Your.App.ID");

    // Create XML content
    XmlDocument toastXml;
    try {
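        // Minimal ToastGeneric payload; LoadXml needs well-formed toast XML, not bare text
        std::wstring xmlString =
            L"<toast><visual><binding template=\"ToastGeneric\">"
            L"<text>Hello from C++!</text>"
            L"</binding></visual></toast>";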
        toastXml.LoadXml(xmlString);
    }
    catch (const hresult_error& ex) {
        std::wcerr << L"Failed to load XML: " << ex.message().c_str() << std::endl;
        RoUninitialize();
        return 1;
    }

    // Create Toast Notification
    auto toast = ToastNotification(toastXml);

    // Show Toast
    toastNotifier.Show(toast);


    RoUninitialize();
    return 0;
}

Source code for Toast Notifications on Windows 11.
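
Whichever language ends up showing the toast, the payload handed to LoadXml is just a small XML document. As a rough sketch of what a minimal one looks like, here is a Python helper that builds a ToastGeneric payload with the standard library. build_toast_xml is a hypothetical name, and this only produces the string; it does not display anything.

```python
import xml.etree.ElementTree as ET

def build_toast_xml(*lines, template="ToastGeneric"):
    """Return toast XML with one <text> element per line of content."""
    toast = ET.Element("toast")
    visual = ET.SubElement(toast, "visual")
    binding = ET.SubElement(visual, "binding", {"template": template})
    for line in lines:
        # ElementTree escapes characters such as & and < in the text for us.
        ET.SubElement(binding, "text").text = line
    return ET.tostring(toast, encoding="unicode")

if __name__ == "__main__":
    print(build_toast_xml("Hello from Python!", "Second line"))
```

Building the document with an XML library rather than string concatenation means user-supplied text cannot break the markup.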

Friday, May 16, 2025

A Security Trilemma

Playing around with writing malware proof-of-concepts, running red and blue team simulations in my computer lab against Windows Home edition, I feel sort of bad for Windows Home users.

Such users probably constitute the majority of Microsoft's userbase. And most security mitigations for that edition are not exactly effective against attackers.

Commercial-grade versions of Windows and commercial-grade security products are a different story in some circumstances. Commercial editions of Windows include a lot of nice mitigations and security features. But I think it's kind of an economic trilemma.

You have three potential strategies for security--and a few different potential tradeoffs. You can only optimize for two out of three.

  • If it's cheap and convenient, it won't be secure.
  • If it's cheap and secure, it won't be convenient.
  • If it's secure and convenient, it won't be cheap.

There are certainly exceptions to this model, though. For example, think about open-source, end-to-end encrypted messaging apps. Some of those feel like very unlikely tail distributions, where, to some extent, the solutions provide all of the above: they're cheap, secure, and convenient.

Monday, March 31, 2025

Latin1 vs UTF8

Latin1 (ISO-8859-1) was the early default character set for documents delivered via HTTP with MIME types beginning with text/.

Today, only around 1.1% of websites use the encoding, along with some older applications. However, it remains the most popular single-byte character encoding scheme in use.

A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages.

The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This fact is due to UTF8's added complexity, using multi-byte sequences for characters beyond the general ASCII range. This is also why you can't just throw any sequence of bytes at it and expect it to work. Parsing the UTF8 encoding scheme can be irritatingly problematic or even have security implications.

sequences = [
    b'\x41\x42\x43',  # valid in Latin1 and UTF8
    b'\xe2\x82\xac',  # valid in Latin1 and UTF8
    b'\x80\x81\x82',  # valid in Latin1, invalid in UTF8
    b'\x41\x42\x80',  # valid in Latin1, invalid in UTF8
]

def decode(sequences):
    for seq in sequences:
        print(f"Decoding: {seq}")

        try:
            latin1_decoded = seq.decode('latin1')
            print(f"  Decoded with Latin1: {latin1_decoded}")
        except Exception as e:
            print(f"  Error decoding with Latin1: {e}")

        try:
            utf8_decoded = seq.decode('utf-8')
            print(f"  Decoded with UTF8: {utf8_decoded}")
        except Exception as e:
            print(f"  Error decoding with UTF8: {e}")

decode(sequences)

$ python3 latin.py
Decoding: b'ABC'
  ...with Latin1: ABC
  ...with UTF8: ABC
Decoding: b'\xe2\x82\xac'
  ...with Latin1: â‚¬
  ...with UTF8: €
Decoding: b'\x80\x81\x82'
  ...with Latin1: €‚
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Decoding: b'AB\x80'
  ...with Latin1: AB€
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 2: invalid start byte
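
Two follow-ups to the output above, as a small sketch: Latin1 round-trips every possible byte value without loss, and UTF8 decoding can be made tolerant by substituting a replacement character for invalid bytes. decode_utf8_lenient is a helper name invented for this example.

```python
# Latin1: every byte value 0..255 maps to exactly one code point, so any
# byte string decodes and re-encodes without loss.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin1")
latin1_round_trips = as_text.encode("latin1") == all_bytes

# UTF8: characters outside the ASCII range take more than one byte.
euro_bytes = "\u20ac".encode("utf-8")  # the euro sign

def decode_utf8_lenient(data):
    """Decode as UTF8, substituting U+FFFD for bytes that are not valid."""
    return data.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(latin1_round_trips)
    print(euro_bytes)
    print(repr(decode_utf8_lenient(b"AB\x80")))
```

Whether silently replacing bad bytes is acceptable depends on the application; for anything security-sensitive, failing loudly is usually the safer default.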

Sunday, March 30, 2025

Too much efficiency makes everything worse

From "Overfitting and the strong version of Goodhart's law":
Increased efficiency can sometimes, counterintuitively, lead to worse outcomes. This is true almost everywhere. We will name this phenomenon the strong version of Goodhart's law. As one example, more efficient centralized tracking of student progress by standardized testing seems like such a good idea that well-intentioned laws mandate it. However, testing also incentivizes schools to focus more on teaching students to test well, and less on teaching broadly useful skills. As a result, it can cause overall educational outcomes to become worse. Similar examples abound, in politics, economics, health, science, and many other fields.

[...] This same counterintuitive relationship between efficiency and outcome occurs in machine learning, where it is called overfitting. [...] If we keep on optimizing the proxy objective, even after our goal stops improving, something more worrying happens. The goal often starts getting worse, even as our proxy objective continues to improve. Not just a little bit worse either — often the goal will diverge towards infinity.

This is an extremely general phenomenon in machine learning. It mostly doesn’t matter what our goal and proxy are, or what model architecture we use. If we are very efficient at optimizing a proxy, then we make the thing it is a proxy for grow worse.

Saturday, March 29, 2025

Bootstrapping and the Central Limit Theorem

If you've ever seen a data visualization, you've probably seen a Bell Curve or a normal distribution. But this emergent property of many data visualizations is actually a result of the law of large numbers and the central limit theorem.

The central limit theorem tells us that, for independent draws from a distribution with finite variance, the distribution of the normalized sample mean converges to a standard normal distribution as the sample size grows.

For example, let's say that we wish to chart the forty most popular science-fiction books on Goodreads by the number of pages they contain.

Our initial sample will look something like this:

page_counts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

If we want to plot our original sample of books, we could do something like:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

page_counts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

plt.figure(figsize=(7, 5))
sns.histplot(page_counts, bins=10, kde=False, color='#1f77b4', edgecolor='black')
plt.title('Histogram of Book Pages')
plt.xlabel('Page Count')
plt.ylabel('Frequency')
plt.savefig("histogram.jpg", dpi=300, bbox_inches='tight')
plt.close()

This will produce a chart like:

But if we want to bootstrap our dataset, we will have to resample it. Sampling with replacement, which we will use in this example, works like this: each resample draws as many values as the original sample has, and the same value may be drawn more than once. Let us say that we have a data set of only:

page_counts = np.array([
    216, 324, 385
])

The resampling process will randomly sample from this set. For example:

  • Resample #1: [216, 324, 324] -> mean = 288.0
  • Resample #2: [385, 385, 216] -> mean = 328.67
  • Resample #3: [324, 216, 216] -> mean = 252.0
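
The resampling step above can be sketched with nothing but the standard library. The resample helper and the fixed seed are choices made for this example, and the draws it prints will generally differ from the three listed above, since those are only illustrations.

```python
import random
import statistics

def resample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    tiny_sample = [216, 324, 385]
    for i in range(1, 4):
        draw = resample(tiny_sample, rng)
        print(f"Resample #{i}: {draw} -> mean = {statistics.fmean(draw):.2f}")
```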

If we repeat this process many times, the distribution of our resampled means will approximate a normal distribution, as predicted by the Central Limit Theorem. We can append the following Python code to bootstrap our dataset and graph it:

np.random.seed(42)
num_samples = 10000
bootstrap_means = np.random.choice(
    page_counts, (num_samples, len(page_counts)), replace=True
).mean(axis=1)

plt.figure(figsize=(7, 5))
sns.histplot(bootstrap_means, bins=30, kde=True, color='#ff7f0e', edgecolor='black')
plt.title('Bootstrapped Distribution of Page Counts')
plt.xlabel('Mean Page Count')
plt.ylabel('Frequency')
plt.savefig("bootstrapped_distribution.jpg", dpi=300, bbox_inches='tight')  
plt.close() 

This process is extremely useful for both modeling and hypothesis testing. If we want to make a claim about a population, such as the page counts of science fiction books, but we only have a small sample to work with, we can use bootstrapping to generate many simulated versions of the dataset and examine the distribution of the statistic we care about.

It's important to note that resampling isn't done to estimate the underlying distribution itself; our sample already serves as our model of the data. In this case, it represents page counts of science fiction books.

Rather, by resampling, we approximate the sampling distribution of a given statistic, such as the mean. This may allow us to make inferences about the broader dataset, even when the original sample size is small.

For example, we could additionally assess confidence intervals, which we'll discuss in a future post.
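
As a brief preview, one common approach is the percentile interval: sort the bootstrapped means and read off the values at the 2.5% and 97.5% positions. The sketch below uses only the standard library rather than NumPy so it runs on its own. The function names are made up for this example, and the percentile method is just one of several ways to build a bootstrap interval.

```python
import random
import statistics

page_counts = [
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256,
]

def bootstrap_means(data, num_samples=10000, seed=42):
    """Mean of each of num_samples resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(num_samples)]

def percentile_interval(values, confidence=0.95):
    """Return the (lower, upper) values bracketing the central `confidence` share."""
    ordered = sorted(values)
    tail = (1.0 - confidence) / 2.0
    lower_idx = int(tail * (len(ordered) - 1))
    upper_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lower_idx], ordered[upper_idx]

means = bootstrap_means(page_counts)
low, high = percentile_interval(means)

if __name__ == "__main__":
    print(f"sample mean: {statistics.fmean(page_counts):.1f}")
    print(f"95% percentile interval for the mean: ({low:.1f}, {high:.1f})")
```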
