Thursday, July 24, 2025

Using Python To Access archive.today, July 2025

It seems like a lot of the previous software wrappers to interact with archive.today (and archive.is, archive.ph, etc) via the command-line are either outdated or broken. So, here's a Python script to automatically submit links from the command-line to archive.today and retrieve their archived URLs.

From testing, it seems like it's best to keep the delay around 8 to 10 seconds. If you go too fast, Cloudflare will begin to yell at you and start throwing 429 errors.

As long as you've received a "WIP" URL from archive.today, it should be archived shortly after, though it may not appear immediately.

Add your own random user-agent. :)

'''

% python3 archiveToday.py --help
usage: archiveToday.py [-h] --urls URLS [--delay DELAY] [--output OUTPUT]
Batch archive URLs with archive.today
options:
  -h, --help       show this help message and exit
  --urls URLS      Path to file containing URLs (one per line)
  --delay DELAY    Delay between submissions in seconds
  --output OUTPUT  CSV file to save results
  
'''


import requests
import time
import os
import argparse
import csv
import re

def archive_url(session, url):
    try:
        print(f"Archiving: {url}")
        resp = session.get("https://archive.ph/submit/", params={"url": url}, allow_redirects=False)

        # If already archived, follow 302 redirect
        if resp.status_code == 302:
            archived_url = resp.headers.get("Location")
            # Match both 4 and 5 character archive slugs
            match = re.match(r"(https://archive\.ph/\w{4,5})", archived_url)
            if match:
                archived_url = match.group(1)
            print(f"Already archived: {archived_url}")
            return url, archived_url

        # If needs archiving, follow refresh to /wip/
        if resp.status_code == 200:
            refresh_header = resp.headers.get("refresh", "")
            match = re.search(r'url=(https?://[^\s]+)', refresh_header)
            if not match:
                print("WIP URL not found in refresh header.")
                return url, None

            wip_url = match.group(1)
            print(f"Archiving in progress (WIP): {wip_url}")

            final_resp = session.get(wip_url, allow_redirects=True)
            if final_resp.status_code == 200:
                archived_url = final_resp.url.replace("/wip/", "/")
                print(f"Archived: {archived_url}")
                return url, archived_url
            else:
                print(f"Failed to retrieve from WIP URL. Status: {final_resp.status_code}")
                return url, None

        print(f"Unexpected status code: {resp.status_code}")
        return url, None

    except Exception as e:
        print(f"Exception archiving {url}: {e}")
        return url, None

def read_urls_from_file(filename):
    if not os.path.exists(filename):
        print(f"File not found: {filename}")
        return []

    with open(filename, 'r') as f:
        urls = [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]
    return urls


def main():
    parser = argparse.ArgumentParser(description="Batch archive URLs with archive.today")
    parser.add_argument("--urls", required=True, help="Path to file containing URLs (one per line)")
    parser.add_argument("--delay", type=int, default=30, help="Delay between submissions in seconds")
    parser.add_argument("--output", default="archived_results.csv", help="CSV file to save results")
    args = parser.parse_args()
    urls = read_urls_from_file(args.urls)

    if not urls:
        print("No URLs to archive.")
        return

    session = requests.Session()
    session.headers.update({
        "User-Agent": ""
    })

    with open(args.output, "w", newline='', encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Original URL", "Archived URL"])

        for idx, url in enumerate(urls, 1):
            print(f"\n[{idx}/{len(urls)}]")
            original, archived = archive_url(session, url)
            writer.writerow([original, archived or ""])
            if idx < len(urls):
                print(f"Waiting {args.delay} seconds before next...")
                time.sleep(args.delay)

    print(f"\nFinished. Results saved to {args.output}")

if __name__ == "__main__":
    main()
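The script treats any failure the same way and simply moves on. If Cloudflare starts throwing 429s, one option is to wrap archive_url in a small retry helper with exponential backoff. A minimal sketch (archive_with_backoff is a hypothetical helper, not part of the script above):

```python
import time

def archive_with_backoff(archive_fn, session, url, max_retries=3, base_delay=10):
    """Call archive_fn(session, url) until it returns an archived URL.

    archive_fn is expected to return (original_url, archived_url_or_None),
    matching the archive_url() function above.
    """
    for attempt in range(max_retries):
        original, archived = archive_fn(session, url)
        if archived is not None:
            return original, archived
        delay = base_delay * (2 ** attempt)  # 10s, 20s, 40s, ...
        print(f"Attempt {attempt + 1} failed; sleeping {delay}s before retrying...")
        time.sleep(delay)
    return url, None
```

You would call archive_with_backoff(archive_url, session, url) in place of the direct archive_url(session, url) call in main().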

Thursday, June 12, 2025

Metaphors, Models, and Theories

I've written about metaphors in the past and how they are useful because they allow us to transfer information across unrelated domains. Currently I am reading Julian Jaynes' The Origin of Consciousness, and the topic is brought up and expanded upon very nicely here:

We are trying to understand consciousness, but what are we really trying to do when we try to understand anything? Like children trying to describe nonsense objects, so in trying to understand a thing we are trying to find a metaphor for that thing. Not just any metaphor, but one with something more familiar and easy to our attention. Understanding a thing is to arrive at a metaphor for that thing by substituting something more familiar to us. And the feeling of familiarity is the feeling of understanding.

Generations ago we would understand thunderstorms perhaps as the roaring and rumbling about in battle of superhuman gods. We would have reduced the racket that follows the streak of lightning to familiar battle sounds, for example. Similarly today, we reduce the storm to various supposed experiences with friction, sparks, vacuums, and the imagination of bulgeous banks of burly air smashing together to make the noise. None of these really exist as we picture them. Our images of these events of physics are as far from the actuality as fighting gods. Yet they act as the metaphor and they feel familiar and so we say we understand the thunderstorm.

But metaphors are not necessarily theories. Nor are they empirical explanations of how stuff works. If we were to ask "what is a sunrise?" we might say "Well, a sunrise is the dawn of new day!" And while that may be a useful metaphor, it isn't a proof or explanation of what, exactly, a sunrise is.

A sunrise is when the Sun comes back into the reference frame on our part of Earth—a fact created by the reality that we are hurtling through space, rotating on Earth's axis every 24 hours as we orbit the Sun. But let us ask another question: why are sunrises and sunsets often red? Or "why is the sky blue?" Of course, the answer is not as simple as merely "being the dawn of a new day."

As I wrote on my other blog about obvious and non-obvious things, and how explanations can be load-bearing or layered:

It seems like there are countless examples where a concept is assumed to be obvious. But when you sit down and really think about it for more than five seconds, you notice that one of the defining characteristics of the concept is that it isn't actually obvious at all.

"Why is the sky blue?"   Well, before you can answer that question, you have to understand light, colors, and their wavelengths (from longest to shortest): red, orange, yellow, green, and blue.

Blue light has the shortest wavelength. Additionally, it’s important to understand the behavior of light. Light can be reflected, bent, or scattered. It can also interact with the atmosphere.

The light we see on Earth is partly an optical phenomenon and partly a molecular one. Gases and particles in the atmosphere cause the light hitting Earth to scatter, which is why we see blue light most of the time.

“But what about sunsets?”  Sunsets are the result of the sun passing at a low angle, causing even more blue light to be scattered while debris and dust particles in the sky reflect more direct sunlight, making the sky appear more red than blue.

And that is the theory of why sunsets and sunrises are red -- and why the sky is blue. I've written this out to point out how a theory is distinct from a metaphor. But it is also different from a model.
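For what it's worth, the scattering part of that theory can even be made quantitative. Rayleigh scattering intensity scales with the inverse fourth power of wavelength, so a quick back-of-the-envelope comparison (the wavelength values below are approximations) shows why blue dominates the daytime sky:

```python
# Rayleigh scattering intensity scales as 1 / wavelength^4, so shorter
# (bluer) wavelengths scatter far more strongly than longer (redder) ones.
blue_nm = 450  # approximate wavelength of blue light, in nanometers
red_nm = 700   # approximate wavelength of red light, in nanometers

ratio = (red_nm / blue_nm) ** 4
print(f"Blue light scatters roughly {ratio:.1f}x more strongly than red light")
```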

Julian Jaynes goes on to illustrate this, distinguishing and drawing relations between the concepts of metaphors, models, and theories:

So, in other areas of science, we say we understand an aspect of nature when we can say it is similar to some familiar theoretical model. The terms theory and model, incidentally, are sometimes used interchangeably. But really they should not be. A theory is a relationship of the model to the things the model is supposed to represent. The Bohr model of the atom is that of a proton surrounded by orbiting electrons. It is something like the pattern of the solar system, and that is indeed one of its metaphoric sources. Bohr’s theory was that all atoms were similar to his model. The theory, with the more recent discovery of new particles and complicated interatomic relationships, has turned out not to be true. But the model remains. A model is neither true nor false; only the theory of its similarity to what it represents.

A theory is thus a metaphor between a model and data. And understanding in science is the feeling of similarity between complicated data and a familiar model.

Thursday, May 29, 2025

Toast Notifications from PowerShell or C++

I’m currently working on a project that involves sending alerts and notifications to users on Windows 11 systems.

During development, I learned that--for local testing purposes--it’s possible to generate toast notifications using built-in PowerShell functionality. Specifically, the ToastNotificationManager and CreateToastNotifier APIs make it straightforward to display dead simple, native notifications without any external dependencies.

$body = 'Hello from PowerShell! Behold, a toast notification.'

$toastXml = [Windows.UI.Notifications.ToastNotificationManager, Windows.UI.Notifications, ContentType = WindowsRuntime]::GetTemplateContent([Windows.UI.Notifications.ToastTemplateType]::ToastText01)

$toastXml.SelectSingleNode('//text[@id="1"]').InnerText = $body

$appId = 'App'

$toast = [Windows.UI.Notifications.ToastNotification]::new($toastXml)
[Windows.UI.Notifications.ToastNotificationManager]::CreateToastNotifier($appId).Show($toast)

Of course, you can also set up toast notifications with C++ in a Win32 shell environment, too. But Windows will only show toast notifications for apps that have both a shortcut in the Start Menu and an AppUserModelID property set on that shortcut!

To do this, we can also use a PowerShell script to:

1: Create a Windows shortcut .lnk file
2: Set the AppUserModelID property on that shortcut
3: Save it to disk

First, we set up our shortcut path, the target binary path, and an AppUserModelID. Then we use PowerShell's built-in .NET support (via Add-Type) to define the interop services and COM interfaces we need.

So, we instantiate a new COM object using the correct interface GUID (which you can find on Pinvoke.net), and create a pointer to it with var link = (IShellLinkW)new ShellLink();. Next, we cast it to IPropertyStore so we can set properties: var store = (IPropertyStore)link; followed by store.SetValue(ref key, ref pv);. Then we set up the required COM structs — PROPERTYKEY to identify the property, and PROPVARIANT to hold the value. And once all the properties are set, we save the shortcut to disk via (IPersistFile)link;.


$ShortcutPath = "$env:APPDATA\Microsoft\Windows\Start Menu\Programs\ToastyApp.lnk"
$TargetPath = "C:\Path\To\App.exe"
$AppUserModelID = "App.ID"

Add-Type -TypeDefinition @"
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;

[ComImport]
[Guid("00021401-0000-0000-C000-000000000046")]
class ShellLink {}

[ComImport]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
[Guid("000214F9-0000-0000-C000-000000000046")]
interface IShellLinkW {
    void GetPath([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszFile, int cchMaxPath, out IntPtr pfd, int fFlags);
    void GetIDList(out IntPtr ppidl);
    void SetIDList(IntPtr pidl);
    void GetDescription([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszName, int cchMaxName);
    void SetDescription([MarshalAs(UnmanagedType.LPWStr)] string pszName);
    void GetWorkingDirectory([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszDir, int cchMaxPath);
    void SetWorkingDirectory([MarshalAs(UnmanagedType.LPWStr)] string pszDir);
    void GetArguments([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszArgs, int cchMaxPath);
    void SetArguments([MarshalAs(UnmanagedType.LPWStr)] string pszArgs);
    void GetHotkey(out short pwHotkey);
    void SetHotkey(short wHotkey);
    void GetShowCmd(out int piShowCmd);
    void SetShowCmd(int iShowCmd);
    void GetIconLocation([Out, MarshalAs(UnmanagedType.LPWStr)] System.Text.StringBuilder pszIconPath, int cchIconPath, out int piIcon);
    void SetIconLocation([MarshalAs(UnmanagedType.LPWStr)] string pszIconPath, int iIcon);
    void SetRelativePath([MarshalAs(UnmanagedType.LPWStr)] string pszPathRel, int dwReserved);
    void Resolve(IntPtr hwnd, int fFlags);
    void SetPath([MarshalAs(UnmanagedType.LPWStr)] string pszFile);
}

[ComImport]
[Guid("886D8EEB-8CF2-4446-8D02-CDBA1DBDCF99")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IPropertyStore {
    void GetCount(out uint cProps);
    void GetAt(uint iProp, out PROPERTYKEY pkey);
    void GetValue(ref PROPERTYKEY key, out PROPVARIANT pv);
    void SetValue(ref PROPERTYKEY key, ref PROPVARIANT pv);
    void Commit();
}

[StructLayout(LayoutKind.Sequential, Pack = 4)]
struct PROPERTYKEY {
    public Guid fmtid;
    public uint pid;
}

[StructLayout(LayoutKind.Explicit)]
struct PROPVARIANT {
    [FieldOffset(0)]
    public ushort vt;
    [FieldOffset(8)]
    public IntPtr pszVal;

    public static PROPVARIANT FromString(string value) {
        var pv = new PROPVARIANT();
        pv.vt = 31; // VT_LPWSTR
        pv.pszVal = Marshal.StringToCoTaskMemUni(value);
        return pv;
    }
}

public static class ShellLinkHelper {
    static readonly Guid PKEY_AppUserModel_ID_fmtid = new Guid("9F4C2855-9F79-4B39-A8D0-E1D42DE1D5F3");
    const uint PKEY_AppUserModel_ID_pid = 5;

    public static void CreateShortcut(string shortcutPath, string exePath, string appId) {
        var link = (IShellLinkW)new ShellLink();
        link.SetPath(exePath);

        var store = (IPropertyStore)link;
        var key = new PROPERTYKEY() { fmtid = PKEY_AppUserModel_ID_fmtid, pid = PKEY_AppUserModel_ID_pid };
        var pv = PROPVARIANT.FromString(appId);

        store.SetValue(ref key, ref pv);
        store.Commit();

        var file = (IPersistFile)link;
        file.Save(shortcutPath, false);
    }
}
"@ -Language CSharp

# Call helper from PowerShell
[ShellLinkHelper]::CreateShortcut($ShortcutPath, $TargetPath, $AppUserModelID)
Write-Host "Shortcut created at $ShortcutPath with AppUserModelID = $AppUserModelID"

With our shortcut and AppID properly set up, we can use the following C++ for a bare-bones toast notification test. After compiling the C++ program below, we will return to the .lnk shortcut we created with PowerShell at %APPDATA%\Microsoft\Windows\Start Menu\Programs\ToastyApp.lnk and make one small change--updating its target to point to wherever our compiled C++ binary is.

To ensure the following C++ code compiles, you will need to open Visual Studio Community, click Project -> Properties -> Linker -> Input, and manually add "runtimeobject.lib" to your Additional Dependencies.

Additionally, this code requires the ISO C++17 standard to build. You may configure your project to use that standard under C/C++ -> Language in the same Project Properties dialog mentioned above.


#include <windows.h>
#include <wrl/client.h>
#include <wrl/wrappers/corewrappers.h>
#include <windows.ui.notifications.h>
#include <winrt/base.h>
#include <winrt/Windows.Data.Xml.Dom.h>
#include <winrt/Windows.UI.Notifications.h>
#include <string>
#include <iostream>
#include <shobjidl.h>
#pragma comment(lib, "Shell32.lib")
using namespace Microsoft::WRL;
using namespace Microsoft::WRL::Wrappers;
using namespace winrt;
using namespace winrt::Windows::Data::Xml::Dom;
using namespace winrt::Windows::UI::Notifications;

int main() {
    RoInitialize(RO_INIT_SINGLETHREADED);

    // Set AppUserModelID
    SetCurrentProcessExplicitAppUserModelID(L"Your.App.ID");

    // Create Toast Notifier
    auto toastNotifier = ToastNotificationManager::CreateToastNotifier(L"Your.App.ID");

    // Create XML content
    XmlDocument toastXml;
    try {
        // Toast payloads must be XML, not plain text
        std::wstring xmlString =
            L"<toast><visual><binding template=\"ToastGeneric\">"
            L"<text>Hello from C++!</text>"
            L"</binding></visual></toast>";
        toastXml.LoadXml(xmlString);
    }
    catch (const hresult_error& ex) {
        std::wcerr << L"Failed to load XML: " << ex.message().c_str() << std::endl;
        RoUninitialize();
        return 1;
    }

    // Create Toast Notification
    auto toast = ToastNotification(toastXml);

    // Show Toast
    toastNotifier.Show(toast);


    RoUninitialize();
    return 0;
}

Source code for Toast Notifications on Windows 11.

Friday, May 16, 2025

A Security Trilemma

Playing around with writing malware proof-of-concepts, running red and blue team simulations in my computer lab against Windows Home edition, I feel sort of bad for Windows Home users.

Such users probably constitute the majority of Microsoft's userbase. And most security mitigations for that edition are not exactly effective against attackers.

Commercial-grade versions of Windows and commercial-grade security products are a different story in some circumstances. Commercial editions of Windows include a lot of nice mitigations and security features. But I think it's kind of an economic trilemma.

You have three properties you'd like a solution to have: cheap, secure, and convenient. But you can only ever optimize for two out of three.

  • If it's cheap and convenient, it won't be secure.
  • If it's cheap and secure, it won't be convenient.
  • If it's secure and convenient, it won't be cheap.

There are certainly exceptions to this model, though. For example, think about open-source, end-to-end encrypted messaging apps. Some of those feel like rare tail cases where, to some extent, a solution provides all of the above: it's cheap, secure, and convenient.

Monday, March 31, 2025

Latin1 vs UTF8

Latin1 (ISO-8859-1) was the early default character set for encoding documents delivered via HTTP with MIME types beginning with text/.

Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it remains the most popular single-byte character encoding scheme in use.

A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages.

The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This is due to UTF8's use of multi-byte sequences for characters beyond the ASCII range, and it's also why you can't just throw any sequence of bytes at it and expect it to work. Parsing the UTF8 encoding scheme can be irritatingly problematic or even have security implications.

sequences = [
    b'\x41\x42\x43',  # valid in Latin1 and UTF8
    b'\xe2\x82\xac',  # valid in Latin1 and UTF8, but decodes differently in each
    b'\x80\x81\x82',  # valid in Latin1, invalid in UTF8 
    b'\x41\x42\x80',  # valid in Latin1, invalid in UTF8
]

def decode(sequences):
    for seq in sequences:
        print(f"Decoding: {seq}")

        try:
            latin1_decoded = seq.decode('latin1')
            print(f"  Decoded with Latin1: {latin1_decoded}")
        except Exception as e:
            print(f"  Error decoding with Latin1: {e}")

        try:
            utf8_decoded = seq.decode('utf-8')
            print(f"  Decoded with UTF8: {utf8_decoded}")
        except Exception as e:
            print(f"  Error decoding with UTF8: {e}")

decode(sequences)

$ python3 latin.py
Decoding: b'ABC'
  Decoded with Latin1: ABC
  Decoded with UTF8: ABC
Decoding: b'\xe2\x82\xac'
  Decoded with Latin1: â‚¬
  Decoded with UTF8: €
Decoding: b'\x80\x81\x82'
  Decoded with Latin1: €‚
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Decoding: b'AB\x80'
  Decoded with Latin1: AB€
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 2: invalid start byte
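Since not every byte sequence is valid UTF8, any code that accepts arbitrary bytes needs a policy for invalid input. One common pattern (a sketch using Python's built-in error handlers) is to attempt a strict decode and fall back to replacement characters:

```python
def decode_lenient(data: bytes) -> str:
    """Try strict UTF-8 first; fall back to replacing invalid bytes."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Invalid bytes become U+FFFD replacement characters
        return data.decode('utf-8', errors='replace')

print(decode_lenient(b'\x41\x42\x43'))  # ABC
print(decode_lenient(b'\x41\x42\x80'))  # AB followed by a replacement character
```

This keeps the pipeline from crashing on malformed input while still making the damage visible in the output.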

Sunday, March 30, 2025

Too much efficiency makes everything worse

From "Overfitting and the strong version of Goodhart's law":
Increased efficiency can sometimes, counterintuitively, lead to worse outcomes. This is true almost everywhere. We will name this phenomenon the strong version of Goodhart's law. As one example, more efficient centralized tracking of student progress by standardized testing seems like such a good idea that well-intentioned laws mandate it. However, testing also incentivizes schools to focus more on teaching students to test well, and less on teaching broadly useful skills. As a result, it can cause overall educational outcomes to become worse. Similar examples abound, in politics, economics, health, science, and many other fields.

[...] This same counterintuitive relationship between efficiency and outcome occurs in machine learning, where it is called overfitting. [...] If we keep on optimizing the proxy objective, even after our goal stops improving, something more worrying happens. The goal often starts getting worse, even as our proxy objective continues to improve. Not just a little bit worse either — often the goal will diverge towards infinity.

This is an extremely general phenomenon in machine learning. It mostly doesn’t matter what our goal and proxy are, or what model architecture we use. If we are very efficient at optimizing a proxy, then we make the thing it is a proxy for grow worse.

Saturday, March 29, 2025

Bootstrapping and the Central Limit Theorem

If you've ever seen a data visualization, you've probably seen a Bell Curve or a normal distribution. But this emergent property of many data visualizations is actually a result of the law of large numbers and the central limit theorem.

The central limit theorem tells us that the distribution of the normalized sample mean converges to a standard normal distribution as the sample size grows.
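As a quick illustration of the theorem (a sketch using NumPy), we can draw many samples from a decidedly non-normal distribution, such as an exponential, and watch their means cluster into a bell shape:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50               # size of each individual sample
num_samples = 10_000

# Population: an exponential distribution (heavily right-skewed, mean = 1.0)
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

# The CLT predicts the means cluster around the population mean (1.0),
# with spread close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std():.3f}")
```

Even though the underlying data is strongly skewed, a histogram of these 10,000 means is already close to normal.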

For example, let's say that we wish to chart forty of the most popular science-fiction books on Goodreads by the number of pages they contain.

Our initial sample will look something like this:

page_counts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258, 
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

If we want to plot our original sample of books, we could do something like:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

page_counts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

plt.figure(figsize=(7, 5))
sns.histplot(page_counts, bins=10, kde=False, color='#1f77b4', edgecolor='black')
plt.title('Histogram of Book Pages')
plt.xlabel('Page Count')
plt.ylabel('Frequency')
plt.savefig("histogram.jpg", dpi=300, bbox_inches='tight')
plt.close()

This will produce a chart like:

But if we want to normalize and bootstrap our dataset, we will have to resample it. Replacement sampling, which we will use in this example, works like this. Let us say that we have a data set of only:

page_counts = np.array([
    216, 324, 385
])

The resampling process will randomly sample from this set. For example:

  • Resample #1: [216, 324, 324] -> mean = 288.0
  • Resample #2: [385, 385, 216] -> mean = 328.67
  • Resample #3: [324, 216, 216] -> mean = 252.0

If we repeat this process many times, the distribution of our resampled means will approximate a normal distribution, as predicted by the Central Limit Theorem. We can append the following Python code to bootstrap our dataset and graph it:

np.random.seed(42)
num_samples = 10000
bootstrap_means = np.random.choice(
    page_counts, (num_samples, len(page_counts)), replace=True
).mean(axis=1)

plt.figure(figsize=(7, 5))
sns.histplot(bootstrap_means, bins=30, kde=True, color='#ff7f0e', edgecolor='black')
plt.title('Bootstrapped Distribution of Page Counts')
plt.xlabel('Mean Page Count')
plt.ylabel('Frequency')
plt.savefig("bootstrapped_distribution.jpg", dpi=300, bbox_inches='tight')  
plt.close() 

This process is extremely useful for both modeling and hypothesis testing. If we want to make a claim about a dataset, such as the page counts of science fiction books—but we only have a small sample of science fiction books to work with—we can use bootstrapping to simulate the dataset many times over and study the distribution of the statistic we want to inquire about.

It's important to note that resampling isn't done to estimate the distribution—our sample itself already represents a data model. In this case, it represents page counts of science fiction books.

Rather, by resampling, we approximate the sampling distribution of a given statistic, such as the mean. This may allow us to make inferences about the broader dataset, even when the original sample size is small.

For example, we could additionally assess confidence intervals, which we'll discuss in a future post.

Wednesday, March 26, 2025

Unlearning, or Proof by Contradiction

Sometimes, we have to unlearn the things we initially learned. And I don't mean this in the sense of having been deliberately deceived. Rather, I mean that to some extent, there are actually many situations in life that involve necessary lies—or believing things that are wrong for perfectly rational reasons. Sometimes it is only after we have consumed and digested such a falsehood that we can see the truth at all. Really, this form of learning is not unlike some parts of math.

Consider a mathematical proof in which we begin by assuming that something is one way. But by the end of the proof, we may realize, through contradiction, that it's actually another way.

Let us take the number 2 and generously hypothesize that its square root is actually rational. If this assumption were true, we could write the square root of 2 as a fraction $\frac{p}{q}$ in its lowest form.

Squaring both sides of $\sqrt{2} = \frac{p}{q}$ gives us:

\[ 2 = \frac{p^2}{q^2} \]

If we multiply both sides by $q^2$, we can get rid of the denominator, leaving us with $ p^2 = 2q^2 $. Since $p^2$ is twice an integer, $p^2$ must be even. And because squares of odd numbers are odd, $p$ itself must be even.

Since $p$ is even, we can write $p = 2r$, where $r$ is an integer. Substituting $2r$ for $p$ and simplifying both sides:

\[ (2r)^2 = 2q^2 \] \[ 4r^2 = 2q^2 \]

Uh oh. Now we've hit a snag. If we divide both sides by two, we get:

\[ 2r^2 = q^2 \]

This tells us that $q^2$ is twice an integer, so $q^2$ is even as well. By the same reasoning as before, $q$ must also be even.

How can this be? Remember, our initial hypothesis that the square root of 2 was rational rested on the assumption that $\frac{p}{q}$ was in its lowest form. But now here we see that $2r^2$ is equal to $q^2$. In other words, both $p$ and $q$ are divisible by 2—which contradicts our original claim that $\frac{p}{q}$ was in lowest terms.

This means $p$ and $q$ still share a common factor of 2, so $\frac{p}{q}$ was not in its lowest form after all. Our initial assumption must be false: proof by contradiction that the square root of two is not rational.
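As a sanity check on the conclusion, we can brute-force it with exact rational arithmetic (a quick sketch; the bound of 200 is arbitrary): no fraction $\frac{p}{q}$ with small numerator and denominator squares exactly to 2.

```python
from fractions import Fraction

# Exhaustively check every fraction p/q with 1 <= p, q <= 200.
# Fraction arithmetic is exact, so equality with 2 is not subject
# to floating-point error. The proof says this list must be empty.
hits = [
    (p, q)
    for q in range(1, 201)
    for p in range(1, 201)
    if Fraction(p, q) ** 2 == 2
]
print(hits)  # []
```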

Sunday, March 16, 2025

Patterns and the Stock Market

On the random walk hypothesis and post-hoc explanations for describing natural processes, from "Patterns and the Stock Market":

While it's certainly entertaining to spin post-hoc explanations of market activity, it's also utterly futile. The market, after all, is a classic example of a "random walk," since the past movement of any particular stock cannot be used to predict its future movement. This inherent randomness was first proposed by the economist Eugene Fama, in the early 1960's. Fama looked at decades of stock market data in order to prove that no amount of rational analysis or knowledge (unless it was illicit insider information) could help you figure out what would happen next. All of the esoteric tools and elaborate theories used by investors to make sense of the market were pure nonsense. Wall Street was like a slot machine.

Alas, the human mind can't resist the allure of explanations, even if they make no sense. We're so eager to find correlations and causation that, when confronted with an inherently stochastic process - like the DJIA, or a slot machine - we invent factors to fixate on. The end result is a blinkered sort of overconfidence, in which we're convinced we've solved a system that has no solution.
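A random walk is easy to simulate. In the sketch below (purely illustrative), each step is an independent coin flip, so the past path carries no information whatsoever about the next move:

```python
import random

random.seed(1)

# A simple random walk: each "day" the price moves up or down by 1.
# No analysis of the path so far improves a guess about the next step.
price = 100
path = [price]
for _ in range(250):  # roughly one trading year of steps
    price += random.choice([-1, 1])
    path.append(price)

print(f"start: {path[0]}, end: {path[-1]}, min: {min(path)}, max: {max(path)}")
```

Run it a few times with different seeds and the "trends" you see are exactly the kind of patterns we can't resist explaining after the fact.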

"Is an economic recession a divergence from the market trend that eventually reverses over time? Or is it more analogous to a random walk?" asked the student.

The master struck him on the head with a walking stick. "Economic recessions are primarily qualitative; attempting to measure them is meaningless."

Quantitative fundamentals play a role in shaping market dynamics, but in a Bayesian spirit, so too does information and discourse circulating within those markets. "It's priced in," is an expression people use to describe the way that the market's dynamic is not just a sum of monetary fundamentals but also a sum of qualitative sentiment, which is far more difficult to quantify.

Sure, you could try to measure a recession with GDP output. But for a better attempt at understanding one, you may have greater success communicating with actual market participants—though even then, perspectives will be wildly subjective.

Recessions aside—one could also attempt, at any time, to use technical analysis to predict short-term stock prices. But you may also simply end up straining at gnats.

Furthermore, if someone claims to know the true reasons why the market went up or down, there's a significant possibility they are wrong—either because they don't know any better, or because they're deliberately lying.

If they actually knew—if they actually possessed that knowledge—they would have used it to make money.

Wednesday, March 12, 2025

yt-dlp Archiving, Improved

One annoying thing about YouTube is that, by default, some videos are now served in .webm format or use VP9 encoding. However, I prefer storing media in more widely supported codecs and formats, like .mp4, which runs on far more devices than .webm. And sometimes I prefer AVC1 MP4 encoding because it just works out of the box on OSX with QuickTime, as QuickTime doesn't natively support VP9. AVC1-encoded MP4s are still the most portable video format.

AVC1 ... is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019.[1]

yt-dlp, the command-line audio/video downloader for YouTube videos, is a great project. But between YouTube supporting various codecs and compatibility issues with various video players, this can make getting what you want out of yt-dlp a bit more challenging:

$ yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" https://www.youtube.com/watch?v=dQw4w9WgXcQ

For example, the format string above does not actually extract the best possible formats for every YouTube URL on my OSX machine.

This usually happens when a YouTube URL tries to serve a .webm file. If you use the above format flag to extract the best-quality MP4-compatible audio and video from a list of YouTube URLs, and one of those URLs serves a .webm file, yt-dlp won't error out, abort, or skip the URL. Instead, yt-dlp will produce improperly formatted output -- .mp4 files that cannot be opened or played.
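As an aside, yt-dlp's format selection filters can also target the codec field directly, which may sidestep the .webm problem in some cases. This is a sketch, not a guarantee that it selects the same streams as the script below:

```shell
# Prefer AVC1-encoded video plus M4A audio; fall back to any MP4.
# The [vcodec^=avc1] filter matches codec strings beginning with "avc1".
yt-dlp -f "bestvideo[vcodec^=avc1]+bestaudio[ext=m4a]/best[ext=mp4]" \
  --merge-output-format mp4 "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
```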

However, we can fix this problem without even bothering yt-dlp with a pull request, because yt-dlp gives us the capability to dump all of the audio and video formats available for any video with the -F flag:

$ yt-dlp -F "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
[youtube] Extracting URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
[youtube] dQw4w9WgXcQ: Downloading webpage
[youtube] dQw4w9WgXcQ: Downloading tv client config
[youtube] dQw4w9WgXcQ: Downloading player b21600d5
[youtube] dQw4w9WgXcQ: Downloading tv player API JSON
[youtube] dQw4w9WgXcQ: Downloading ios player API JSON
[youtube] dQw4w9WgXcQ: Downloading m3u8 information
[info] Available formats for dQw4w9WgXcQ:
ID  EXT   RESOLUTION FPS CH │   FILESIZE   TBR PROTO │ VCODEC          VBR ACODEC      ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sb3 mhtml 48x27        0    │                  mhtml │ images                                  storyboard
sb2 mhtml 80x45        1    │                  mhtml │ images                                  storyboard
sb1 mhtml 160x90       1    │                  mhtml │ images                                  storyboard
sb0 mhtml 320x180      1    │                  mhtml │ images                                  storyboard
233 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
234 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
249 webm  audio only      2 │    1.18MiB   46k https │ audio only          opus        46k 48k [en] low, webm_dash
250 webm  audio only      2 │    1.55MiB   61k https │ audio only          opus        61k 48k [en] low, webm_dash
140 m4a   audio only      2 │    3.27MiB  130k https │ audio only          mp4a.40.2  130k 44k [en] medium, m4a_dash
251 webm  audio only      2 │    3.28MiB  130k https │ audio only          opus       130k 48k [en] medium, webm_dash
602 mp4   256x144     13    │ ~  2.04MiB   81k m3u8  │ vp09.00.10.08   81k video only
269 mp4   256x144     25    │ ~  3.95MiB  156k m3u8  │ avc1.4D400C    156k video only
160 mp4   256x144     25    │    1.78MiB   70k https │ avc1.4d400c     70k video only          144p, mp4_dash
...
270 mp4   1920x1080   25    │ ~123.87MiB 4902k m3u8  │ avc1.640028   4902k video only
//snipped

It turns out it's actually much better to first list the formats this way, use grep and awk to extract the best available codecs for an MP4 file, and then run yt-dlp with the specific format IDs for each video URL. Here's a Bash script to automate this process, which makes downloading stuff from YouTube easier, in my opinion:

#!/bin/bash

if [ -z "$1" ]; then
    echo "Usage: $0 <youtube_url>"
    exit 1
fi

url="$1"

processVideo() {
    local videoUrl="$1"

    echo "Fetching available formats for video: $videoUrl"
    formats=$(yt-dlp -F "$videoUrl")
    if [ $? -ne 0 ]; then
        echo "Error: Failed to fetch formats for $videoUrl. Is yt-dlp installed and the URL valid?"
        return
    fi

    # Primary pass: keep only MP4 rows that use the AVC1 (H.264) codec,
    # then pick the format ID with the highest bitrate.
    videoFormat=$(echo "$formats" | grep 'mp4' | grep -E 'avc1' | \
    awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}' | \
    sort -k2 -nr | awk '{print $1}' | head -1)

    if [ -z "$videoFormat" ]; then
        echo "No AVC1 video format found, falling back to any MP4 format."
        # -w keeps 'mp4' from also matching audio codec strings like "mp4a.40.2"
        videoFormat=$(echo "$formats" | grep -w 'mp4' | \
        awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}' | \
        sort -k2 -nr | awk '{print $1}' | head -1)
    fi

    # Audio pass: highest-bitrate m4a stream.
    audioFormat=$(echo "$formats" | grep 'm4a' | \
    awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}' | \
    sort -k2 -nr | awk '{print $1}' | head -1)

    if [ -z "$videoFormat" ] || [ -z "$audioFormat" ]; then
        echo "Error: No compatible MP4 video or M4A audio formats found for $videoUrl!"
        return
    fi

    echo "Selected video format: $videoFormat [MP4 : AVC1 preferred]"
    echo "Selected audio format: $audioFormat [M4A : highest quality]"

    echo "Downloading video with yt-dlp..."
    yt-dlp --restrict-filenames \
    -f "${videoFormat}+${audioFormat}" \
    --merge-output-format mp4 "$videoUrl"

    if [ $? -ne 0 ]; then
        echo "Error: Failed to download video. Check the format IDs and URL."
    fi
}

isPlaylist() {
    if echo "$url" | grep -q "list="; then
        return 0 
    else
        return 1 
    fi
}

if isPlaylist; then
    echo "Processing playlist..."
    videoUrls=$(yt-dlp --flat-playlist --get-url "$url")

    if [ -z "$videoUrls" ]; then
        echo "Error: No videos found in the playlist. Is the URL correct?"
        exit 1
    fi

    for videoUrl in $videoUrls; do
        echo "Processing video: $videoUrl"
        processVideo "$videoUrl"
    done
else
    echo "Processing single video..."
    processVideo "$url"
fi

We grab the entire "available formats" table as input, storing it as plaintext in the $formats variable. We then grep $formats for 'mp4' listings, and grep again to filter further for listings that use the AVC1 (H.264) codec. If no AVC1 format is found, we fall back to whatever is MP4-compatible. After filtering twice with grep, our list looks something like this:

269 mp4   256x144     25    | ~  3.95MiB  156k m3u8  | avc1.4D400C    156k video only
160 mp4   256x144     25    |    1.78MiB   70k https | avc1.4d400c     70k video only          144p, mp4_dash
229 mp4   426x240     25    | ~  5.73MiB  227k m3u8  | avc1.4D4015    227k video only
133 mp4   426x240     25    |    2.88MiB  114k https | avc1.4d4015    114k video only          240p, mp4_dash
230 mp4   640x360     25    | ~ 12.09MiB  478k m3u8  | avc1.4D401E    478k video only
134 mp4   640x360     25    |    5.42MiB  214k https | avc1.4d401e    214k video only          360p, mp4_dash
18  mp4   640x360     25  2 | ≈  8.68MiB  343k https | avc1.42001E         mp4a.40.2       44k [en] 360p
231 mp4   854x480     25    | ~ 16.69MiB  660k m3u8  | avc1.4D401E    660k video only
135 mp4   854x480     25    |    8.28MiB  328k https | avc1.4d401e    328k video only          480p, mp4_dash
232 mp4   1280x720    25    | ~ 28.59MiB 1131k m3u8  | avc1.4D401F   1131k video only
136 mp4   1280x720    25    |   16.01MiB  633k https | avc1.4d401f    633k video only          720p, mp4_dash
270 mp4   1920x1080   25    | ~123.87MiB 4902k m3u8  | avc1.640028   4902k video only
137 mp4   1920x1080   25    |   76.46MiB 3025k https | avc1.640028   3025k video only          1080p, mp4_dash
//snipped

Then we use a for statement with awk and NF to loop through all of the fields, parsing the ID and TBR columns. The TBR column contains the bitrate. awk extracts the bitrate by keeping the last field on each line that ends with a lowercase "k":

awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}'

At this point, our output looks something like this -- just a list of mp4 IDs and bitrates from our AVC1 list:

269 135k
160 66k
230 565k
134 353k
232 2396k
...
137 3025k
270 4902k
//snipped

Afterward, we use sort to further select for the listing with the highest bitrate -- then awk and head -1 to ensure we print back only the ID of the mp4 video file listing with the highest bitrate.

sort -k2 -nr | awk '{print $1}' | head -1

Our final output is just 270, the ID, which is what we pass to yt-dlp for the video portion of the download.
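The whole selection pipeline can be tested in isolation by feeding it a couple of sample rows from the format table (the IDs and bitrates here are just the 1080p example rows from above):

```shell
# Feed two sample avc1 rows through the ID/bitrate extraction pipeline;
# the ID of the highest-bitrate format should come out the other end.
printf '%s\n' \
  '137 mp4   1920x1080   25    |   76.46MiB 3025k https | avc1.640028   3025k video only' \
  '270 mp4   1920x1080   25    | ~123.87MiB 4902k m3u8  | avc1.640028   4902k video only' |
awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}' |
sort -k2 -nr | awk '{print $1}' | head -1
# prints: 270
```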

We repeat the process for the audio file listings by grepping for lines containing the m4a format extension. Again, we print the ID and TBR bitrate columns, sorting and extracting the related ID for the audio file with the highest bitrate.

We pass both the high quality video and audio IDs to yt-dlp for downloading. yt-dlp automagically merges these two files to produce a finalized MP4.

You could modify the grep and awk statements to prefer any other video format, but this Bash script works for downloading lectures I can natively watch and listen to on OSX. Below, first, is the default yt-dlp listing of the available video formats, followed by an example run of our Bash script, which uses yt-dlp to extract the highest-quality AVC1 MP4 streams and produce a portable, high-quality video.

% yt-dlp -F "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
[youtube] Extracting URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
[youtube] dQw4w9WgXcQ: Downloading webpage
[youtube] dQw4w9WgXcQ: Downloading tv client config
[youtube] dQw4w9WgXcQ: Downloading player 6b3caec8
[youtube] dQw4w9WgXcQ: Downloading tv player API JSON
[youtube] dQw4w9WgXcQ: Downloading ios player API JSON
[youtube] dQw4w9WgXcQ: Downloading m3u8 information
[info] Available formats for dQw4w9WgXcQ:
ID  EXT   RESOLUTION FPS CH │   FILESIZE   TBR PROTO │ VCODEC          VBR ACODEC      ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sb3 mhtml 48x27        0    │                  mhtml │ images                                  storyboard
sb2 mhtml 80x45        1    │                  mhtml │ images                                  storyboard
sb1 mhtml 160x90       1    │                  mhtml │ images                                  storyboard
sb0 mhtml 320x180      1    │                  mhtml │ images                                  storyboard
233 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
234 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
249 webm  audio only      2 │    1.18MiB   46k https │ audio only          opus        46k 48k [en] low, webm_dash
250 webm  audio only      2 │    1.55MiB   61k https │ audio only          opus        61k 48k [en] low, webm_dash
140 m4a   audio only      2 │    3.27MiB  130k https │ audio only          mp4a.40.2  130k 44k [en] medium, m4a_dash
251 webm  audio only      2 │    3.28MiB  130k https │ audio only          opus       130k 48k [en] medium, webm_dash
602 mp4   256x144     13    │ ~  2.04MiB   81k m3u8  │ vp09.00.10.08   81k video only
269 mp4   256x144     25    │ ~  3.95MiB  156k m3u8  │ avc1.4D400C    156k video only
160 mp4   256x144     25    │    1.78MiB   70k https │ avc1.4d400c     70k video only          144p, mp4_dash
603 mp4   256x144     25    │ ~  3.88MiB  154k m3u8  │ vp09.00.11.08  154k video only
278 webm  256x144     25    │    2.29MiB   91k https │ vp9             91k video only          144p, webm_dash
394 mp4   256x144     25    │    1.41MiB   56k https │ av01.0.00M.08   56k video only          144p, mp4_dash
229 mp4   426x240     25    │ ~  5.73MiB  227k m3u8  │ avc1.4D4015    227k video only
133 mp4   426x240     25    │    2.88MiB  114k https │ avc1.4d4015    114k video only          240p, mp4_dash
604 mp4   426x240     25    │ ~  7.26MiB  287k m3u8  │ vp09.00.20.08  287k video only
242 webm  426x240     25    │    3.72MiB  147k https │ vp9            147k video only          240p, webm_dash
395 mp4   426x240     25    │    2.77MiB  109k https │ av01.0.00M.08  109k video only          240p, mp4_dash
230 mp4   640x360     25    │ ~ 12.09MiB  478k m3u8  │ avc1.4D401E    478k video only
134 mp4   640x360     25    │    5.42MiB  214k https │ avc1.4d401e    214k video only          360p, mp4_dash
18  mp4   640x360     25  2 │ ≈  8.68MiB  343k https │ avc1.42001E         mp4a.40.2       44k [en] 360p
605 mp4   640x360     25    │ ~ 14.26MiB  564k m3u8  │ vp09.00.21.08  564k video only
243 webm  640x360     25    │    6.32MiB  250k https │ vp9            250k video only          360p, webm_dash
396 mp4   640x360     25    │    4.85MiB  192k https │ av01.0.01M.08  192k video only          360p, mp4_dash
231 mp4   854x480     25    │ ~ 16.69MiB  660k m3u8  │ avc1.4D401E    660k video only
135 mp4   854x480     25    │    8.28MiB  328k https │ avc1.4d401e    328k video only          480p, mp4_dash
606 mp4   854x480     25    │ ~ 19.74MiB  781k m3u8  │ vp09.00.30.08  781k video only
244 webm  854x480     25    │    8.92MiB  353k https │ vp9            353k video only          480p, webm_dash
397 mp4   854x480     25    │    8.18MiB  324k https │ av01.0.04M.08  324k video only          480p, mp4_dash
232 mp4   1280x720    25    │ ~ 28.59MiB 1131k m3u8  │ avc1.4D401F   1131k video only
136 mp4   1280x720    25    │   16.01MiB  633k https │ avc1.4d401f    633k video only          720p, mp4_dash
609 mp4   1280x720    25    │ ~ 29.81MiB 1180k m3u8  │ vp09.00.31.08 1180k video only
247 webm  1280x720    25    │   14.65MiB  580k https │ vp9            580k video only          720p, webm_dash
398 mp4   1280x720    25    │   14.98MiB  593k https │ av01.0.05M.08  593k video only          720p, mp4_dash
270 mp4   1920x1080   25    │ ~123.87MiB 4902k m3u8  │ avc1.640028   4902k video only
137 mp4   1920x1080   25    │   76.46MiB 3025k https │ avc1.640028   3025k video only          1080p, mp4_dash
614 mp4   1920x1080   25    │ ~ 71.55MiB 2831k m3u8  │ vp09.00.40.08 2831k video only
248 webm  1920x1080   25    │   39.24MiB 1552k https │ vp9           1552k video only          1080p, webm_dash
399 mp4   1920x1080   25    │   27.67MiB 1095k https │ av01.0.08M.08 1095k video only          1080p, mp4_dash
616 mp4   1920x1080   25    │ ~144.16MiB 5704k m3u8  │ vp09.00.40.08 5704k video only          Premium
% ./yt.sh "https://www.youtube.com/watch?v=dQw4w9WgXcQ" 
Processing single video...
Fetching available formats for video: https://www.youtube.com/watch?v=dQw4w9WgXcQ
Selected video format: 270 [MP4 : AVC1 preferred]
Selected audio format: 140 [M4A : highest quality]
Downloading video with yt-dlp...
[youtube] Extracting URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
[youtube] dQw4w9WgXcQ: Downloading webpage
[youtube] dQw4w9WgXcQ: Downloading tv client config
[youtube] dQw4w9WgXcQ: Downloading player 6b3caec8
[youtube] dQw4w9WgXcQ: Downloading tv player API JSON
[youtube] dQw4w9WgXcQ: Downloading ios player API JSON
[youtube] dQw4w9WgXcQ: Downloading m3u8 information
[info] dQw4w9WgXcQ: Downloading 1 format(s): 270+140
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 39
[download] Destination: Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f270.mp4
[download] 100% of   78.70MiB in 00:00:27 at 2.83MiB/s
[download] Destination: Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f140.m4a
[download] 100% of    3.27MiB in 00:00:00 at 4.50MiB/s
[Merger] Merging formats into "Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].mp4"
Deleting original file Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f270.mp4 (pass -k to keep)
Deleting original file Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f140.m4a (pass -k to keep)

% exiftool Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-\[dQw4w9WgXcQ\].mp4
ExifTool Version Number         : 13.10
File Name                       : Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].mp4
Directory                       : .
File Size                       : 84 MB
File Modification Date/Time     : 2024:05:30 01:43:41-04:00
File Access Date/Time           : 2024:05:30 01:43:41-04:00
File Inode Change Date/Time     : 2025:03:15 19:30:18-04:00
File Permissions                : -rw-r--r--
File Type                       : MP4
File Type Extension             : mp4
MIME Type                       : video/mp4
Major Brand                     : MP4 Base Media v1 [IS0 14496-12:2003]
Minor Version                   : 0.2.0
Compatible Brands               : isom, iso2, avc1, mp41
Movie Header Version            : 0
Create Date                     : 0000:00:00 00:00:00
Modify Date                     : 0000:00:00 00:00:00
Time Scale                      : 1000
Duration                        : 0:03:32
Preferred Rate                  : 1
Preferred Volume                : 100.00%
Preview Time                    : 0 s
Preview Duration                : 0 s
Poster Time                     : 0 s
Selection Time                  : 0 s
Selection Duration              : 0 s
Current Time                    : 0 s
Next Track ID                   : 3
Track Header Version            : 0
Track Create Date               : 0000:00:00 00:00:00
Track Modify Date               : 0000:00:00 00:00:00
Track ID                        : 1
Track Duration                  : 0:03:32
Track Layer                     : 0
Track Volume                    : 0.00%
Image Width                     : 1920
Image Height                    : 1080
Graphics Mode                   : srcCopy
Op Color                        : 0 0 0
Compressor ID                   : avc1
Source Image Width              : 1920
Source Image Height             : 1080
X Resolution                    : 72
Y Resolution                    : 72
Bit Depth                       : 24
Color Profiles                  : nclx
Color Primaries                 : BT.709
Transfer Characteristics        : BT.709
Matrix Coefficients             : BT.709
Video Full Range Flag           : Limited
Pixel Aspect Ratio              : 1:1
Buffer Size                     : 0
Max Bitrate                     : 3023409
Average Bitrate                 : 3023409
Video Frame Rate                : 25
Matrix Structure                : 1 0 0 0 1 0 0 0 1
Media Header Version            : 0
Media Create Date               : 0000:00:00 00:00:00
Media Modify Date               : 0000:00:00 00:00:00
Media Time Scale                : 44100
Media Duration                  : 0:03:32
Media Language Code             : eng
Handler Description             : ISO Media file produced by Google Inc.
Balance                         : 0
Audio Format                    : mp4a
Audio Channels                  : 2
Audio Bits Per Sample           : 16
Audio Sample Rate               : 44100
Handler Type                    : Metadata
Handler Vendor ID               : Apple
Encoder                         : Lavf61.7.100
Media Data Size                 : 83529233
Media Data Offset               : 171478
Image Size                      : 1920x1080
Megapixels                      : 2.1
Avg Bitrate                     : 3.15 Mbps
Rotation                        : 0

Footnotes

  1. Advanced Video Coding ↩︎

Monday, March 10, 2025

Subshells in Powershell

Previously, I wrote a post about how it's possible to create a "subshell" in Windows analogous to the subshell feature available in Bash on Linux, since Microsoft Windows doesn't actually have a native subshell capability the way Linux does. The script below improves on the previous method, which used the .NET System.Diagnostics trick; this new version correctly redirects the standard output:

$x = New-Object System.Diagnostics.ProcessStartInfo
$x.FileName = "cmd.exe"
$x.Arguments = "/c echo %PATH%"
$x.UseShellExecute = $false
$x.RedirectStandardOutput = $true  
$x.EnvironmentVariables.Remove("Path")
$x.EnvironmentVariables.Add("PATH", "C:\custom\path")
$p = New-Object System.Diagnostics.Process
$p.StartInfo = $x
$p.Start() | Out-Null
$output = $p.StandardOutput.ReadToEnd()
$p.WaitForExit()
Write-Output $output
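For comparison, here is a minimal sketch of the native Bash behavior the snippet above emulates: a parenthesized subshell runs in a child process with its own copy of the environment, so the PATH override never leaks out.

```shell
# A Bash subshell inherits a copy of the environment; changes inside
# the parentheses do not affect the parent shell.
(
  PATH="/custom/path"
  echo "inside:  $PATH"
)
echo "outside: $PATH"
```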

Real-World Example

$customPath2 = "C:\custom\path\2"

$data = @{
    Path = $customPath2  
    Timestamp = Get-Date
    ProcessID = $PID  
}

$x = New-Object System.Diagnostics.ProcessStartInfo
$x.FileName = "cmd.exe"
$x.Arguments = "/c echo %PATH%"
$x.UseShellExecute = $false
$x.RedirectStandardOutput = $true
$x.RedirectStandardError = $true

$x.EnvironmentVariables.Remove("Path")
$x.EnvironmentVariables.Add("PATH", $customPath2)

$p = New-Object System.Diagnostics.Process
$p.StartInfo = $x
$p.Start() | Out-Null

$output = $p.StandardOutput.ReadToEnd()
$stderr = $p.StandardError.ReadToEnd() 
$p.WaitForExit()

$data["SubshellOutput"] = $output
$data["SubshellError"] = $stderr

$data

Name                           Value
----                           -----
ProcessID                      11852
Path                           C:\custom\path\2
SubshellOutput                 C:\custom\path\2...
SubshellError
Timestamp                      3/10/2025 7:05:01 PM

Friday, February 21, 2025

Dynamic Linking

An insightful passage on dynamic linking, global offset tables, and procedure linkage tables, from Jan Hubicka:

Much as the global offset table redirects position-independent address calculations to absolute locations, the procedure linkage table redirects position-independent function calls to absolute locations. The link editor cannot resolve execution transfers (such as function calls) from one executable or shared object to another. Consequently, the link editor arranges to have the program transfer control to entries in the procedure linkage table. On the AMD64 architecture, procedure linkage tables reside in shared text, but they use addresses in the private global offset table. The dynamic linker determines the destinations' absolute addresses and modifies the global offset table's memory image accordingly.

The much-lauded paper "How to Write Shared Libraries" by Ulrich Drepper explains this and much more in great detail (ELF structures, relocations, symbol handling, optimizations, etc.). With regard to the Global Offset Table and Procedure Linkage Table, a nice passage from Ulrich's paper:

The Global Offset Table (GOT) and Procedure Linkage Table (PLT) are the two data structures central to the ELF (Executable and Linkable Format) run-time. We will now introduce the reasons why they are used and what consequences arise from that. Relocations are created for source constructs like:
extern int foo;
extern int bar(int);

int call_bar(void) {
    return bar(foo);
}
The call to bar requires two relocations: One to load the value of foo. Another to find the address of bar. If the code were generated knowing the addresses of the variable and the function, the assembler instructions would directly load from or jump to the address. For IA32, the code would look like this:
pushl foo
call bar
This would encode the addresses of foo and bar as part of the instruction in the text segment. However, if the address is only known to the dynamic linker, the text segment would have to be modified at run-time. As we learned earlier, this must be avoided. Therefore, the code generated for DSOs (Dynamic Shared Objects), i.e., when using -fpic or -fPIC, looks like this:
movl foo@GOT(%ebx), %eax
pushl (%eax)
call bar@PLT
The address of the variable foo is now not part of the instruction. Instead it is loaded from the GOT. The address of the location in the GOT relative to the PIC register value (%ebx) is known at link-time. Therefore the text segment does not have to be changed, only the GOT.

The situation for the function call is similar. The function bar is not called directly. Instead control is transferred to a stub for bar in the PLT (indicated by bar@PLT). For IA-32 the PLT itself does not have to be modified and can be placed in a read-only segment, each entry is 16 bytes in size. Only the GOT is modified and each entry consists of 4 bytes. The structure of the PLT in an IA-32 DSO looks like this:
.PLT0:
    pushl 4(%ebx)
    jmp *8(%ebx)
    nop
    nop

.PLT1:
    jmp *name1@GOT(%ebx)
    pushl $offset1
    jmp .PLT0@PC

.PLT2:
    jmp *name2@GOT(%ebx)
    pushl $offset2
    jmp .PLT0@PC

Position-independent Executables

Address space layout randomization (ASLR) and position-independent executables (PIE) are used to improve the security of modern operating systems by making memory addresses less predictable.

Position-independent executables let systems use ASLR more effectively to randomize memory layouts at runtime. The entry-point offsets to functions remain fixed, while the base address is randomized.

$ readelf -h /usr/bin/ls
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x6d30
  Start of program headers:          64 (bytes into file)
  Start of section headers:          140328 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30

On a system with PIE-enabled binaries, the base address is randomized at load time, so runtime addresses essentially get calculated like this:

Runtime Address = Randomized Base Address + Entry Point Offset
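We can sanity-check this split with a bit of shell arithmetic, masking an observed runtime address into its randomized page base and fixed offset. The address here is just an example value; pages are 4 KiB, so the low 12 bits hold the offset:

```shell
# Split an example runtime address into its randomized page base
# (high bits) and fixed entry offset (low 12 bits).
addr=$((0x62498320c149))
printf 'base:   0x%x\n' $(( addr & ~0xfff ))
printf 'offset: 0x%x\n' $(( addr & 0xfff ))
# prints:
# base:   0x62498320c000
# offset: 0x149
```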

We can see some of the code implementing the address space layout randomization of ELF files in fs/binfmt_elf.c, in the Linux source tree on Linus Torvalds' GitHub:

	for(i = 0, elf_ppnt = elf_phdata;
	    i < elf_ex->e_phnum; i++, elf_ppnt++) {
		int elf_prot, elf_flags;
		unsigned long k, vaddr;
		unsigned long total_size = 0;
		unsigned long alignment;

		if (elf_ppnt->p_type != PT_LOAD)
			continue;

		elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
				     !!interpreter, false);

		elf_flags = MAP_PRIVATE;

		vaddr = elf_ppnt->p_vaddr;
		/*
		 * The first time through the loop, first_pt_load is true:
		 * layout will be calculated. Once set, use MAP_FIXED since
		 * we know we've already safely mapped the entire region with
		 * MAP_FIXED_NOREPLACE in the once-per-binary logic following.
		 */
		if (!first_pt_load) {
			elf_flags |= MAP_FIXED;
		} else if (elf_ex->e_type == ET_EXEC) {
			/*
			 * This logic is run once for the first LOAD Program
			 * Header for ET_EXEC binaries. No special handling
			 * is needed.
			 */
			elf_flags |= MAP_FIXED_NOREPLACE;
		} else if (elf_ex->e_type == ET_DYN) {
			/*
			 * This logic is run once for the first LOAD Program
			 * Header for ET_DYN binaries to calculate the
			 * randomization (load_bias) for all the LOAD
			 * Program Headers.
			 */

			/*
			 * Calculate the entire size of the ELF mapping
			 * (total_size), used for the initial mapping,
			 * due to load_addr_set which is set to true later
			 * once the initial mapping is performed.
			 *
			 * Note that this is only sensible when the LOAD
			 * segments are contiguous (or overlapping). If
			 * used for LOADs that are far apart, this would
			 * cause the holes between LOADs to be mapped,
			 * running the risk of having the mapping fail,
			 * as it would be larger than the ELF file itself.
			 *
			 * As a result, only ET_DYN does this, since
			 * some ET_EXEC (e.g. ia64) may have large virtual
			 * memory holes between LOADs.
			 *
			 */
			total_size = total_mapping_size(elf_phdata,
							elf_ex->e_phnum);
			if (!total_size) {
				retval = -EINVAL;
				goto out_free_dentry;
			}

			/* Calculate any requested alignment. */
			alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);

			/*
			 * There are effectively two types of ET_DYN
			 * binaries: programs (i.e. PIE: ET_DYN with PT_INTERP)
			 * and loaders (ET_DYN without PT_INTERP, since they
			 * _are_ the ELF interpreter). The loaders must
			 * be loaded away from programs since the program
			 * may otherwise collide with the loader (especially
			 * for ET_EXEC which does not have a randomized
			 * position). For example to handle invocations of
			 * "./ld.so someprog" to test out a new version of
			 * the loader, the subsequent program that the
			 * loader loads must avoid the loader itself, so
			 * they cannot share the same load range. Sufficient
			 * room for the brk must be allocated with the
			 * loader as well, since brk must be available with
			 * the loader.
			 *
			 * Therefore, programs are loaded offset from
			 * ELF_ET_DYN_BASE and loaders are loaded into the
			 * independently randomized mmap region (0 load_bias
			 * without MAP_FIXED nor MAP_FIXED_NOREPLACE).
			 */
			if (interpreter) {
				/* On ET_DYN with PT_INTERP, we do the ASLR. */
				load_bias = ELF_ET_DYN_BASE;
				if (current->flags & PF_RANDOMIZE)
					load_bias += arch_mmap_rnd();
				/* Adjust alignment as requested. */
				if (alignment)
					load_bias &= ~(alignment - 1);
				elf_flags |= MAP_FIXED_NOREPLACE;
			} else {
				/*
				 * For ET_DYN without PT_INTERP, we rely on
				 * the architectures's (potentially ASLR) mmap
				 * base address (via a load_bias of 0).
				 *
				 * When a large alignment is requested, we
				 * must do the allocation at address "0" right
				 * now to discover where things will load so
				 * that we can adjust the resulting alignment.
				 * In this case (load_bias != 0), we can use
				 * MAP_FIXED_NOREPLACE to make sure the mapping
				 * doesn't collide with anything.
				 */
				if (alignment > ELF_MIN_ALIGN) {
					load_bias = elf_load(bprm->file, 0, elf_ppnt,
							     elf_prot, elf_flags, total_size);
					if (BAD_ADDR(load_bias)) {
						retval = IS_ERR_VALUE(load_bias) ?
							 PTR_ERR((void*)load_bias) : -EINVAL;
						goto out_free_dentry;
					}
					vm_munmap(load_bias, total_size);
					/* Adjust alignment as requested. */
					if (alignment)
						load_bias &= ~(alignment - 1);
					elf_flags |= MAP_FIXED_NOREPLACE;
				} else
					load_bias = 0;
			}

			/*
			 * Since load_bias is used for all subsequent loading
			 * calculations, we must lower it by the first vaddr
			 * so that the remaining calculations based on the
			 * ELF vaddrs will be correctly offset. The result
			 * is then page aligned.
			 */
			load_bias = ELF_PAGESTART(load_bias - vaddr);
		}

//snipped
	retval = create_elf_tables(bprm, elf_ex, interp_load_addr,
				   e_entry, phdr_addr);
	if (retval < 0)
		goto out;

	mm = current->mm;
	mm->end_code = end_code;
	mm->start_code = start_code;
	mm->start_data = start_data;
	mm->end_data = end_data;
	mm->start_stack = bprm->p;

	if ((current->flags & PF_RANDOMIZE) && (snapshot_randomize_va_space > 1)) {
		/*
		 * For architectures with ELF randomization, when executing
		 * a loader directly (i.e. no interpreter listed in ELF
		 * headers), move the brk area out of the mmap region
		 * (since it grows up, and may collide early with the stack
		 * growing down), and into the unused ELF_ET_DYN_BASE region.
		 */
		if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
		    elf_ex->e_type == ET_DYN && !interpreter) {
			mm->brk = mm->start_brk = ELF_ET_DYN_BASE;
		} else {
			/* Otherwise leave a gap between .bss and brk. */
			mm->brk = mm->start_brk = mm->brk + PAGE_SIZE;
		}

		mm->brk = mm->start_brk = arch_randomize_brk(mm);
//snipped

We can visualize this with a simple C program that prints the address of its main function. PIE would also randomize any other function addresses.

#include <stdio.h>

int main() {
    // Cast to void * for %p; printing a function pointer without a
    // cast is technically undefined behavior.
    printf("Address of main: %p\n", (void *)main);
    return 0;
}
$ gcc -fpie main.c -o main
hexagr@vr:~$ ./main 
Address of main: 0x62498320c149
hexagr@vr:~$ ./main 
Address of main: 0x61bcab5a7149
hexagr@vr:~$ ./main 
Address of main: 0x587688ff8149

The fixed offset to the address of main is 0x149, while the base address (the high bits) is randomized:

High bits (randomized base address)    Low bits (entry point offset)
0x62498320c000                         0x149
0x61bcab5a7000                         0x149
0x587688ff8000                         0x149

PIC

If we instead compile with the -fPIC flag, we can make gcc generate position-independent code (assembly) which (if not optimized away) may go through the Global Offset Table (GOT).

The position-independent code (PIC) feature is distinctly different from the position-independent executable (PIE) feature.

PIE is an extension of PIC and applies to the entire binary, utilizing ASLR to randomize its base address. This is useful for standalone executables.

PIC enables code to be loaded at any address by using relative addressing. It stores absolute addresses of global variables and functions in the Global Offset Table, which resolves them at runtime. This is useful for shared libraries, since code is loaded at random addresses on systems with ASLR enabled. The Global Offset Table serves as a layer of indirection to calculate relative addresses that still allows code such as shared libraries to run in a position-independent way.

Thus, the -fpie flag is for executables, whereas the -fPIC flag would be appropriate for a shared library. For standalone executables, the compiler typically uses RIP-relative addressing: leaq main(%rip), %rax. Note that PIE executables often do not need the GOT for internal symbols (like main) because the entire binary gets relocated and often everything can work with RIP-relative addressing. But in cases where external symbols are necessary, position-independent executables will still need the Global Offset Table.

For position-independent code, the compiler emits assembly to use the Global Offset Table to calculate relative addresses: movq main@GOTPCREL(%rip), %rax.

$ gcc -fPIC -S entry.c -o entry_fpic.s
$ gcc -fpie -S entry.c -o entry_fpie.s
$ diff entry_fpic.s entry_fpie.s 
18c18
< 	movq	main@GOTPCREL(%rip), %rax
---
> 	leaq	main(%rip), %rax

Theory and Practice

Implementing something that works but that you don't entirely understand is like finding a new chord that you don't know the name of yet. You have an intuition that it's a pleasant chord—it may work—but you might lack the ability to fully articulate or describe why. In a moment such as that, it's good to pause and consult the literature until you can fully articulate it.

There's a subtle but important difference between implementing a thing and understanding a thing. If you find a way to do something but move forward without fully understanding why it works, you risk carrying forward a mistaken understanding of it. Its usefulness remains only a happy accident.

Happy accidents are ok. They're sometimes useful if they're stepping stones on the way to greater learning.

Practicing music is a lot like that. But repetition can be a double-edged sword. For example, if you make a mistake but don't stop and correct it—and instead just continue playing—then you won't actually improve your understanding or ability to play the piece. Instead, you'll just get very good at making the mistake.

Thursday, February 20, 2025

G.K. Chesterton on Volition

An interesting passage from G.K. Chesterton's "Orthodoxy":

All the will-worshippers, from Nietzsche to Mr. Davidson, are really quite empty of volition. They cannot will, they can hardly wish. And if any one wants a proof of this, it can be found quite easily. It can be found in this fact: that they always talk of will as something that expands and breaks out. But it is quite the opposite. Every act of will is an act of self-limitation. To desire action is to desire limitation. In that sense every act is an act of self-sacrifice. When you choose anything, you reject everything else. That objection, which men of this school used to make to the act of marriage, is really an objection to every act. Every act is an irrevocable selection and exclusion. Just as when you marry one woman you give up all the others, so when you take one course of action you give up all the other courses. If you become King of England, you give up the post of Beadle in Brompton. If you go to Rome, you sacrifice a rich suggestive life in Wimbledon. It is the existence of this negative or limiting side of will that makes most of the talk of the anarchic will-worshippers little better than nonsense. For instance, Mr. John Davidson tells us to have nothing to do with "Thou shalt not"; but it is surely obvious that "Thou shalt not" is only one of the necessary corollaries of "I will." "I will go to the Lord Mayor's Show, and thou shalt not stop me." Anarchism adjures us to be bold creative artists, and care for no laws or limits. But it is impossible to be an artist and not care for laws and limits. Art is limitation; the essence of every picture is the frame. If you draw a giraffe, you must draw him with a long neck. If, in your bold creative way, you hold yourself free to draw a giraffe with a short neck, you will really find that you are not free to draw a giraffe. The moment you step into the world of facts, you step into a world of limits.

Wednesday, February 19, 2025

Searching for Elf Magic

Elfland

Just as Windows has its various executable formats, so too does Linux. In this land, there are ELFs, also known as Executable and Linkable Format files. If we look at elf.h, we can see the structures which constitute the ELF format:

#define EI_NIDENT       16
 
typedef struct {
        unsigned char   e_ident[EI_NIDENT]; 
        Elf32_Half      e_type;
        Elf32_Half      e_machine;
        Elf32_Word      e_version;
        Elf32_Addr      e_entry;
        Elf32_Off       e_phoff;
        Elf32_Off       e_shoff;
        Elf32_Word      e_flags;
        Elf32_Half      e_ehsize;
        Elf32_Half      e_phentsize;
        Elf32_Half      e_phnum;
        Elf32_Half      e_shentsize;
        Elf32_Half      e_shnum;
        Elf32_Half      e_shstrndx;
} Elf32_Ehdr;

typedef struct {
        unsigned char   e_ident[EI_NIDENT]; 
        Elf64_Half      e_type;
        Elf64_Half      e_machine;
        Elf64_Word      e_version;
        Elf64_Addr      e_entry;
        Elf64_Off       e_phoff;
        Elf64_Off       e_shoff;
        Elf64_Word      e_flags;
        Elf64_Half      e_ehsize;
        Elf64_Half      e_phentsize;
        Elf64_Half      e_phnum;
        Elf64_Half      e_shentsize;
        Elf64_Half      e_shnum;
        Elf64_Half      e_shstrndx;
} Elf64_Ehdr;

Straightforward enough? This is how the kernel sees ELF files. Here's a quick rundown of what each of these fields represents within the ELF format:

  • e_ident: stores the file's identification info, like magic number, class, and endianness.
  • e_type: tells the file's type (e.g., executable, shared library).
  • e_machine: describes the architecture (e.g., x86, ARM).
  • e_version: version of the ELF format.
  • e_entry: address where the program starts running.
  • e_phoff: offset to the program header table.
  • e_shoff: offset to the section header table.
  • e_flags: flags for specific machine behaviors.
  • e_ehsize: size of the ELF header.
  • e_phentsize: size of each program header entry.
  • e_phnum: number of program header entries.
  • e_shentsize: size of each section header entry.
  • e_shnum: number of section headers.
  • e_shstrndx: index to the section name string table.

If we use readelf we can see this for ourselves, along with program and section headers, offsets, relocations, symbol tables, and more.

$ readelf -a /usr/bin/gzip 
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           ARM
  Version:                           0x1
  Entry point address:               0x11fe0
  Start of program headers:          52 (bytes into file)
  Start of section headers:          71084 (bytes into file)
  Flags:                             0x5000400, Version5 EABI, hard-float ABI
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         9
  Size of section headers:           40 (bytes)
  Number of section headers:         28
  Section header string table index: 27
$ readelf -l /usr/bin/ls

Elf file type is DYN (Position-Independent Executable file)
Entry point 0x6d30
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000000036f8 0x00000000000036f8  R      0x1000
  LOAD           0x0000000000004000 0x0000000000004000 0x0000000000004000
                 0x0000000000014db1 0x0000000000014db1  R E    0x1000
  LOAD           0x0000000000019000 0x0000000000019000 0x0000000000019000
                 0x00000000000071b8 0x00000000000071b8  R      0x1000
  LOAD           0x0000000000020f30 0x0000000000021f30 0x0000000000021f30
                 0x0000000000001348 0x00000000000025e8  RW     0x1000
  DYNAMIC        0x0000000000021a38 0x0000000000022a38 0x0000000000022a38
                 0x0000000000000200 0x0000000000000200  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  NOTE           0x0000000000000368 0x0000000000000368 0x0000000000000368
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  GNU_EH_FRAME   0x000000000001e170 0x000000000001e170 0x000000000001e170
                 0x00000000000005ec 0x00000000000005ec  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000020f30 0x0000000000021f30 0x0000000000021f30
                 0x00000000000010d0 0x00000000000010d0  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt 
   03     .init .plt .plt.got .plt.sec .text .fini 
   04     .rodata .eh_frame_hdr .eh_frame 
   05     .init_array .fini_array .data.rel.ro .dynamic .got .data .bss 
   06     .dynamic 
   07     .note.gnu.property 
   08     .note.gnu.build-id .note.ABI-tag 
   09     .note.gnu.property 
   10     .eh_frame_hdr 
   11     
   12     .init_array .fini_array .data.rel.ro .dynamic .got 

I won't be covering all of the ELF sections in this blog post. For a comprehensive breakdown of ELF sections, I recommend this page: https://stevens.netmeister.org/631/elf.html

In this post, we're going to write x64 assembly to search for ELF magic numbers: 7f 45 4c 46.

So, we'll begin by loading all of the strings we need for our program into the .data section. We'll set up variables for the magic number, success message, usage message, error message, and a fail message.

section .data
    elf_magic db 0x7F, 'E', 'L', 'F'   ; ELF magic number
    msg db "[+] ELF magic detected", 10 ; msg to print
    msg_len equ $ - msg
    usage_msg db "Usage: ./elf_check <filename>", 10
    usage_msg_len equ $ - usage_msg
    error_msg db "Error opening file. Please supply a valid file path.", 10
    error_msg_len equ $ - error_msg
    not_elf db "[-] No ELF magic detected", 10
    not_elf_len equ $ - not_elf

section .bss
    buffer resb 4            ; allocate 4 bytes for our read buffer

section .text
    global _start

We set the strings and their lengths. elf_magic db defines our byte signature, while msg db holds our success string plus an ASCII newline (10). We also define each string's length as a constant with equ: the $ symbol denotes the current assembly position, so $ - msg (the current position minus the address where msg begins) yields the length of msg. We repeat this pattern for the other strings in our program.

We also create a .bss section -- block starting symbols -- which holds statically allocated, uninitialized data. The resb 4 directive reserves four bytes for buffer, which we'll fill with the first four bytes of the file and then loop through, byte by byte, later in our program.

Each of these directives communicates to the assembler and linker the structure of our executable. For example, global _start in the .text section tells the linker (ld) where our program actually begins.

Next we'll use x86_64 instructions to open a file. If no file path is supplied, we jump via jl (jump if less) to a usage message indicating that the program requires a valid file path.

With a valid file path supplied, we handle its file descriptor and prepare to process it. Some familiarity with asm is assumed. But I've tried to make the comments clear:

We use the open system call and the O_RDONLY flag to open our file. After setting the arguments, we invoke the call and test if it was successful. If the test is negative, we head for the exit, calling another variation of jump (js; jump if sign flag is set) and bailing out to the .error_opening message.

But if a valid file descriptor is found, we begin processing it and setup to enter a loop to compare the bytes of the supplied file to the ELF magic byte array we stashed in our .data section.

_start:
    mov rdi, [rsp]          ; argc sits at the top of the stack
    cmp rdi, 2              ; compare argc to 2 (our executable + 1 argument)
    jl .usage_msg           ; no arg? jump to usage_msg; else, get the filename from argv[1]
    mov rdi, [rsp + 16]     ; argv[1] (at rsp + 16) to rdi

    ; open file (open syscall)
    mov rsi, 0x0            ; rsi, O_RDONLY
    mov rdx, 0              ; rdx for mode, unused
    mov rax, 2              ; syscall number for open
    syscall                 ; open(argv[1], O_RDONLY)

    ; success? check file descriptor, rax 
    test rax, rax           ; check if fd is valid
    js .error_opening       ; jump to .error_opening if open failed

    ; save file descriptor in rbx
    mov rbx, rax            ; copy file descriptor from rax to rbx

    ; read first 4 bytes from the file (read syscall)
    mov rdi, rbx            ; rdi, file descriptor
    lea rsi, [buffer]       ; load buffer to rsi to store bytes
    mov rdx, 4              ; arg to read four bytes
    mov rax, 0              ; syscall number for read
    syscall                 ; read(file_desc, buffer, 4)
    mov rdi, elf_magic      ; rdi to point to the ELF magic number
    mov rcx, 4              ; set loop counter to 4 bytes
    jmp .compare_loop       ; jump to .compare_loop

When we begin the "read first 4 bytes" portion of the code, we set up the arguments according to the Linux x86_64 syscall calling convention.

Afterward, we invoke a syscall. This is a call to the read() function which does read(fd, buffer, 4).

The system call to read() is really doing this:

read(rdi, rsi, rdx) <------> read(file_descriptor, buffer, 4)

Lastly, we do mov rdi, elf_magic to point rdi at the byte signature we're looking for, and prepare rcx as a loop counter by setting it to 4 just before jumping into .compare_loop.

If you're not familiar with calling conventions, you can read more about them here: "Arguments Passing in Linux"

Register  Argument (User Space)  Argument (Kernel Space)
%rax      Not Used               System Call Number
%rdi      Argument 1             Argument 1
%rsi      Argument 2             Argument 2
%rdx      Argument 3             Argument 3
%r10      Not Used               Argument 4
%r8       Argument 5             Argument 5
%r9       Argument 6             Argument 6
%rcx      Argument 4             Destroyed
%r11      Not Used               Destroyed

Next, we want to compare the bytes we've read into the buffer against the ELF magic bytes stored in the .data section of our program. Note that registers such as al and bl provide access to the lowest single byte of rax and rbx.

Here's a chart of the registers and their related counterparts. Note: the narrower names are views of the same register, not independent storage -- sil is literally the low byte of rsi.

That matters when a register is doing double duty as a pointer. Loading a byte from [rsi] into sil would overwrite the low byte of the very pointer we're reading through, corrupting it before the next iteration. So for a byte-by-byte loop, load into a scratch register such as al or dl instead.

8-byte Register Bytes 0-3 Bytes 0-1 Byte 0
%rax %eax %ax %al
%rcx %ecx %cx %cl
%rdx %edx %dx %dl
%rbx %ebx %bx %bl
%rsi %esi %si %sil
%rdi %edi %di %dil
%rsp %esp %sp %spl
%rbp %ebp %bp %bpl
%r8 %r8d %r8w %r8b
%r9 %r9d %r9w %r9b
%r10 %r10d %r10w %r10b
%r11 %r11d %r11w %r11b
%r12 %r12d %r12w %r12b
%r13 %r13d %r13w %r13b
%r14 %r14d %r14w %r14b
%r15 %r15d %r15w %r15b

Our compare loop keeps rsi pointing into the buffer and rdi pointing into elf_magic, and loads each byte into the scratch registers al and dl for comparison:

.compare_loop:
    mov al, byte [rsi]      ; load byte from buffer
    mov dl, byte [rdi]      ; load byte from elf_magic
    cmp al, dl              ; compare bytes
    jne .not_elf            ; not equal, not ELF
    inc rsi                 ; move to next byte in buffer
    inc rdi                 ; move to next byte in elf_magic
    loop .compare_loop      ; decrement rcx; repeat until it reaches 0

    ; if compare_loop completes, the file has ELF magic
    ; print msg
    mov rdi, 1              ; file descriptor 1, stdout, to rdi
    lea rsi, [msg]          ; load address of msg to rsi 
    mov rdx, msg_len        ; length of msg to rdx
    mov rax, 1              ; syscall number for write
    syscall                 ; write(1, msg, msg_len)

    ; exit syscall
    mov rax, 60             ; syscall number for exit
    xor rdi, rdi            ; exit code 0, success
    syscall                 ; exit(0)

This iterates through the bytes, looping four times courtesy of the rcx counter we set in _start. As the loop runs, inc (increment) advances the pointers in rsi and rdi.

If the bytes we load from the buffer match the bytes in elf_magic, we fall through to printing the success message: we copy file descriptor 1 (stdout) into rdi, use lea to load the effective address of msg into rsi, and move the msg_len we defined in the .data section into rdx. Last, we invoke the write syscall. If all goes right, we should see the message: "[+] ELF magic detected"

However, if .compare_loop finds a byte that doesn't match the ELF magic signature, we jump via jne (jump if not equal) to the .not_elf label.

Below is our .not_elf routine. We'll reuse this epilogue pattern for exiting the program two more times, for both our .error_opening and .usage_msg routines.

.not_elf:
    ; print .not_elf message
    mov rdi, 1              ; file descriptor 1, stdout, to rdi
    lea rsi, [not_elf]      ; load address of not_elf msg to rsi
    mov rdx, not_elf_len    ; not_elf msg length to rdx
    mov rax, 1              ; syscall number for write
    syscall                 ; write(1, "not_elf", len)

    ; exit syscall
    mov rax, 60             ; syscall number for exit
    mov rdi, 1              ; exit code 1, failure
    syscall                 ; exit(1)

If we assemble with nasm and then link with ld, we can test whether our executable successfully finds ELF file signatures.

$ nasm -f elf64 -o elfCheck.o elfCheck.asm 
$ ld -s -o elfCheck elfCheck.o
$ ./elfCheck /etc/hostname
[-] No ELF magic detected
$ ./elfCheck /usr/bin/gzip
[+] ELF magic detected

It works! But wait, what if we spoof an ELF file? Then our ELF magic checker has been foiled!

$ echo -n -e '\x7f\x45\x4c\x46' > spoofed_elf
$ xxd spoofed_elf
00000000: 7f45 4c46                                .ELF
$ ./elfCheck spoofed_elf 
[+] ELF magic detected

Rats. We'll have to build an ELF validator that actually checks the structure of the ELF header instead of just searching for magic numbers.

Searching for ELF magic with assembly

Tuesday, February 18, 2025

Patching Apparmor

For a little over a year, AppArmor has been broken on some distributions because of a two-line bug that can be found here.

On a default Ubuntu 24.04.1 LTS installation, trying to run aa-enforce /etc/apparmor.d/* to enable apparmor profiles fails with:

Traceback (most recent call last):
  File "/usr/sbin/aa-enforce", line 33, in <module>
    tool.cmd_enforce()
  File "/usr/lib/python3/dist-packages/apparmor/tools.py", line 134, in cmd_enforce
    for (program, prof_filename, output_name) in self.get_next_for_modechange():
  File "/usr/lib/python3/dist-packages/apparmor/tools.py", line 97, in get_next_for_modechange
    aaui.UI_Info(_('Profile for %s not found, skipping') % output_name)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
An unexpected error occurred!

The bug has been fixed upstream in AppArmor, but the fix hasn't made its way into Ubuntu's packages yet. If we pull down the raw corrected file, we can diff and patch and get AppArmor running again. The raw fixed file can be found on GitLab here.

Let's make a copy of our old tools.py file, just in case. We'll save it to tools.py_backup and then download the new updated version. And finally, diff, patch, and test the new file:

$ cp /usr/lib/python3/dist-packages/apparmor/tools.py /usr/lib/python3/dist-packages/apparmor/tools.py_backup
$ wget https://gitlab.com/apparmor/apparmor/-/raw/6f9e841e74f04cac78da71fd2e8af3f973af94fc/utils/apparmor/tools.py -O /tmp/tools.py
$ diff /usr/lib/python3/dist-packages/apparmor/tools.py /tmp/tools.py 
93c93
<         for (program, _, prof_filename) in self.get_next_to_profile():
---
>         for (program, _ignored, prof_filename) in self.get_next_to_profile():
165c165
<         for (program, _, prof_filename) in self.get_next_to_profile():
---
>         for (program, _ignored, prof_filename) in self.get_next_to_profile():

Nice, we can clearly see the same changes in the AppArmor GitLab repo at commit 6f9e841e.

Diff, Patch, Repeat

If we wanted to create a patch file, we could do so by just saving the diff to an output file, like so:

$ diff -u /usr/lib/python3/dist-packages/apparmor/tools.py /tmp/tools.py > /tmp/fix.diff
$ cat /tmp/fix.diff 
--- /usr/lib/python3/dist-packages/apparmor/tools.py	2024-11-29 20:48:05.365220486 -0500
+++ /tmp/tools.py	2025-02-18 09:39:34.016987110 -0500
@@ -90,7 +90,7 @@
     def get_next_for_modechange(self):
         """common code for mode/flags changes"""
 
-        for (program, _, prof_filename) in self.get_next_to_profile():
+        for (program, _ignored, prof_filename) in self.get_next_to_profile():
             output_name = prof_filename if program is None else program
 
             if not os.path.isfile(prof_filename) or is_skippable_file(prof_filename):
@@ -162,7 +162,7 @@
     def cmd_autodep(self):
         apparmor.loadincludes()
 
-        for (program, _, prof_filename) in self.get_next_to_profile():
+        for (program, _ignored, prof_filename) in self.get_next_to_profile():
             if not program:
                 aaui.UI_Info(_('Please pass an application to generate a profile for, not a profile itself - skipping %s.') % prof_filename)
                 continue

Afterward, we could patch the file like so:

$ patch /usr/lib/python3/dist-packages/apparmor/tools.py /tmp/fix.diff

Or, simply:

$ patch < /tmp/fix.diff

Similarly, we could also reverse the patch with the -R flag and the diff file:

$ patch -R /usr/lib/python3/dist-packages/apparmor/tools.py /tmp/fix.diff

After applying the AppArmor patch, we can enable apparmor-profiles successfully again with the aa-enforce tool:

$ sudo aa-enforce /etc/apparmor.d/*
Setting /etc/apparmor.d/1password to enforce mode.
Profile for /etc/apparmor.d/abi not found, skipping
Profile for /etc/apparmor.d/abstractions not found, skipping
Profile for /etc/apparmor.d/apache2.d not found, skipping
Setting /etc/apparmor.d/balena-etcher to enforce mode.
Setting /etc/apparmor.d/bin.ping to enforce mode.
...
