Monday, March 31, 2025

Latin1 vs UTF8

Latin1 was the early default character set for encoding documents delivered via HTTP for MIME types beginning with /text.

Today, only around only 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use today.

A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages.

The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This fact is due to UTF8's added complexity, using multi-byte sequences for characters beyond the general ASCII range. This is also why you can't just throw any sequence of bytes at it and expect it to work. Parsing the UTF8 encoding scheme can be irritatingly problematic or even have security implications.

sequences = [
    b'\x41\x42\x43',  # valid in Latin1 and UTF8
    b'\xe2\x82\xac',  # valid Latin1 and UTF8
    b'\x80\x81\x82',  # valid in Latin1, invalid in UTF8 
    b'\x41\x42\x80',  # valid in Latin1, invalid in UTF8
]

def decode(sequences):
    for seq in sequences:
        print(f"Decoding: {seq}")

        try:
            latin1_decoded = seq.decode('latin1')
            print(f"  Decoded with Latin1: {latin1_decoded}")
        except Exception as e:
            print(f"  Error decoding with Latin1: {e}")

        try:
            utf8_decoded = seq.decode('utf-8')
            print(f"  Decoded with UTF8: {utf8_decoded}")
        except Exception as e:
            print(f"  Error decoding with UTF8: {e}")

decode(sequences)

$ python3 latin.py
Decoding: b'ABC'
  ...with Latin1: ABC
  ...with UTF8: ABC
Decoding: b'\xe2\x82\xac'
  ...with Latin1: รข‚¬
  ...with UTF8: €
Decoding: b'\x80\x81\x82'
  ...with Latin1: €‚
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Decoding: b'AB\x80'
  ...with Latin1: AB€
  Error decoding with UTF8: 'utf-8' codec can't decode byte 0x80 in position 2: invalid start byte

Sunday, March 30, 2025

Too much efficiency makes everything worse

From "Overfitting and the strong version of Goodhart's law":
Increased efficiency can sometimes, counterintuitively, lead to worse outcomes. This is true almost everywhere. We will name this phenomenon the strong version of Goodhart's law. As one example, more efficient centralized tracking of student progress by standardized testing seems like such a good idea that well-intentioned laws mandate it. However, testing also incentivizes schools to focus more on teaching students to test well, and less on teaching broadly useful skills. As a result, it can cause overall educational outcomes to become worse. Similar examples abound, in politics, economics, health, science, and many other fields.

[...] This same counterintuitive relationship between efficiency and outcome occurs in machine learning, where it is called overfitting. [...] If we keep on optimizing the proxy objective, even after our goal stops improving, something more worrying happens. The goal often starts getting worse, even as our proxy objective continues to improve. Not just a little bit worse either — often the goal will diverge towards infinity.

This is an extremely general phenomenon in machine learning. It mostly doesn’t matter what our goal and proxy are, or what model architecture we use. If we are very efficient at optimizing a proxy, then we make the thing it is a proxy for grow worse.

Saturday, March 29, 2025

Bootstrapping and the Central Limit Theorem

If you've ever seen a data visualization, you've probably seen a Bell Curve or a normal distribution. But this emergent property of many data visualizations is actually a result of the law of large numbers and the central limit theorem.

The central limit theorem tells us that the distribution of a normalized version of any sample mean will eventually converge to a standard normal distribution.

For example, let's say that we wish to chart the first fifty most popular science-fiction books on Goodreads by the number of pages they contain.

Our initial sample will look something like this:

pageCounts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258, 
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

If we want to plot our original sample of books, we could do something like:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pageCounts = np.array([
    324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
    476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
    382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
    250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])

plt.figure(figsize=(7, 5))
sns.histplot(page_counts, bins=10, kde=False, color='#1f77b4', edgecolor='black')
plt.title('Histogram of Book Pages')
plt.xlabel('Page Count')
plt.ylabel('Frequency')
plt.savefig("histogram.jpg", dpi=300, bbox_inches='tight')
plt.close()

This will produce a chart like:

But if we want to normalize and bootstrap our dataset, we will have to resample it. Replacement sampling, which we will use in this example, works like this. Let us say that we have a data set of only:

pageCounts = np.array([
	216, 324, 385
])

The resampling process will randomly sample from this set. For example:

  • Resample #1: [216, 324, 324] -> mean = 288.0
  • Resample #2: [385, 385, 216] -> mean = 328.67
  • Resample #3: [324, 216, 216] -> mean = 252.0

If we repeat this process many times, the distribution of our resampled means will approximate a normal distribution, as predicted by the Central Limit Theorem. We can append the following Python code to bootstrap our dataset and graph it:

np.random.seed(42)
num_samples = 10000
bootstrap_means = np.random.choice(page_counts, (num_samples, len(page_counts)),
replace=True).mean(axis=1)

plt.figure(figsize=(7, 5))
sns.histplot(bootstrap_means, bins=30, kde=True, color='#ff7f0e', edgecolor='black')
plt.title('Bootstrapped Distribution of Page Counts')
plt.xlabel('Mean Page Count')
plt.ylabel('Frequency')
plt.savefig("bootstrapped_distribution.jpg", dpi=300, bbox_inches='tight')  
plt.close() 

This process is extremely useful for both modeling and hypothesis testing. If we want to make a claim about a dataset, such as page counts of science fiction books — but we only get a small sample of science fiction books to work with—we can use bootstrapping to generate many simulations of the dataset and sample the distribution of the statistic we want to inquire about.

It's important to note that resampling isn't done to estimate the distribution—our sample itself already represents a data model. In this case, it represents page counts of science fiction books.

Rather, by resampling, we approximate the sampling distribution of a given statistic, such as the mean. This may allow us to make inferences about the broader dataset, even when the original sample size is small.

For example, we could additionally assess confidence intervals, which we'll discuss in a future post.

Wednesday, March 26, 2025

Unlearning, or Proof by Contradiction

Sometimes, we have to unlearn the things we initially learned. And I don't mean this in the sense of having been deliberately deceived. Rather, I mean that to some extent, there are actually many situations in life that involve necessary lies—or believing things that are wrong for perfectly rational reasons. Sometimes it is only after we have consumed and digested such a falsehood that we can see the truth at all. Really, this form of learning is not unlike some parts of math.

Consider a mathematical proof in which we begin by assuming that something is one way. But by the end of the proof, we may realize, through contradiction, that it's actually another way.

Let us take the number 2 and generously hypothesize that the square root of 2 is actually rational. If this assumption were true, we should be able to prove it with an equation. Let the square root of 2 be the lowest form of $\frac{p}{q}$.

Since squares of even numbers are even, and squares of odd numbers are odd, it follows that in order to get back to two, we would have to raise both p and q to the power of 2, like this:

\[ 2 = \frac{p^2}{q^2} \]

If we multiply both sides by $q^2$, we can get rid of the denominator. Now we get $ p^2 = 2q^2 $. From here, we must infer that $p$ is an even number.

With our generous assumption that $p$ is even, let us substitute $p$ for $2r$, where r is an integer. Let us test our hypothesis with the following equation, simplifying both sides:

\[ (2r)^2 = 2q^2 \] \[ 4r^2 = 2q^2 \]

Uh oh. Now we've hit a snag. From here, if we attempt to divide by two on both sides, we get:

\[ 2r^2 = q^2 \]

And we cannot further divide or simplify our equation. At least, we can not do so within the realm of rational numbers!

How can this be? Remember, our initial hypothesis that the square root of 2 was rational rested on the assumption that $\frac{p}{q}$ was in its lowest form. But now here we see that $2r^2$ is equal to $q^2$. In other words, both $p$ and $q$ are divisible by 2—which contradicts our original claim that $\frac{p}{q}$ was in lowest terms.

This means they still share a common factor. Thus, neither side is in its lowest form. Proof by contradiction that the square root of two is not rational after all.

Sunday, March 16, 2025

Patterns and the Stock Market

On the random walk hypothesis and post-hoc explanations for describing natural processes, from "Patterns and the Stock Market":

While it's certainly entertaining to spin post-hoc explanations of market activity, it's also utterly futile. The market, after all, is a classic example of a "random walk," since the past movement of any particular stock cannot be used to predict its future movement. This inherent randomness was first proposed by the economist Eugene Fama, in the early 1960's. Fama looked at decades of stock market data in order to prove that no amount of rational analysis or knowledge (unless it was illicit insider information) could help you figure out what would happen next. All of the esoteric tools and elaborate theories used by investors to make sense of the market were pure nonsense. Wall Street was like a slot machine.

Alas, the human mind can't resist the allure of explanations, even if they make no sense. We're so eager to find correlations and causation that, when confronted with an inherently stochastic process - like the DJIA, or a slot machine - we invent factors to fixate on. The end result is a blinkered sort of overconfidence, in which we're convinced we've solved a system that has no solution.

"Is an economic recession a divergence from the market trend that eventually reverses over time? Or is it more analogous to a random walk?" asked the student.

The master struck him on the head with a walking stick. "Economic recessions are primarily qualitative; attempting to measure them is meaningless."

Quantitative fundamentals play a role in shaping market dynamics, but in a Bayesian spirit, so too does information and discourse circulating within those markets. "It's priced in," is an expression people use to describe the way that the market's dynamic is not just a sum of monetary fundamentals but also a sum of qualitative sentiment, which is far more difficult to quantify.

Sure, you could try to measure a market recession with GDP output, but if you want an even better attempt at understanding one, you may have greater success communicating with actual market participants—but even then, perspectives will be wildly subjective.

Recessions aside—one could also attempt, at any time, to use technical analysis to predict short-term stock prices. But you may also simply end up straining at gnats.

Furthermore, if someone is claiming to know the true reasons why the market went up or down, there's a significant possibility they are lying. Either by not knowing any better, or deliberately lying.

If they actually knew—if they actually possessed that knowledge—they would have used it to make money.

Wednesday, March 12, 2025

yt-dlp Archiving, Improved

One annoying thing about YouTube is that, by default, some videos are now served in .webm format or use VP9 encoding. However, I prefer storing media in more widely supported codecs and formats, like .mp4, which has broader support and runs on more devices than .webm files. And sometimes I prefer AVC1 MP4 encoding because it just works out of the box on OSX with QuickTime, as QuickTime doesn't natively support VP9/VPO9. AVC1-encoded MP4s are still the most portable video format.

AVC1 ... is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019.[1]

yt-dlp, the command-line audio/video downloader for YouTube videos, is a great project. But between YouTube supporting various codecs and compatibility issues with various video players, this can make getting what you want out of yt-dlp a bit more challenging:

$ yt-dlp -f "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best" https://www.youtube.com/watch?v=dQw4w9WgXcQ

For example, the format command above does not actually properly extract the best possible formats for all YouTube urls on my OSX machine.

This usually happens in cases where a YouTube URL tries to serve a .webm file. If you were to try using the above format flag to attempt extracting the best quality mp4 compatible audio and video from a list of youtube urls -- and you come across a YouTube url that serves a .webm file -- yt-dlp won't error out, abort, or skip the url. Instead, yt-dlp will extract and generate video that's improperly formatted -- .mp4 files that cannot be opened or played.

However, we can fix this problem without even bothering yt-dlp with a pull request. Because yt-dlp does give us the capability to dump out all of the possible audio and video formats available for any video by using the -F flag:

$ yt-dlp -F "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
[youtube] Extracting URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
[youtube] dQw4w9WgXcQ: Downloading webpage
[youtube] dQw4w9WgXcQ: Downloading tv client config
[youtube] dQw4w9WgXcQ: Downloading player b21600d5
[youtube] dQw4w9WgXcQ: Downloading tv player API JSON
[youtube] dQw4w9WgXcQ: Downloading ios player API JSON
[youtube] dQw4w9WgXcQ: Downloading m3u8 information
[info] Available formats for dQw4w9WgXcQ:
ID  EXT   RESOLUTION FPS CH │   FILESIZE   TBR PROTO │ VCODEC          VBR ACODEC      ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sb3 mhtml 48x27        0    │                  mhtml │ images                                  storyboard
sb2 mhtml 80x45        1    │                  mhtml │ images                                  storyboard
sb1 mhtml 160x90       1    │                  mhtml │ images                                  storyboard
sb0 mhtml 320x180      1    │                  mhtml │ images                                  storyboard
233 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
234 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
249 webm  audio only      2 │    1.18MiB   46k https │ audio only          opus        46k 48k [en] low, webm_dash
250 webm  audio only      2 │    1.55MiB   61k https │ audio only          opus        61k 48k [en] low, webm_dash
140 m4a   audio only      2 │    3.27MiB  130k https │ audio only          mp4a.40.2  130k 44k [en] medium, m4a_dash
251 webm  audio only      2 │    3.28MiB  130k https │ audio only          opus       130k 48k [en] medium, webm_dash
602 mp4   256x144     13    │ ~  2.04MiB   81k m3u8  │ vp09.00.10.08   81k video only
269 mp4   256x144     25    │ ~  3.95MiB  156k m3u8  │ avc1.4D400C    156k video only
160 mp4   256x144     25    │    1.78MiB   70k https │ avc1.4d400c     70k video only          144p, mp4_dash
...
270 mp4   1920x1080   25    │ ~123.87MiB 4902k m3u8  │ avc1.640028   4902k video only
//snipped

It turns out it's actually much better to first manually list the formats this way, use grep and awk to extract the best possible codecs for an mp4 file, and then run yt-dlp with the specifically related codecs for each video URL. Here's a Bash script to automate this process, which makes downloading stuff from YouTube easier, in my opinion:

#!/bin/bash

if [ -z "$1" ]; then
    echo "Usage: $0 <youtube_url>"
    exit 1
fi

url="$1"

processVideo() {
    local videoUrl="$1"

    echo "Fetching available formats for video: $videoUrl"
    formats=$(yt-dlp -F "$videoUrl")
    if [ $? -ne 0 ]; then
        echo "Error: Failed to fetch formats for $videoUrl. Is yt-dlp installed and the URL valid?"
        return
    fi

    videoFormat=$(echo "$formats" | grep 'mp4' | grep -E 'avc1' | \
    awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}' | \
    sort -k2 -nr | awk '{print $1}' | head -1)

    if [ -z "$videoFormat" ]; then
        echo "No AVC1 video format found, falling back to any MP4 format."
        videoFormat=$(echo "$formats" | grep 'mp4' | \
        awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}' | \
        sort -k2 -nr | awk '{print $1}' | head -1)
    fi

    audioFormat=$(echo "$formats" | grep 'm4a' | \
    awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}' | \
    sort -k2 -nr | awk '{print $1}' | head -1)

    if [ -z "$videoFormat" ] || [ -z "$audioFormat" ]; then
        echo "Error: No compatible MP4 video or M4A audio formats found for $videoUrl!"
        return
    fi

    echo "Selected video format: $videoFormat [MP4 : AVC1 preferred]"
    echo "Selected audio format: $audioFormat [M4A : highest quality]"

    echo "Downloading video with yt-dlp..."
    yt-dlp --restrict-filenames \
    -f "${videoFormat}+${audioFormat}" \
    --merge-output-format mp4 "$videoUrl"

    if [ $? -ne 0 ]; then
        echo "Error: Failed to download video. Check the format IDs and URL."
    fi
}

isPlaylist() {
    if echo "$url" | grep -q "list="; then
        return 0 
    else
        return 1 
    fi
}

if isPlaylist; then
    echo "Processing playlist..."
    videoUrls=$(yt-dlp --flat-playlist --get-url "$url")

    if [ -z "$videoUrls" ]; then
        echo "Error: No videos found in the playlist. Is the URL correct?"
        exit 1
    fi

    for videoUrl in $videoUrls; do
        echo "Processing video: $videoUrl"
        processVideo "$videoUrl"
    done
else
    echo "Processing single video..."
    processVideo "$url"
fi

We grab the entire "available formats" table as input, storing it as plaintext in the $formats variable. We then grep $formats for 'mp4' listings, then grep again, further filtering for listings that use the AVC1 H.264 codec. If it doesn't find AVC1, we fall back to simply whatever is MP4 compatible. After filtering twice with grep, our list looks something like this:

269 mp4   256x144     25    | ~  3.95MiB  156k m3u8  | avc1.4D400C    156k video only
160 mp4   256x144     25    |    1.78MiB   70k https | avc1.4d400c     70k video only          144p, mp4_dash
229 mp4   426x240     25    | ~  5.73MiB  227k m3u8  | avc1.4D4015    227k video only
133 mp4   426x240     25    |    2.88MiB  114k https | avc1.4d4015    114k video only          240p, mp4_dash
230 mp4   640x360     25    | ~ 12.09MiB  478k m3u8  | avc1.4D401E    478k video only
134 mp4   640x360     25    |    5.42MiB  214k https | avc1.4d401e    214k video only          360p, mp4_dash
18  mp4   640x360     25  2 | ≈  8.68MiB  343k https | avc1.42001E         mp4a.40.2       44k [en] 360p
231 mp4   854x480     25    | ~ 16.69MiB  660k m3u8  | avc1.4D401E    660k video only
135 mp4   854x480     25    |    8.28MiB  328k https | avc1.4d401e    328k video only          480p, mp4_dash
232 mp4   1280x720    25    | ~ 28.59MiB 1131k m3u8  | avc1.4D401F   1131k video only
136 mp4   1280x720    25    |   16.01MiB  633k https | avc1.4d401f    633k video only          720p, mp4_dash
270 mp4   1920x1080   25    | ~123.87MiB 4902k m3u8  | avc1.640028   4902k video only
137 mp4   1920x1080   25    |   76.46MiB 3025k https | avc1.640028   3025k video only          1080p, mp4_dash
//snipped

Then we use a for statement with awk and NF to loop through all of the fields, parsing the ID and TBR columns. The TBR column contains the bitrate. awk helps to extract the bitrate from the tbr table column, the first field the parser sees ending with a lowercase "k.":

awk '{for (i=1; i<=NF; i++) if ($i ~ /k$/) tbr=$i; print $1, tbr}'

At this point, our output looks something like this -- just a list of mp4 IDs and bitrates from our AVC1 list:

269 135k
160 66k
230 565k
134 353k
232 2396k
...
137 3025k
270 4902k
//snipped

Afterward, we use sort to further select for the listing with the highest bitrate -- then awk and head -1 to ensure we print back only the ID of the mp4 video file listing with the highest bitrate.

sort -k2 -nr | awk '{print $1}' | head -1)

Our final output is just 270, the ID, which is what we pass to yt-dlp for the video portion of the download.

We repeat the process for the audio file listings by grepping for lines containing the m4a format extension. Again, we print the ID and TBR bitrate columns, sorting and extracting the related ID for the audio file with the highest bitrate.

We pass both the high quality video and audio IDs to yt-dlp for downloading. yt-dlp automagically merges these two files to produce a finalized MP4.

You could modify the grep and awk statements any other preferred video format, but this bash script works for downloading lectures I can natively watch and listen to on OSX. Here's the default yt-dlp package listing the available video formats, and below is an example of our Bash script that uses yt-dlp to help us extract the highest quality AVC1 MP4 files and make a portable, high quality video.

% yt-dlp -F "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
[youtube] Extracting URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
[youtube] dQw4w9WgXcQ: Downloading webpage
[youtube] dQw4w9WgXcQ: Downloading tv client config
[youtube] dQw4w9WgXcQ: Downloading player 6b3caec8
[youtube] dQw4w9WgXcQ: Downloading tv player API JSON
[youtube] dQw4w9WgXcQ: Downloading ios player API JSON
[youtube] dQw4w9WgXcQ: Downloading m3u8 information
[info] Available formats for dQw4w9WgXcQ:
ID  EXT   RESOLUTION FPS CH │   FILESIZE   TBR PROTO │ VCODEC          VBR ACODEC      ABR ASR MORE INFO
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
sb3 mhtml 48x27        0    │                  mhtml │ images                                  storyboard
sb2 mhtml 80x45        1    │                  mhtml │ images                                  storyboard
sb1 mhtml 160x90       1    │                  mhtml │ images                                  storyboard
sb0 mhtml 320x180      1    │                  mhtml │ images                                  storyboard
233 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
234 mp4   audio only        │                  m3u8  │ audio only          unknown             [en] Default
249 webm  audio only      2 │    1.18MiB   46k https │ audio only          opus        46k 48k [en] low, webm_dash
250 webm  audio only      2 │    1.55MiB   61k https │ audio only          opus        61k 48k [en] low, webm_dash
140 m4a   audio only      2 │    3.27MiB  130k https │ audio only          mp4a.40.2  130k 44k [en] medium, m4a_dash
251 webm  audio only      2 │    3.28MiB  130k https │ audio only          opus       130k 48k [en] medium, webm_dash
602 mp4   256x144     13    │ ~  2.04MiB   81k m3u8  │ vp09.00.10.08   81k video only
269 mp4   256x144     25    │ ~  3.95MiB  156k m3u8  │ avc1.4D400C    156k video only
160 mp4   256x144     25    │    1.78MiB   70k https │ avc1.4d400c     70k video only          144p, mp4_dash
603 mp4   256x144     25    │ ~  3.88MiB  154k m3u8  │ vp09.00.11.08  154k video only
278 webm  256x144     25    │    2.29MiB   91k https │ vp9             91k video only          144p, webm_dash
394 mp4   256x144     25    │    1.41MiB   56k https │ av01.0.00M.08   56k video only          144p, mp4_dash
229 mp4   426x240     25    │ ~  5.73MiB  227k m3u8  │ avc1.4D4015    227k video only
133 mp4   426x240     25    │    2.88MiB  114k https │ avc1.4d4015    114k video only          240p, mp4_dash
604 mp4   426x240     25    │ ~  7.26MiB  287k m3u8  │ vp09.00.20.08  287k video only
242 webm  426x240     25    │    3.72MiB  147k https │ vp9            147k video only          240p, webm_dash
395 mp4   426x240     25    │    2.77MiB  109k https │ av01.0.00M.08  109k video only          240p, mp4_dash
230 mp4   640x360     25    │ ~ 12.09MiB  478k m3u8  │ avc1.4D401E    478k video only
134 mp4   640x360     25    │    5.42MiB  214k https │ avc1.4d401e    214k video only          360p, mp4_dash
18  mp4   640x360     25  2 │ ≈  8.68MiB  343k https │ avc1.42001E         mp4a.40.2       44k [en] 360p
605 mp4   640x360     25    │ ~ 14.26MiB  564k m3u8  │ vp09.00.21.08  564k video only
243 webm  640x360     25    │    6.32MiB  250k https │ vp9            250k video only          360p, webm_dash
396 mp4   640x360     25    │    4.85MiB  192k https │ av01.0.01M.08  192k video only          360p, mp4_dash
231 mp4   854x480     25    │ ~ 16.69MiB  660k m3u8  │ avc1.4D401E    660k video only
135 mp4   854x480     25    │    8.28MiB  328k https │ avc1.4d401e    328k video only          480p, mp4_dash
606 mp4   854x480     25    │ ~ 19.74MiB  781k m3u8  │ vp09.00.30.08  781k video only
244 webm  854x480     25    │    8.92MiB  353k https │ vp9            353k video only          480p, webm_dash
397 mp4   854x480     25    │    8.18MiB  324k https │ av01.0.04M.08  324k video only          480p, mp4_dash
232 mp4   1280x720    25    │ ~ 28.59MiB 1131k m3u8  │ avc1.4D401F   1131k video only
136 mp4   1280x720    25    │   16.01MiB  633k https │ avc1.4d401f    633k video only          720p, mp4_dash
609 mp4   1280x720    25    │ ~ 29.81MiB 1180k m3u8  │ vp09.00.31.08 1180k video only
247 webm  1280x720    25    │   14.65MiB  580k https │ vp9            580k video only          720p, webm_dash
398 mp4   1280x720    25    │   14.98MiB  593k https │ av01.0.05M.08  593k video only          720p, mp4_dash
270 mp4   1920x1080   25    │ ~123.87MiB 4902k m3u8  │ avc1.640028   4902k video only
137 mp4   1920x1080   25    │   76.46MiB 3025k https │ avc1.640028   3025k video only          1080p, mp4_dash
614 mp4   1920x1080   25    │ ~ 71.55MiB 2831k m3u8  │ vp09.00.40.08 2831k video only
248 webm  1920x1080   25    │   39.24MiB 1552k https │ vp9           1552k video only          1080p, webm_dash
399 mp4   1920x1080   25    │   27.67MiB 1095k https │ av01.0.08M.08 1095k video only          1080p, mp4_dash
616 mp4   1920x1080   25    │ ~144.16MiB 5704k m3u8  │ vp09.00.40.08 5704k video only          Premium
% ./yt.sh "https://www.youtube.com/watch?v=dQw4w9WgXcQ" 
Processing single video...
Fetching available formats for video: https://www.youtube.com/watch?v=dQw4w9WgXcQ
Selected video format: 270 [MP4 : AVC1 preferred]
Selected audio format: 140 [M4A : highest quality]
Downloading video with yt-dlp...
[youtube] Extracting URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
[youtube] dQw4w9WgXcQ: Downloading webpage
[youtube] dQw4w9WgXcQ: Downloading tv client config
[youtube] dQw4w9WgXcQ: Downloading player 6b3caec8
[youtube] dQw4w9WgXcQ: Downloading tv player API JSON
[youtube] dQw4w9WgXcQ: Downloading ios player API JSON
[youtube] dQw4w9WgXcQ: Downloading m3u8 information
[info] dQw4w9WgXcQ: Downloading 1 format(s): 270+140
[hlsnative] Downloading m3u8 manifest
[hlsnative] Total fragments: 39
[download] Destination: Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f270.mp4
[download] 100% of   78.70MiB in 00:00:27 at 2.83MiB/s
[download] Destination: Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f140.m4a
[download] 100% of    3.27MiB in 00:00:00 at 4.50MiB/s
[Merger] Merging formats into "Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].mp4"
Deleting original file Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f270.mp4 (pass -k to keep)
Deleting original file Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].f140.m4a (pass -k to keep)

% exiftool Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-\[dQw4w9WgXcQ\].mp4
ExifTool Version Number         : 13.10
File Name                       : Rick_Astley_-_Never_Gonna_Give_You_Up_Official_Music_Video-[dQw4w9WgXcQ].mp4
Directory                       : .
File Size                       : 84 MB
File Modification Date/Time     : 2024:05:30 01:43:41-04:00
File Access Date/Time           : 2024:05:30 01:43:41-04:00
File Inode Change Date/Time     : 2025:03:15 19:30:18-04:00
File Permissions                : -rw-r--r--
File Type                       : MP4
File Type Extension             : mp4
MIME Type                       : video/mp4
Major Brand                     : MP4 Base Media v1 [IS0 14496-12:2003]
Minor Version                   : 0.2.0
Compatible Brands               : isom, iso2, avc1, mp41
Movie Header Version            : 0
Create Date                     : 0000:00:00 00:00:00
Modify Date                     : 0000:00:00 00:00:00
Time Scale                      : 1000
Duration                        : 0:03:32
Preferred Rate                  : 1
Preferred Volume                : 100.00%
Preview Time                    : 0 s
Preview Duration                : 0 s
Poster Time                     : 0 s
Selection Time                  : 0 s
Selection Duration              : 0 s
Current Time                    : 0 s
Next Track ID                   : 3
Track Header Version            : 0
Track Create Date               : 0000:00:00 00:00:00
Track Modify Date               : 0000:00:00 00:00:00
Track ID                        : 1
Track Duration                  : 0:03:32
Track Layer                     : 0
Track Volume                    : 0.00%
Image Width                     : 1920
Image Height                    : 1080
Graphics Mode                   : srcCopy
Op Color                        : 0 0 0
Compressor ID                   : avc1
Source Image Width              : 1920
Source Image Height             : 1080
X Resolution                    : 72
Y Resolution                    : 72
Bit Depth                       : 24
Color Profiles                  : nclx
Color Primaries                 : BT.709
Transfer Characteristics        : BT.709
Matrix Coefficients             : BT.709
Video Full Range Flag           : Limited
Pixel Aspect Ratio              : 1:1
Buffer Size                     : 0
Max Bitrate                     : 3023409
Average Bitrate                 : 3023409
Video Frame Rate                : 25
Matrix Structure                : 1 0 0 0 1 0 0 0 1
Media Header Version            : 0
Media Create Date               : 0000:00:00 00:00:00
Media Modify Date               : 0000:00:00 00:00:00
Media Time Scale                : 44100
Media Duration                  : 0:03:32
Media Language Code             : eng
Handler Description             : ISO Media file produced by Google Inc.
Balance                         : 0
Audio Format                    : mp4a
Audio Channels                  : 2
Audio Bits Per Sample           : 16
Audio Sample Rate               : 44100
Handler Type                    : Metadata
Handler Vendor ID               : Apple
Encoder                         : Lavf61.7.100
Media Data Size                 : 83529233
Media Data Offset               : 171478
Image Size                      : 1920x1080
Megapixels                      : 2.1
Avg Bitrate                     : 3.15 Mbps
Rotation                        : 0

Footnotes

  1. Advanced Video Coding ↩︎

Monday, March 10, 2025

Subshells in Powershell

Previously, I wrote a post about how it's possible to create a "subshell" in Windows analogous to the subshell feature available in Bash on Linux—because Microsoft Windows doesn't actually have native subshell capability the same way that Linux does. The script below is an improvement on the same previous method of using the .NET System.Diagnostics trick. But this new version correctly redirects the standard output:

$x = New-Object System.Diagnostics.ProcessStartInfo
$x.FileName = "cmd.exe"
$x.Arguments = "/c echo %PATH%"
$x.UseShellExecute = $false
$x.RedirectStandardOutput = $true  
$x.EnvironmentVariables.Remove("Path")
$x.EnvironmentVariables.Add("PATH", "C:\custom\path")
$p = New-Object System.Diagnostics.Process
$p.StartInfo = $x
$p.Start() | Out-Null
$output = $p.StandardOutput.ReadToEnd()
$p.WaitForExit()
Write-Output $output

Real-World Example

$customPath2 = "C:\custom\path\2"

$data = @{
    Path = $customPath2  
    Timestamp = Get-Date
    ProcessID = $PID  
}

$x = New-Object System.Diagnostics.ProcessStartInfo
$x.FileName = "cmd.exe"
$x.Arguments = "/c echo %PATH%"
$x.UseShellExecute = $false
$x.RedirectStandardOutput = $true
$x.RedirectStandardError = $true

$data["SubshellError"] = $stderr

$x.EnvironmentVariables.Remove("Path")
$x.EnvironmentVariables.Add("PATH", $customPath2)

$p = New-Object System.Diagnostics.Process
$p.StartInfo = $x
$p.Start() | Out-Null

$output = $p.StandardOutput.ReadToEnd()
$stderr = $p.StandardError.ReadToEnd() 
$p.WaitForExit()

$data["SubshellOutput"] = $output
$data["SubshellError"] = $stderr

$data
$data

Name                           Value
----                           -----
ProcessID                      11852
Path                           C:\custom\path\2
SubshellOutput                 C:\custom\path\2...
SubshellError
Timestamp                      3/10/2025 7:05:01 PM

Using Python To Access archive.today, July 2025

It seems like a lot of the previous software wrappers to interact with archive.today (and archive.is, archive.ph, etc) via the command-line ...