Earth Notes: On Website Technicals (2024-06)

Updated 2024-06-18 14:07 GMT.
By Damon Hart-Davis.
Tech updates: byte trimming, Googlebot still rogue, bot funnel, defence trimming, attacked, link saturation, slow winter...
My paper was published last month, so one of my jobs this month is helping get the message out there!

2024-06-18: RSS Winter Slower Poll

I am meant to be working through a huge UK government consultation, so I am instead being creative with displacing my attention elsewhere.

I have added a clause (4) that is only active in winter when energy is short; the default cache/retry time becomes just under 1 day. This and clause (2) try to drift clients towards early afternoon for polling.

# Set RSS feed cache time, ie minimum poll interval.
# Give podcast RSS and similar feeds longer expiry out of work hours.
# Usually have any Retry-After match what the expiry would have been.
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # In skipHours.
    # Long enough to jump out of skipHours in one go.
    ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
    Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT
    # Reduce Atom feed processing at night too.
    ExpiresByType application/atom+xml "access plus 10 hours 7 minutes"
</If>
<ElseIf "%{TIME_HOUR} -lt 12">
    # Coming up to the open hour at 12XXZ.
    # Normal general expiry time of ~4h (>>1h).
    # But try to funnel bad bots getting 429/503 to the noon 'open' slot.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
    Header always set Retry-After "3420" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<ElseIf "%{TIME_HOUR} -gt 17">
    # Coming up to the start of skipHours.
    # Jump expiry right over coming skipHours block.
    ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
    Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<ElseIf "%{TIME_MON} -lt 3 || %{TIME_MON} -gt 10">
    # Winter: defer next poll just under 1d: creep back to ~noon.
    ExpiresByType application/rss+xml "access plus 23 hours 57 minutes"
    Header always set Retry-After "86220" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<Else>
    # Give podcast RSS and similar feeds a default expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
    Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT
</Else>

Note that this is also selectively applied to Atom feeds, even sitemap.atom, in night-time skipHours.

2024-06-17: RSS Traction?

In “A saving bandwidth special!”, Podnews has done a good write-up of the nonsense that I have been banging on about! I hope that James is OK with me lifting this big block of text, but he is right on the nose (my emphasis):

Spotted by Tom Rossi of Buzzsprout, Spotify has implemented “conditional GET requests” for RSS feeds using Etags. According to Tom Rossi of Buzzsprout, this has already resulted in 87% less RSS bandwidth being consumed for Spotify and Buzzsprout.

Many podcast hosts support an Etag in RSS headers, which only changes when the file itself has changed. In our RSS feed, it’s worked out using the last update time. Spotify sends the last Etag it saw in an if-none-match request; if it’s the same as the current Etag we see, we just give them a “304” code (“not modified”) in return, and do not generate and send the whole RSS feed. This saves processing power for both of us, as well as significant bandwidth: we get 206 checks a day from Spotify to our 32KB RSS feed, but only publish once a day. This change will result in savings for us of 99.5% of the 6.5MB RSS bandwidth we use to Spotify alone. (That’s just for one show!)

Not every podcast directory supports if-none-match. AntennaPod, PodcastRepublic, Gaana, Amazon, Slack, PodcastAddict and many others don’t send that header, resulting in significantly more bandwidth for us (and for them). We’re thankful for Spotify for turning it on.

Another way to save bandwidth on RSS feeds is to ensure they use gzip or brotli compression. RSS feeds are highly compressible, and compressing our podcast feeds should save 89% of our RSS bandwidth bill. Surprisingly, not every podcast host supports compression; but most do. If you’re a podcast host, you can test over here with the URL of one of your RSS feeds.

Podnews wasn’t supporting Gzip for our RSS feed (a mistake); it was accounting for 3.86GB of data per day. We turned it on last week; our bandwidth bill for our RSS feed has more than halved, and is now 1.22GB. We also now fetch feeds with gzip where we can.

Brotli is better than Gzip. As of today, Podnews supports both Brotli and Gzip compression. Cloudfront supports both.

A third way to save bandwidth is to use the alternateEnclosure in Podcasting 2.0 RSS feeds to offer a lower-bitrate version of audio to podcast players that support it. PodLP does; by default, it uses the Opus version of our audio (762KB) rather than our mp3 (7.1MB). That means we (and listeners in developing countries) get an 89% saving in audio bandwidth, unless they choose to get the MP3 version.

You’ll need to generate the additional low-bitrate version, of course. We do that automatically on publishing.

Hurrah!

I sent James an email with something that he might add in future: Note that on Apache 2.4 as server, ETag basically does not work without huge care and/or messy workarounds. So Last-Modified / If-Modified-Since is the thing to use.
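
One blunt workaround, sketched here and not necessarily what EOU itself does, is to stop Apache emitting ETags at all, so that well-behaved clients fall back to If-Modified-Since against Last-Modified:

# Sketch: suppress ETags entirely so that conditional GETs
# rely on Last-Modified / If-Modified-Since instead.
FileETag None
Header unset ETag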

2024-06-15: Continuous Attack

My tiny host looking after the Thermino is under noticeable strain this morning. Nearly 200k entries in the auth.log over the last week or so:

...
Jun 15 07:29:05 pekoe sshd[2660]: Disconnected from authenticating user root 106.14.149.85 port 54710 [preauth]
Jun 15 07:29:05 pekoe sshd[2662]: Disconnected from authenticating user root 43.159.145.80 port 55926 [preauth]
Jun 15 07:29:05 pekoe sshd[2658]: Received disconnect from 111.70.13.212 port 35544:11: Bye Bye [preauth]
Jun 15 07:29:05 pekoe sshd[2658]: Disconnected from authenticating user root 111.70.13.212 port 35544 [preauth]
Jun 15 07:29:06 pekoe sshd[2673]: Invalid user guest from 167.71.229.36 port 48808
Jun 15 07:29:06 pekoe sshd[2672]: Invalid user debian from 138.204.127.54 port 44399
Jun 15 07:29:06 pekoe sshd[2673]: Received disconnect from 167.71.229.36 port 48808:11: Bye Bye [preauth]
Jun 15 07:29:06 pekoe sshd[2673]: Disconnected from invalid user guest 167.71.229.36 port 48808 [preauth]
Jun 15 07:29:06 pekoe sshd[2672]: Received disconnect from 138.204.127.54 port 44399:11: Bye Bye [preauth]
Jun 15 07:29:06 pekoe sshd[2672]: Disconnected from invalid user debian 138.204.127.54 port 44399 [preauth]
Jun 15 07:29:07 pekoe sshd[2677]: Invalid user lixiang from 43.135.134.197 port 56530
Jun 15 07:29:07 pekoe sshd[2659]: Invalid user nokia from 119.28.118.4 port 46672
Jun 15 07:29:07 pekoe sshd[2676]: Invalid user moderator from 138.97.64.134 port 38692
Jun 15 07:29:07 pekoe sshd[2677]: Received disconnect from 43.135.134.197 port 56530:11: Bye Bye [preauth]
Jun 15 07:29:07 pekoe sshd[2677]: Disconnected from invalid user lixiang 43.135.134.197 port 56530 [preauth]
Jun 15 07:29:07 pekoe sshd[2676]: Received disconnect from 138.97.64.134 port 38692:11: Bye Bye [preauth]
Jun 15 07:29:07 pekoe sshd[2676]: Disconnected from invalid user moderator 138.97.64.134 port 38692 [preauth]
...

Outbound bandwidth

I asked someone with a ~1Gbps connection to try pulling down a large (~400MB) file from EOU, not expecting it to hit my 20Mbps outbound bandwidth limit. He saw a touch over 2MBps (~3 minutes), ie roughly 17Mbps, so it seems that the RPi3B can pretty much saturate the outbound link (16WW has an 80/20 FTTC connection).

2024-06-17: max Mbps

Testing locally, ie without leaving the LAN, best of a couple of tries:

sencha% wget -O /dev/null http://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null           100%[===================>] 396.06M  10.6MB/s    in 38s
sencha% wget -O /dev/null https://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null           100%[===================>] 396.06M  10.6MB/s    in 38s
macbook2G% wget -O /dev/null https://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null           100%[===================>] 396.06M  1.37MB/s    in 4m 47s

The MacBook Air's connection above is over 2.4GHz WiFi, via the router; not even reaching 20Mbps over the LAN!

Trying again over 5GHz WiFi ... much better:

macbook5G% wget -O /dev/null https://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null           100%[===================>] 396.06M  10.8MB/s    in 39s

It seems as if the RPi3B could maybe support ~10MBps (~80Mbps) outbound, with no visible encryption penalty.

2024-06-11: Feed Defence Trimming

To make it more reliable to sign up as a new podcast user/listener, or to make an occasional unconditional poll of the feed, I have tweaked things so that you should only get a 429 (or 503) outside skipHours if you are seen to be a bad bot from recent days' traffic.

You may still get randomly pushed back at any time with a 406 for not allowing compression (which would save bandwidth), whether a known bad bot or not, but that is easy to avoid (allow gzip!) and a good thing to fix.
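
For illustration only, a much-simplified sketch of that kind of push-back (without the random element, and not the actual rule in use here) might be:

# Sketch only: reject feed fetchers that will not accept gzip, with a 406.
# (The real rule also has a random element.)
RewriteCond %{HTTP:Accept-Encoding} !gzip [NC]
RewriteRule "^/rss/.*\.rss$" - [L,R=406]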

During skipHours you may still get a 429, bad bot or not, if the battery is low or grid carbon intensity is high.

This may warrant reducing the threshold to allow less than one hit per hour at some point, to try to steer those bots to be more climate friendly.

2024-06-12: 5 per hour

As of early this morning (~06:00Z), so still within skipHours, and with the 429 and 503 responses filtered out (leaving almost only 200s and 304s, it seems), I was seeing about five requests per hour get through from a variety of sources. That 'good' residue may be acceptable for an audience of the ~50 listens that op3.dev claimed for May.

% egrep rss/podcast /var/log/apache2/other_vhosts_access.log | egrep -v '" ((503)|(429)) ' | tail
[12/Jun/2024:05:51:18 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11924 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[12/Jun/2024:05:56:19 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11924 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[12/Jun/2024:05:56:33 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 250 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[12/Jun/2024:06:00:07 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 113 "-" "PodcastAddict/v5 (+https://podcastaddict.com/; Android podcast app)"
[12/Jun/2024:06:07:33 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12024 "-" "Mozilla/5.0 (compatible; FlipboardRSS/1.2; +http://flipboard.com/browserproxy)"
[12/Jun/2024:06:46:14 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11924 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[12/Jun/2024:06:52:01 +0000] "GET /rss/podcast.rss HTTP/1.1" 301 3956 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[12/Jun/2024:07:00:36 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 194 "-" "Aggrivator (PodcastIndex.org)/v0.1.7"
[12/Jun/2024:07:15:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3456 "-" "TPA/1.0.0"
[12/Jun/2024:07:25:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 250 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"

2024-06-10: Feed Bot Funnel

It looks as if Podbean may pay some attention to 503 responses. It made no requests from about 02:00Z until about 11:00Z (bliss!), after solid 503 responses for 4h from 22:00Z. I might force 503s at such times on any client making a significant number of skipHours requests, since such a client is presumably ignoring gentler rejections such as 429s.

Partly to help bad bots recover, I have added a pre-noon funnel to steer activity towards the noon 'open' slot, with a just-under-1h retry time for any 429 and 503 recipients that pay attention to Retry-After.
    <ElseIf "%{TIME_HOUR} -lt 12">
        # Normal general expiry time (>>1h).
        # But try to funnel bad bots getting 429/503 to the noon 'open' slot.
        ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
        # For 429.
        Header always set Retry-After "3420" env=REDIRECT_RSS_RATE_LIMIT
        # For 503.
        Header always set Retry-After "3420" env=RSS_RATE_LIMIT
    </ElseIf>

It occurs to me that, rather than the extra processing to hand 503s to especially bad bots, especially in skipHours, a stateless mechanism could probably switch randomly between returning 503 and 429 in hourly blocks. An hourly block is long enough that a retry after a minute, as is typical for these bots, will usually get the same response code again. If the bad bot pays attention to the Retry-After in either, or just backs off more anyway, then the result would probably be similar. This could be done in the main 429 skipHours block, with careful use of Skip. That would reduce the amount of processing potentially done for every request.
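
By way of illustration only (the existing skipHours and greedy-bot conditions are omitted, and hour parity stands in for true randomness), such a stateless alternation might look roughly like:

# Sketch: statelessly alternate the push-back status code by hour of day.
# Even hours answer 503, odd hours 429; a typical ~1 minute retry will
# usually land in the same hour and so see the same code again.
RewriteCond "%{TIME_HOUR}" "[02468]$"
RewriteRule "^/rss/.*\.rss$" - [L,R=503,E=RSS_RATE_LIMIT:1]
RewriteCond "%{TIME_HOUR}" "[13579]$"
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]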

Other 304s

I note from the GSC that nearly half of the requests to the m-dot EOU site are receiving 304 responses. In part this may be down to understanding the whole Apache ETag bug better. Also, I have resisted site-wide updates recently, with most such pages untouched since .

2024-06-08: Googlebot Still Rogue

Googlebot is still fetching the RSS podcast feed file nearly 1000 times per day on average. Next is iTunes, previously the villain of the piece, at ~200.

I have tweaked the greedy-bot MD5 flag files so that those for the top three bots by hits are non-zero size, which potentially allows stiffer action against them in the Apache configuration.
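
For reference, those flag files are keyed on the MD5 of the exact User-Agent string (see the RewriteCond in the new config below), so marking a bot for stiffer treatment amounts to something like the following from the shell, assuming a lowercase-hex digest to match Apache's md5 function ($DOCROOT here is just a stand-in for the real document root):

% UA='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
% echo greedy > "$DOCROOT"/rss/greedybot/$(printf '%s' "$UA" | md5sum | cut -d' ' -f1).flag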

I intend to return 503s rather than 429s (with the same Retry-After value) to those bad bots, during skipHours at least. If this means that Googlebot also spiders the rest of the EOU site less at such times, that might be a good thing. I have already explicitly configured Bing to ease up crawling overnight.

Google Search Console (GSC) does not seem to show 429s, so maybe they are handled (possibly badly) at a lower level in the code.

Initial (daylight) testing suggests that both Podbean and Googlebot also come back within about 60 seconds of a 503, so this extra complexity and tier of push-back may not achieve anything, other than slowing down all requests a little... Note that the Retry-After header was not being set for the 503 responses.

[08/Jun/2024:17:04:51 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:05:04 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:05:08 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3335 "-" "iTMS"
[08/Jun/2024:17:05:09 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3335 "-" "iTMS"
[08/Jun/2024:17:05:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 15450 "-" "itms"
[08/Jun/2024:17:05:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 860 "-" "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio"
[08/Jun/2024:17:05:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:06:07 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:07:09 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:07:34 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4259 "-" "Spotify/1.0"
[08/Jun/2024:17:07:40 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:07:48 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:08:40 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:08:51 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:09:41 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:09:54 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:10:41 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:10:57 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:12:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:12:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3599 "-" "Mozilla/5.0"

The new config is in part:

RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
# Bot's hashed UA appears in flags dir and is non-zero size?
RewriteCond expr "-s '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'"
RewriteRule "^/rss/.*\.rss$" - [L,R=503,E=RSS_RATE_LIMIT:1]
#...
    <If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
        # This should be long enough to jump out of skipHours in one go.
        ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
        # For 429.
        Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT
        # For 503.
        Header always set Retry-After "36420" env=RSS_RATE_LIMIT
    </If>
    <ElseIf "%{TIME_HOUR} -gt 17">
        # Jump expiry right over coming skipHours block.
        ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
        # For 429.
        Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT
        # For 503.
        Header always set Retry-After "50620" env=RSS_RATE_LIMIT
    </ElseIf>
    <Else>
        # Give podcast RSS and similar feeds a default expiry time of ~4h.
        ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
        # For 429.
        Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT
        # For 503.
        Header always set Retry-After "14720" env=RSS_RATE_LIMIT
    </Else>

A quick verification (with the time limits temporarily adjusted):

% wget -U "iTMS" -S -O /dev/null --compress=auto https://www.earth.org.uk/rss/podcast.rss
--2024-06-08 21:30:53--  https://www.earth.org.uk/rss/podcast.rss
Resolving www.earth.org.uk (www.earth.org.uk)... 79.135.97.78
Connecting to www.earth.org.uk (www.earth.org.uk)|79.135.97.78|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 503 Service Unavailable
  Date: Sat, 08 Jun 2024 20:30:54 GMT
  Server: Apache
  Retry-After: 50620
  Content-Length: 299
  Connection: close
  Content-Type: text/html; charset=iso-8859-1
2024-06-08 21:30:54 ERROR 503: Service Unavailable.

I have this evening directly reported the issue (as well as via the GSC console and to developer/SEO liaison folks):

Googlebot is generally sensible, but is pulling rss/podcast.rss every minute, even if given a 429 or 503 response code with a long Retry-After delta-seconds header.

2024-06-09: continued

Even with a Retry-After of several hours on the 503s, Googlebot is just as insistent:

[09/Jun/2024:05:36:38 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:36:42 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4259 "-" "Spotify/1.0"
[09/Jun/2024:05:37:38 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:38:39 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:39:05 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3456 "-" "gPodder/3.11.4 (+http://gpodder.org/) Windows"
[09/Jun/2024:05:39:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 860 "-" "Amazon Music Podcast"
[09/Jun/2024:05:39:39 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:41:11 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 113 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[09/Jun/2024:05:41:51 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:42:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

And Podbean is differently broken, maybe less bad than with 429s (each block of 4 fetches is from a different IP address):

[09/Jun/2024:02:03:51 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:02:04:53 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:02:05:56 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:02:06:59 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
...
[09/Jun/2024:04:04:51 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:04:05:54 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 389 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:04:06:57 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 389 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:04:07:59 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 389 "-" "Podbean/FeedUpdate 2.1"

Note that this is distinct from the somewhat better-behaved FeedFetcher-Google:

[06/Jun/2024:02:55:36 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13424 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:07:23:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 301 3956 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:07:23:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4144 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:08:23:50 +0000] "GET /rss/podcast.rss HTTP/1.1" 301 3956 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:08:23:51 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13424 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[09/Jun/2024:09:34:44 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 10005 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"

2024-06-03: Bytes Trimmed

While I would like the error page for 404 to be human friendly, it is a shame to waste bandwidth on errors that humans will almost never see, and which may not even be compressed on the wire!

(Though I would like any techies who do see those pages to be helped by them.)

I just realised that the mobile versions are still more or less styled right even for the desktop site, and are smaller:

% ls -alS {,m/}429.html{,gz,br}
1025 429.html
 874 m/429.html
 578 429.htmlgz
 527 m/429.htmlgz
 416 429.htmlbr
 385 m/429.htmlbr

So for minimal extra effort I have told Apache to use the mobile versions for 406 and 429:

# Serve the 'mobile' versions of some error pages to save a few bytes.
# (Often no human will look at them anyway.)
# Custom 404 page.
ErrorDocument 404 /404.html
# Custom 406 page.
ErrorDocument 406 /m/406.html
# Custom 429 page.
ErrorDocument 429 /m/429.html

At least in some cases this will be defeated by anti-traffic-analysis padding in TLS.

Which would all be lovely, except that it does not work: the error response gets turned into a 301 permanent redirect to the specified HTML page instead. Close, but no cigar!

So I have done the boring thing instead, and trimmed the messages to a minimum!

% ls -alS {,m/}429.html{,gz,br}
881 429.html
730 m/429.html
505 429.htmlgz
452 m/429.htmlgz
392 429.htmlbr
327 m/429.htmlbr

Here is a before and after for Googlebot; 24 bytes of Brotli-ed response saved, nominally:

[03/Jun/2024:16:19:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
[03/Jun/2024:16:22:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4144 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
~1758 words.