Earth Notes: On Website Technicals (2024-04)

Updated 2024-05-06 16:23 GMT.
By Damon Hart-Davis.
Tech updates: ORCID, RSS work storage, podcast images, transcripts, Apache 2.4 ETag bug, 406 and more 429, less AMP, cacheing tweaks.
I am struggling a bit with progressing my PhD currently, but now I have global RSS efficiency as a new side-quest to ensure that I remain appropriately distracted...

2024-04-30: 429 then 406

For EOU RSS traffic I have rearranged the defences to return a 429 in preference to a 406 if both are applicable, since a 429 does seem to slow down Amazon, for example, and the Retry-After header may give more control than a 406 with better-behaved clients.

The iTunes iTMS bot continues to get 406s for now (before 07:00Z, battery and grid OK). But a no-User-Agent bot has just received a 429 where previously it might have had a 406.
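A minimal sketch of that precedence in Python, not the Apache rules themselves. The function and parameter names are made up for illustration, and the conditions are my reading of the RewriteCond chains quoted further down this page (before any later additions to the list of 429 sins):

```python
from typing import Optional


def in_skip_hours(hour_utc: int) -> bool:
    """True inside the overnight skipHours block (before 08:00Z or from 22:00Z)."""
    return hour_utc < 8 or hour_utc > 21


def choose_status(hour_utc: int, referer: str, conditional: bool,
                  accepts_gzip: bool, user_agent: str,
                  battery_low: bool, grid_red: bool) -> Optional[int]:
    """Return 429, 406, or None (serve normally) for an RSS feed request."""
    # Both defences only target no-Referer, unconditional fetches.
    if referer or conditional:
        return None
    anonymous = not user_agent
    # 429 is tested first so that it wins when both would apply.
    if in_skip_hours(hour_utc) and (anonymous or battery_low or grid_red):
        return 429
    if not accepts_gzip and (in_skip_hours(hour_utc) or grid_red or anonymous):
        return 406
    return None


# iTMS before 07:00Z with battery and grid OK, no gzip: still a 406.
print(choose_status(6, "", False, False, "iTMS", False, False))  # 406
# A no-User-Agent bot making the same request: a 429 rather than a 406.
print(choose_status(6, "", False, False, "", False, False))  # 429
```

The ordering of the two tests is the whole point: swapping them would hand the anonymous bot a 406 with no Retry-After hint.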

I have also added refusing gzip compression to the list of sins that may result in a 429 during skipHours.

2024-04-29: RSS Stats

The logs have rolled; time for new stats:

% sh ./prepareStats.sh
INFO: /tmp/stats.out/interval.txt: 2024-04-21T06:25:13 to 2024-04-29T06:25:10 inclusive log data
...
11477 829577444 02
11539 779247358 03
9576 766272405 04
INFO: /tmp/stats.out/siteHitsByHour.log: site hits by hour (UTC)...
4816 326837195 00
4965 227080097 01
3814 158796652 02
5100 223972908 03
3961 168507529 04
INFO: /tmp/stats.out/feedHits.log: RSS feed hits...
www.earth.org.uk:443 17.58.X.X - - [21/Apr/2024:06:25:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
www.earth.org.uk:443 17.58.X.X - - [21/Apr/2024:06:25:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
www.earth.org.uk:443 162.19.X.X - - [21/Apr/2024:06:25:41 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 124 "-" "Wget/1.21.3"
www.earth.org.uk:443 104.237.X.X - - [21/Apr/2024:06:29:35 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3421 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=XXXX; +http://overcast.fm/)"
www.earth.org.uk:80 54.200.X.X - - [21/Apr/2024:06:30:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11547 "-" "Amazon Music Podcast"
INFO: /tmp/stats.out/feedHitsByUA.log: feed hits by UA...
9632 102051873 ALL
2318 26378397 "Amazon Music Podcast"
1696 25305017 "iTMS"
1202 5814066 "Spotify/1.0"
640 5829187 "Podbean/FeedUpdate 2.1"
INFO: /tmp/stats.out/feedHitsByHour.log: feed hits by hour (UTC)...
384 2484922 00
391 2633980 01
380 2889315 02
405 3060896 03
317 2103272 04
INFO: /tmp/stats.out/feedStatusByUA.log: feed hits and status by UA...
9632 102051873 200:304:406:429 6441 1207 1518 429 ALL
2318 26378397 200:304:406:429 2243 0 4 71 "Amazon Music Podcast"
1696 25305017 200:304:406:429 672 6 1014 0 "iTMS"
1202 5814066 200:304:406:429 576 588 0 38 "Spotify/1.0"
640 5829187 200:304:406:429 501 0 0 139 "Podbean/FeedUpdate 2.1"
INFO: /tmp/stats.out/feedStatusByHour.log: feed hits and status by hour (UTC)...
384 2484922 200:304:406:429 191 51 90 52 00
391 2633980 200:304:406:429 199 56 89 44 01
380 2889315 200:304:406:429 229 50 92 9 02
405 3060896 200:304:406:429 251 50 96 8 03
317 2103272 200:304:406:429 167 44 86 19 04

Spotify and the 'lite' feed were added during this stats interval.

The 406 and 429 defences are trimming some waste. Most of the iTunes (iTMS) requests are being rejected with 406, and without a detailed check I suspect that those lonely six 304s were actually a feed validator pretending to be a well-behaved version of it! Amazon does back off somewhat when fed 429s, which is good, though not enough.

The summary line from feedStatusByHour.log demonstrates the waste. In this 8-day interval no new podcast episodes were added, I fiddled with (eg re-arranged) metadata at most a handful of times, there are still at most a handful of listeners, and yet the feed file was polled 9632 times, the vast majority of those unconditionally (some of which have been rejected with 406/429).
9632 102051873 200:304:406:429 6441 1207 1518 429 ALL

A more sensible result, for 20 imperfect clients, and one feed change per day, rather than just for less than monthly appearances of new episodes, might have been (cue dreamy music):

240 2000000 200:304:406:429 120 120 0 0 ALL

The feed file is ~100kB uncompressed and ~11kB gzip compressed (~9kB br compressed), and there is per-request overhead.
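To put rough numbers on the gap between the real and the dreamy summary lines, here is a small Python sketch that parses the two lines quoted above. The parser and its field names are my own illustration, not part of prepareStats.sh:

```python
def parse_status_line(line: str) -> dict:
    """Parse 'hits bytes 200:304:406:429 n200 n304 n406 n429 label'."""
    parts = line.split(maxsplit=7)
    codes = parts[2].split(":")
    counts = dict(zip(codes, (int(n) for n in parts[3:7])))
    return {
        "hits": int(parts[0]),
        "bytes": int(parts[1]),
        "counts": counts,
        "label": parts[7],
    }


actual = parse_status_line("9632 102051873 200:304:406:429 6441 1207 1518 429 ALL")
dream = parse_status_line("240 2000000 200:304:406:429 120 120 0 0 ALL")

# Share of all polls ending in each listed status.
for code, n in actual["counts"].items():
    share = 100 * n / actual["hits"]
    print(f"{code}: {n} ({share:.1f}% of hits)")

print(f"mean bytes per hit: {actual['bytes'] / actual['hits']:.0f}")
print(f"polls vs dream: {actual['hits'] / dream['hits']:.0f}x")
print(f"bytes vs dream: {actual['bytes'] / dream['bytes']:.0f}x")
```

The four listed status counts do not quite sum to the hit total, so a few responses presumably had other status codes.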

Here is one client, Feeder, with at least a couple of separate users, showing near-optimal behaviour:

14 80589 200:304:406:429 7 7 0 0 "SpaceCowboys Android RSS Reader / 2.6.22(307)"

Feeder got an upgrade during the sampling interval, so there is also a fab all-304s:

3 315 200:304:406:429 0 3 0 0 "SpaceCowboys Android RSS Reader / 2.6.23(308)"

2024-04-28: Cacheing Tweak

I have modified the caching for RSS feeds to be 4h7 by default, but 10h7 within the skipHours block so that a client jumps out of the block in one go, and 14h7 in the hours just before the block starts so that a client jumps right over it, to reduce wasted polls.

A feed consumer correctly following cacheing should barely need to implement skipHours when they form a single large block like this and the consumer is fairly continuously connected, though skipHours remains useful declaratively.

# Set cache time, ie minimum poll interval.
# Give podcast RSS and similar feeds longer expiry out of work hours.
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
# This should be long enough to jump out of skipHours in one go.
ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
</If>
<ElseIf "%{TIME_HOUR} -gt 17">
# Jump expiry right over coming skipHours block.
ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
</ElseIf>
<Else>
# Give podcast RSS and similar feeds a default expiry time of ~4h.
ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>
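As a sanity check on those three branches, here is a Python sketch (my own model of the config above, with hypothetical function names) that confirms a cache-respecting client served by either long-expiry branch always polls next outside the skipHours block:

```python
def in_skip_hours(hour: int) -> bool:
    """True inside the skipHours block: before 08:00Z or from 22:00Z."""
    return hour < 8 or hour > 21


def expiry_minutes(hour: int) -> int:
    """Mirror the <If>/<ElseIf>/<Else> branches: cache lifetime in minutes."""
    if in_skip_hours(hour):
        return 10 * 60 + 7
    if hour > 17:
        return 14 * 60 + 7
    return 4 * 60 + 7


def next_poll(minute_of_day: int) -> int:
    """Minute of day (UTC) at which a cache-respecting client polls next."""
    return (minute_of_day + expiry_minutes(minute_of_day // 60)) % (24 * 60)


# Check every minute governed by the two long-expiry branches.
for t in range(24 * 60):
    hour = t // 60
    if in_skip_hours(hour) or hour > 17:
        assert not in_skip_hours(next_poll(t) // 60), t
print("long-expiry branches always land outside skipHours")
```

The default 4h7 branch can still land inside the block (a fetch at 17:59 expires at 22:06), at which point a single 10h7 expiry carries the client out again.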

Bing activity timing

In the Bing console I have tweaked most of Bing's spidering of my sites to be around noon UTC, when there is most likely to be available solar power (off-grid and grid-tied) to cover the network and CPU load directly rather than from battery.

2024-04-29: Site ExpiresDefault

Looking again at the default expiry time for the desktop site I think that it is too harsh at ~11 days, so I have modified it to stay at that level in winter when conserving energy is key, but to be ~1 day otherwise. Winter could be narrowed to just Dec/Jan.

# Default ~1 day to optimise Cache-Control max-age to 5 digits and for HPACK.
ExpiresDefault "access plus 92222 seconds"
<If "%{TIME_MON} -lt 3 || %{TIME_MON} -gt 10">
# Winter ~11 days to reduce load a little.
ExpiresDefault "access plus 922222 seconds"
</If>
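A tiny Python sketch of the same month test, assuming TIME_MON runs from 1 (Jan) to 12 (Dec); the function name is hypothetical. It shows that the condition as written covers Nov to Feb, and converts the two lifetimes to days:

```python
def default_expiry_seconds(month: int) -> int:
    """Mirror the ExpiresDefault logic above; month is 1 (Jan) to 12 (Dec)."""
    if month < 3 or month > 10:
        return 922222  # winter: Nov, Dec, Jan, Feb
    return 92222


for month in range(1, 13):
    secs = default_expiry_seconds(month)
    print(f"month {month:2d}: {secs} s = {secs / 86400:.2f} days")
```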

Data cacheing

Everything under /data/ has a cache life of 1 day. Much of the data is effectively immutable.

For winter I have made the default 2 days for /data/, using similar code to the above with TIME_MON.

As a next step I am making (usually yearly) solid .xz, and also (typically monthly) .gz, archives get a year's cache life:

<LocationMatch "/data/.*\.(xz|gz|zip)$">
    ExpiresDefault "access plus 1 year"
</LocationMatch>

% find data -name '*.xz' | wc -l
243
% find data -name '*.gz' | wc -l
1102
% find data -name '*.zip' | wc -l
9

I am considering changing the /data/ default to ~30 days, and then reducing back to a day items likely to update more often:

  • any directory, or at least any called live
  • any file whose name contains the current year (or near future years)

% find data | wc -l
15713
% find data -type d | wc -l
238
% find data -name live -type d | wc -l
6
% find data -name '*2024*' | wc -l
1122
% find data -name '*2025*' | wc -l
0
% find data -name '*2026*' | wc -l
0
...
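The policy being considered could be sketched like this in Python. Everything here is hypothetical: the function name, the example paths, and in particular the precedence between the archive rule and the current-year rule, which the notes above do not settle:

```python
from pathlib import PurePosixPath

DAY = 24 * 60 * 60


def data_cache_seconds(path: str, is_dir: bool, current_year: int) -> int:
    """Proposed cache life in seconds for an item under /data/."""
    p = PurePosixPath(path)
    # Archives are treated as effectively immutable (assumed to win over
    # the current-year rule, which is a guess).
    if p.suffix in (".xz", ".gz", ".zip"):
        return 365 * DAY
    # Directory listings may change as entries are added.
    if is_dir:
        return DAY
    # Files named for the current or near-future years may still be updated.
    years = [str(current_year + i) for i in range(3)]
    if any(y in p.name for y in years):
        return DAY
    # Everything else gets the proposed ~30 day default.
    return 30 * DAY


# Made-up paths purely for illustration.
for path, is_dir in [("data/example/2019.csv.xz", False),
                     ("data/example/live", True),
                     ("data/example/2024-04.csv", False),
                     ("data/example/2010.csv", False)]:
    print(path, data_cache_seconds(path, is_dir, 2024) // DAY, "days")
```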

2024-04-27: Big Bad Clients

There seem to be services for some of the biggest tech companies that cannot be bothered to allow even gzip compression (eg Meta's facebookexternalhit), lazily wasting oodles of bandwidth for everyone. For the Gallery I am going to disallow unconditional GETs with 406s where compression is not accepted.

Here is a slightly interesting case in which it would have been nice to have sent a 200...

gallery.hd.org:80 185.15.X.X - - [28/Apr/2024:12:19:21 +0000] "HEAD /_c/places-and-sights/_more2003/_more08/Turkey-Alaja-Huyuk-Hittite-temple-carving-of-two-headed-eagle-with-two-rabbits-in-its-claws-SEW.jpg.html HTTP/1.1" 406 129 "-" "IABot/2.0 (+https://meta.wikimedia.org/wiki/InternetArchiveBot/FAQ_for_sysadmins) (Checking if link from Wikipedia is broken and needs removal)"

2024-04-24: AMP Be Going Moar

While waiting for a train I am trimming a few bits of AMP crud, generating 410s ("gone") in their place:

  • the ancient experimental /ext/e/ img and out bridge in jworkers
  • the ancient /amp/XXX AMP pages view in EOU

Pages with m-dot counterparts are still redirected to them, but with the redirect strength upgraded from 302 (temporary) to 301 (permanent). About 12 such 301 redirects happened in the first half an hour or so after the change... I could instead generate a 410 (gone) if the Referer is absent, ie this appears to be a spider/bot rather than a human, but I already have robots.txt set to forbid all spidering...

2024-04-23: Podcast Lite

I have created a stripped-back item-count-limited 'lite' version of the podcast feed beside the primary. The lite version omits videos and metadata that most readers and aggregators do not use, though they should! The uncompressed 'lite' feed is not much bigger than the compressed full feed.

96985 rss/podcast.rss
11235 rss/podcast.rssgz
 9241 rss/podcast.rssbr
12937 rss/podcast-lite.rss
 2879 rss/podcast-lite.rssgz
 2402 rss/podcast-lite.rssbr

Slightly against my better judgement I have handed this feed to Spotify, only.

Spotify is polling about every 7 minutes, does seem to support at least gzip compression, and is doing conditional GETs. The first is far too fast, like Amazon and Apple; the other two points are good.

(Spotify previously rejected the full feed because it contained two videos. Spotify's automated systems have rejected a couple of episodes that it says are music tracks (they are indeed a couple of my short generative music clips) and thus are not in line with Spotify podcast policy. I have created a new 'music' tag to mark (and exclude) such items. This podcast may not last long on Spotify!)

I have extended the op3.dev enclosure URL prefixes to make it clearer where any download traffic is coming from. This extended URL now also contains the source feed GUID; no personal tracking information.

I may redirect to this feed bots that might otherwise get a 406 for not supporting even gzip compression for the primary feed.

I have adjusted the lite feed to use the lower-fi MP3 (.mp3L) audio where available, and to be a maximum of 10 items. Definitely 'lighter' all round.

96997 rss/podcast.rss
11244 rss/podcast.rssgz
 9251 rss/podcast.rssbr
 9167 rss/podcast-lite.rss
 2272 rss/podcast-lite.rssgz
 1898 rss/podcast-lite.rssbr

(I am using the same guid for the episode/item in the lite feed as in the main one, even though the former uses a different (lo-fi) enclosure. Maybe this is wrong...)

I have also adjusted lite feed page links to point to pages on the lite/m-dot site.

2024-04-22: lastBuildDate

I am switching the RSS feeds from using pubDate at channel (ie top) level to using lastBuildDate.

I doubt that anything much cares, and it is still not logically entirely right, but it is probably better. (It is the more popular in extant feeds, in ~70% of RSS podcast feeds vs ~30% for pubDate, IIRC.)

2024-04-21: ETag Be Gone!

I have disabled ETags for the whole of the EOU site, plus the Gallery and ExNet. I also invoke FileETag none, which should save Apache from even calculating ETag values.

This should enable all on-the-fly compressed material to be cached better, eg including slow-changing directory listings, and data files and logs. It may result in some cache misses from clients that present next time an If-None-Match with no If-Modified-Since.
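With ETags gone, revalidation rests on dates alone. A minimal Python sketch of that comparison follows; it illustrates the HTTP logic rather than Apache's implementation, and the function name and example dates are hypothetical:

```python
from email.utils import parsedate_to_datetime


def not_modified(if_modified_since: str, last_modified: str) -> bool:
    """True if a 304 could be sent for a date-only conditional request."""
    if not if_modified_since:
        # No If-Modified-Since (eg only If-None-Match was sent): with no
        # ETag to compare against, this is treated as unconditional.
        return False
    try:
        client_time = parsedate_to_datetime(if_modified_since)
        resource_time = parsedate_to_datetime(last_modified)
        return resource_time <= client_time
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: ignore the conditional.
        return False


stamp = "Fri, 19 Apr 2024 17:16:00 GMT"
print(not_modified(stamp, stamp))                            # True
print(not_modified("Thu, 18 Apr 2024 09:00:00 GMT", stamp))  # False
print(not_modified("", stamp))                               # False
```

The last case is the cache miss mentioned above: a client presenting only If-None-Match gets a full 200 once, then should pick up Last-Modified for next time.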

To enable Last-Modified for directories I need to add +TrackModified to IndexOptions. Note the caveat in the documentation that "Changes to the size or date stamp of an existing file will not update the Last-Modified header on all Unix platforms." Thus this may only reflect changes (add/delete entries) to the directory itself, which I am happy with. Changes for items in a directory should be tracked on those items; the essence of the directory is the list of entries. As a fallback I could restrict this to just /img which should not have (many) changes to existing files.

(I have disabled the If-None-Match tests in the 406 and 429 rules, since we cannot use those conditionals when the site is not generating ETags.)

AMP be going

While trimming the size of generated error pages (such as for 404s) I also stripped out a bit of AMP complexity and cruft.

Bad wasteful bots

There seem to be wasteful clients (often "Go") that do not implement even gzip encoding/compression. I now treat that as equivalent to Save-Data for main/top pages, ie they cannot have the best stuff, wasting bandwidth! They get the smaller 'lite' pages.

All bona fide browsers and mainstream search engine bots that I know of, including lynx, support gzip encoding/compression.

I am taking this as a prompt to make some 'lite' page versions even smaller, eg:

428905 energy-series-dataset.html
120362 m/energy-series-dataset.html

I have also taken the opportunity to slightly rationalise the code for Save-Data, including not adding the header to Vary unless the outcome actually depended on the presence of Save-Data. That makes me a bit uncomfortable, ie not always adding to Vary for appropriate objects, but it saves some bytes and is probably correct.

2024-04-20: ClaudeBot Be Gone!

When I exclude a badly-behaved bot from a site with robots.txt, that is not an invitation for the bot to re-check every few seconds just in case I want to be friends now.

Given the log-filling and CPU-wasting nonsense below for gallery.hd.org, I have also excluded ClaudeBot from EOU:

[20/Apr/2024:08:07:46 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "http://mirror-us-ga1.gallery.hd.org/robots.txt" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:47 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:49 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:49 +0000] "GET /robots.txt HTTP/1.1" 301 509 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
[20/Apr/2024:08:07:49 +0000] "GET /robots.txt HTTP/1.1" 200 1566 "http://mirror-us-ga1.gallery.hd.org/robots.txt" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"

That is a block of successive hits on all my static sites, though omitting a couple scraping EOU...

ClaudeBot has been greedy busy on my sites lately. As has ChatGPT plus the usual cohort of seeming scrapers and spiders.

I do not particularly object to the AI usage; it is part of the reason for having semantic markup everywhere. But when greedy enough to effectively perform denial of service (DoS), eg obstructing me in my own use of my sites and logs, that is a reason for a robots.txt ban. And those bans tend to be permanent, since it is not obvious when to manually check and trim robots.txt.

(A few hours after sending a slightly angry email about the above to the only anthropic.com email address I could find (the press team) the ClaudeBot activity stopped.)

2024-04-19: Precompressed Podcast RSS

To save a little bandwidth, and CPU time on each fetch, the podcast RSS file now has precompressed Brotli and Gzip (zopfli) versions generated when the RSS feed file is updated.

89914 19 Apr 17:16 rss/podcast.rss
10993 19 Apr 17:16 rss/podcast.rssgz
 9001 19 Apr 17:16 rss/podcast.rssbr
% gzip -6 < rss/podcast.rss | wc -c
   11490

(When the precompressed versions are not available, normal on-the-fly gzip compression by mod_deflate remains available, equivalent to gzip -6.)

The Apache configuration has been updated to serve the precompressed versions to capable clients.

The first log entry is from before the precompressed versions were (manually, this time) put in place, and the others from after:

[19/Apr/2024:16:40:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12064 "-" "Amazon Music Podcast"
[19/Apr/2024:16:45:20 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11550 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0"
[19/Apr/2024:16:46:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11551 "-" "Amazon Music Podcast"
[19/Apr/2024:17:01:50 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 9385 "-" "PodcastAddict/v5W (+https://podcastaddict.com/; Android podcast app)"

This represents an apparent ~5% saving for gzip-capable clients, and an apparent ~22% saving for br-capable clients (12064 bytes logged before vs 9385 after). (~90% saving for the latter vs uncompressed...)
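A rough check of those savings in Python, using the byte counts from the log sample and file listing above. The logged sizes come from different clients and HTTP versions, so treat the results as indicative only; the helper name is my own:

```python
def saving_percent(before: int, after: int) -> float:
    """Percentage reduction in bytes going from 'before' to 'after'."""
    return 100.0 * (before - after) / before


# Logged response sizes from the sample above.
on_the_fly_gzip = 12064     # Amazon, before precompression
precompressed_gzip = 11551  # Amazon, after
precompressed_br = 9385     # PodcastAddict over HTTP/2, Brotli

# File sizes on disk from the listing above.
uncompressed_file = 89914
br_file = 9001

print(f"gzip: {saving_percent(on_the_fly_gzip, precompressed_gzip):.1f}%")
print(f"br vs on-the-fly gzip: {saving_percent(on_the_fly_gzip, precompressed_br):.1f}%")
print(f"br vs uncompressed: {saving_percent(uncompressed_file, br_file):.1f}%")
```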

2024-04-17: CORS and ETag

I have slightly adjusted the Apache configuration to work with CORS, and drop ETag, for all .rss, .atom (and .xml and .vtt) files, rather than everything under /rss/. In particular this now includes /sitemap.atom and /sitemap.xml. I hope that this will improve cacheability (and other usability) of Atom feeds a little.

<IfModule mod_headers.c>
  <FilesMatch "\.(rss|vtt|atom|xml)$">
    Header set access-control-allow-origin *
    # Avoid Apache ETag / mod_deflate bug.
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
  </FilesMatch>
</IfModule>

... and the first few 304s for /sitemap.atom have come through:

[17/Apr/2024:09:48:23 +0000] "GET /sitemap.atom HTTP/1.1" 304 3571 "-" "Feedbin feed-id:XXXX - 1 subscribers"
[17/Apr/2024:09:54:54 +0000] "GET /sitemap.atom HTTP/1.1" 304 3374 "-" "Mozilla/5.0 (compatible; theoldreader.com; 1 subscribers; feed-id=XXXX)"
[17/Apr/2024:10:10:07 +0000] "GET /sitemap.atom HTTP/1.1" 304 185 "-" "NewsBlur Feed Fetcher - 1 subscriber - https://www.newsblur.com/site/XXXX/earth-notes-basic-feed (\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15\")"
[17/Apr/2024:10:11:50 +0000] "GET /sitemap.atom HTTP/1.1" 304 185 "-" "Feedbin feed-id:XXXX - 1 subscribers"
[17/Apr/2024:10:18:35 +0000] "GET /sitemap.atom HTTP/1.1" 304 3571 "-" "Feedbin feed-id:XXXX - 1 subscribers"

(I also zapped a defunct /sitemap.xml for the m-dot site; everything is handled in the main site sitemap now.)

2024-04-16: RSS Stats

I built a script to gather a standard set of RSS-related stats from the last ~week of EOU logs, and that data has been captured for later, mwhahahah!

This is what a run of it looks like:

INFO: /tmp/stats.out/interval.txt: 2024-04-07T06:25:14 to 2024-04-15T06:25:10 inclusive log data
INFO: hits: all 235499, site 115578, feed 9643
INFO: /tmp/stats.out/allHitsByHour.log: all hits by hour (UTC)...
9643 175368483 ALL
460 9125257 17
459 7685543 12
446 7737445 18
434 7820128 15
INFO: /tmp/stats.out/feedHits.log: RSS feed hits...
www.earth.org.uk:80 34.220.118.X - - [07/Apr/2024:06:27:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11526 "-" "Amazon Music Podcast"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"
www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 85016 "-" "iTMS"
www.earth.org.uk:443 104.237.137.X - - [07/Apr/2024:06:29:17 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 14780 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=XXXXXXX; +http://overcast.fm/)"
INFO: /tmp/stats.out/feedHitsByUA.log: feed hits by UA...
9643 175368483 ALL
2806 34128111 "Amazon Music Podcast"
2401 73456937 "iTMS"
542 6380012 "Podbean/FeedUpdate 2.1"
483 8501504 "-"
INFO: /tmp/stats.out/feedHitsByHour.log: feed hits by hour (UTC)...
9643 175368483 ALL
460 9125257 17
459 7685543 12
446 7737445 18
434 7820128 15
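For illustration, here is how the per-UA feed totals might be computed in Python from log lines in the format shown above. This is not the actual script; the regular expression and function name are my own, and I am assuming that the second column of the stats is the sum of the logged response sizes:

```python
import re
from collections import Counter

# vhost, client, two ignored fields, [time], "request", status, bytes,
# "referer", "user agent"
LOG_RE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>.*)"$'
)


def feed_hits_by_ua(lines):
    """Return (hits, bytes) Counters keyed by User-Agent for /rss/*.rss hits."""
    hits, byte_totals = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue
        path = m["path"]
        if not (path.startswith("/rss/") and path.endswith(".rss")):
            continue
        hits[m["ua"]] += 1
        byte_totals[m["ua"]] += 0 if m["bytes"] == "-" else int(m["bytes"])
    return hits, byte_totals


sample = [
    'www.earth.org.uk:80 34.220.118.X - - [07/Apr/2024:06:27:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11526 "-" "Amazon Music Podcast"',
    'www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"',
    'www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"',
    'www.earth.org.uk:443 17.58.59.X - - [07/Apr/2024:06:27:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 85016 "-" "iTMS"',
]
hits, byte_totals = feed_hits_by_ua(sample)
for ua, n in hits.most_common():
    print(n, byte_totals[ua], f'"{ua}"')
```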

2024-04-15: Greenlink support

I added support for the coming INTGRNL 'Greenlink' Irish interconnector ready for go-live on 2024-08-01.

2024-04-14: Podcasting 2.0, TTL, 406, 429

I have added a little more Podcasting 2.0 metadata to my RSS feeds. The non-podcast feeds now include these channel tags:

<podcast:medium>blog</podcast:medium>
<podcast:location geo="geo:51.406696,-0.288789,16">16WW, Kingston-upon-Thames, UK</podcast:location>
<podcast:podroll><podcast:remoteItem feedGuid="02b2185f-3173-5e6f-bdda-cc60fb797f84"/></podcast:podroll>
<podcast:updateFrequency rrule="FREQ=MONTHLY">monthly</podcast:updateFrequency>

That is: medium, location, podroll, updateFrequency.

The main podcast RSS is not using medium or podroll, but is already using item tags transcript and alternateEnclosure.

TTL

I have pushed up all the RSS feed TTL values to a little over 3 days (4327 minutes).

406 Not Acceptable

In an attempt to push back on some of the more badly-behaved bots, I have added to the 'overnight' Apache configuration block covering skipHours:

# Reject (bot) attempts to unconditionally fetch without compression.
# 406 Unacceptable.
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
RewriteCond %{HTTP:Accept-Encoding} ^$
RewriteRule "^/rss/.*\.rss$" - [L,R=406]

This is saying that an empty/missing Accept-Encoding, eg precluding ~7x bandwidth reduction through gzip compression, is not reasonable.

Trying it out in daylight yielded these early 406s (yes, the last two are from Apple...):

[14/Apr/2024:13:44:31 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 148 "-" "-"
[14/Apr/2024:13:44:31 +0000] "GET /rss/podcast.rss HTTP/1.0" 406 418 "-" "-"
[14/Apr/2024:13:53:13 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[14/Apr/2024:13:53:13 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"

Everything is now in place for when the weekly logs roll tomorrow morning!

The defensive Apache config for RSS is now:

# Allow CORS to work for RSS feeds and transcripts.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
  <FilesMatch "\.(rss|vtt)$">
    Header set access-control-allow-origin *
  </FilesMatch>
</IfModule>
# Help conditional requests work by removing the unhelpful XXX-gzip ETag.
# https://httpd.apache.org/docs/current/mod/mod_deflate.html#deflatealteretag
<Location /rss>
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
</Location>
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # Give podcast RSS and similar feeds longer expiry out of work hours.
    ExpiresByType application/rss+xml "access plus 7 hours 7 minutes"
    #
    # Reject (bot) attempts to unconditionally fetch without compression.
    # 406 Unacceptable.
    RewriteCond %{HTTP_REFERER} ^$
    RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
    RewriteCond %{HTTP:If-None-Match} ^$ [NV]
    RewriteCond %{HTTP:Accept-Encoding} ^$
    RewriteRule "^/rss/.*\.rss$" - [L,R=406]
    #
    # For RSS files (which will have skipHours matching the above),
    # if there is no Referer and no conditional fetching, back off
    # when battery is low.
    # 429 Too Many Requests
    RewriteCond %{HTTP_REFERER} ^$
    RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
    RewriteCond %{HTTP:If-None-Match} ^$ [NV]
    RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
    RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
    Header always set Retry-After "25620" env=RSS_RATE_LIMIT
</If>
<Else>
    # Give podcast RSS and similar feeds an expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>

428 Precondition Required is an alternative plausible status in place of 406 or 429, though any client has to be able to make the first fetch and, at least occasionally, an unconditional fetch.

2024-04-15: no go

Oh dear, that did not seem to be generating 406s at ~05:00Z. Reformulated:

# Allow CORS to work for RSS feeds and transcripts.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
  <FilesMatch "\.(rss|vtt)$">
    Header set access-control-allow-origin *
  </FilesMatch>
</IfModule>
# Help conditional requests work by removing the unhelpful XXX-gzip ETag.
# https://httpd.apache.org/docs/current/mod/mod_deflate.html#deflatealteretag
<Location /rss>
    Header unset ETag
    # DHD20240413: DeflateAlterETag is unsupported for sencha.
    #DeflateAlterETag Remove
</Location>
# Reject (bot) attempts to unconditionally fetch without compression.
# 406 Unacceptable.
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
#RewriteCond %{HTTP:Accept-Encoding} ^$ [OR]
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteRule "^/rss/.*\.rss$" - [L,R=406]
#
# For RSS files (which will have skipHours matching the above),
# if there is no Referer and no conditional fetching, back off
# when battery is low.
# 429 Too Many Requests
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
Header always set Retry-After "25620" env=RSS_RATE_LIMIT
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
# Give podcast RSS and similar feeds longer expiry out of work hours.
ExpiresByType application/rss+xml "access plus 7 hours 7 minutes"
</If>
<Else>
# Give podcast RSS and similar feeds an expiry time of ~4h.
ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
</Else>

For the 406 case I now reject a lack of gzip support, not just an empty/missing Accept-Encoding header.
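One detail of the reformulation: as I understand the mod_rewrite documentation, the "<08" and ">21" patterns are lexicographic string comparisons, which is why they are written with two digits. A small Python sketch (function names are hypothetical) showing that this agrees with the numeric -lt/-gt test for zero-padded hours:

```python
def in_skip_hours_lexicographic(time_hour: str) -> bool:
    """Mimic RewriteCond "<08" [OR] ">21" on a zero-padded hour string."""
    return time_hour < "08" or time_hour > "21"


def in_skip_hours_numeric(hour: int) -> bool:
    """Mimic the numeric -lt 8 / -gt 21 test used in the <If> expression."""
    return hour < 8 or hour > 21


# The two forms agree for every hour as long as the string is zero-padded.
for hour in range(24):
    padded = f"{hour:02d}"
    assert in_skip_hours_lexicographic(padded) == in_skip_hours_numeric(hour)

# Without padding the string comparison goes wrong for single digits.
print(in_skip_hours_lexicographic("9"))   # True, though 09:00 is outside the block
print(in_skip_hours_lexicographic("09"))  # False
```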

Sample rejections (which stopped by 08:00Z as intended):

[15/Apr/2024:06:56:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 428 "-" "Go-http-client/1.1"
[15/Apr/2024:06:56:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 428 "-" "Go-http-client/1.1"
[15/Apr/2024:07:02:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[15/Apr/2024:07:02:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"
[15/Apr/2024:07:10:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 584 "-" "taddy.org/developers 1.0"
[15/Apr/2024:07:11:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3311 "-" "iTMS"
[15/Apr/2024:07:11:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 158 "-" "iTMS"

A small tweak to the 406 part will reject non-compressed fetches when the GB grid has high intensity compared to the last week, since the Internet upstream of me is at least in part GB-grid powered.

-RewriteCond "%{TIME_HOUR}" ">21"
+RewriteCond "%{TIME_HOUR}" ">21" [OR]
+RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f

Another line could reject non-compressed fetches if local battery was low, though doing compression may cost more CPU and battery than encrypting the longer non-compressed response, if I do not pre-compress them.

Providing pre-compressed Brotli RSS feed versions might (from a quick test) save ~20% bandwidth for unconditional transfers, and for when there is a feed change. But cutting the number of unconditional polls would save much more bandwidth. (Note that any byte saving is diminished by https overheads.)

I estimate that ~50% of 'bad' unconditional requests without compression support will be rejected with 406s.

More 429

For the 429 case I have added an "if GB grid intensity is high" ORed with the existing "if battery is low" clause.

 RewriteCond %{HTTP_REFERER} ^$
 RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
 RewriteCond %{HTTP:If-None-Match} ^$ [NV]
+# Have any interaction with the filesystem as late as possible.
+RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f [OR]
 RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
 RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
 Header always set Retry-After "25620" env=RSS_RATE_LIMIT

So if during skipHours an unconditional feed request is made and either of those is the case, the client will now get a 429. Amazon, Apple, Podbean, and Deezer will thus be getting more 429s in their futures. Let us see if my feed is dropped, I receive a complaint, or an intrigued engineer works out what is going on and improves things for all parties. I would like the last, but do not hold out too much hope!

2024-04-19: since high up in a feed-puller bad-boys list comes some anonymous thing(s) (with no User-Agent), I have added a clause to treat that as a sin on a par with low battery and high grid carbon intensity during skipHours:

 RewriteCond %{HTTP:If-None-Match} ^$ [NV]
+# Not saying who you are (no User-Agent) and ignoring skipHours is rude.
+RewriteCond %{HTTP:User-Agent} ^$ [NV,OR]

2024-04-29: also now for 406: no User-Agent is on a par with high grid carbon intensity, for no-Referer unconditional requests not allowing compression:

 RewriteCond "%{TIME_HOUR}" "<08" [OR]
 RewriteCond "%{TIME_HOUR}" ">21" [OR]
+# Not saying who you are (no User-Agent) and not allowing compression is rude.
+RewriteCond %{HTTP:User-Agent} ^$ [NV,OR]

2024-04-16: 406 and 429 custom error pages

(This evening, now that GB grid intensity is relatively high vs the last 7 days, my server is starting to reject some of the clownishly-bad RSS feed polling, eg by iTunes: ~1000x too often, ignoring Cache-Control, with no If-None-Match, no If-Modified-Since, and no Accept-Encoding to allow a gzip ~7x bytes saving. Come on Apple, you can engineer better than this!)

To try and give that intrigued engineer a clue, I have added custom error pages for 406 and 429, with helpful pointers. I may have to update these as and when I update my defences...

Here is the current 406 text:

406: Not Acceptable

Bad request Accept headers

Please:

  • allow at least gzip compression in Accept-Encoding
  • where possible use conditional requests with If-None-Match or If-Modified-Since
  • where possible honour Cache-Control or Expires and similar refresh hints such as RSS skipHours; help save bandwidth, CPU and climate
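For that intrigued engineer, a hedged Python sketch of what following those pointers might look like on the client side. The User-Agent string, function names and default interval are all hypothetical; only delta-seconds Retry-After values are handled, and response header names are assumed to be in canonical case:

```python
import re
from typing import Optional


def polite_request_headers(last_modified: Optional[str] = None,
                           etag: Optional[str] = None) -> dict:
    """Request headers for a feed poll that allows gzip and revalidation."""
    headers = {
        # Hypothetical identity: say who you are.
        "User-Agent": "ExampleFeedReader/0.1 (+https://example.org/bot)",
        "Accept-Encoding": "gzip",
    }
    # Replay the validators saved from the previous successful fetch.
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def next_poll_delay_seconds(status: int, response_headers: dict,
                            default: int = 4 * 60 * 60) -> int:
    """How long to wait before polling again, from the server's hints."""
    retry_after = response_headers.get("Retry-After", "")
    if status == 429 and retry_after.isdigit():
        return int(retry_after)
    m = re.search(r"max-age=(\d+)", response_headers.get("Cache-Control", ""))
    if m:
        return int(m.group(1))
    return default


print(polite_request_headers("Fri, 19 Apr 2024 17:16:00 GMT"))
print(next_poll_delay_seconds(429, {"Retry-After": "25620"}))         # 25620
print(next_poll_delay_seconds(200, {"Cache-Control": "max-age=14820"}))  # 14820
```

A fuller client would also honour Expires and the feed's own skipHours and ttl elements.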

Small irony: the new messages are a couple of hundred bytes longer on the wire each (less than 10%, given https overheads), especially given that compression is often not being supported! I am trimming them (and all noindex pages) a little. Almost none will be read by humans, so elegant prose is largely wasted!

Log-of-shame sample:

[16/Apr/2024:18:22:06 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:22:06 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:30:28 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:30:29 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:34:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 5101 "-" "Podchaser (https://www.podchaser.com)"
[16/Apr/2024:18:42:10 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:42:10 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:18:45:52 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 309 "-" "-"
[16/Apr/2024:18:45:52 +0000] "GET /rss/podcast.rss HTTP/1.0" 406 1875 "-" "-"
[16/Apr/2024:18:54:32 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:18:54:32 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"
[16/Apr/2024:19:08:05 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3506 "-" "iTMS"
[16/Apr/2024:19:08:05 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 319 "-" "iTMS"

2024-04-13: If-None-Match

The Feeder podcast reader is paying attention to HTTP cache control, but although it is apparently using If-None-Match it is not seeing 304 results.

The Apache 2.4 mod_deflate DeflateAlterETag documentation points out that the new AddSuffix default prevents serving "HTTP Not Modified" (304) responses to conditional requests for compressed content.

This does not affect my pre-compressed Gzip and Brotli page responses which correctly serve an ETag based on the actual file served, ie different for the uncompressed, Gzip and Brotli response variants.

I am trying to fix this by removing the unhelpful XXX-gzip ETag for these feed files. Header unset ETag is used because DeflateAlterETag Remove is unsupported in my server.

<Location /rss>
    Header unset ETag
</Location>

I have added the same Header unset ETag for stuff under /img since If-Modified-Since should be enough (no races possible) for immutable content. A slightly better workaround might be RequestHeader edit "If-None-Match" '^"((.*)-gzip)"$' '"$1", "$2"' to allow ETags to work again as intended.

This is effectively an Apache 2.4 mod_deflate ETag bug I think; the ETag should be modified for the compressed variant, but that modified tag should be correctly matched for a subsequent conditional request.
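To see what the RequestHeader edit workaround mentioned above does to an incoming header, here is the same pattern and replacement applied in a small script. The ETag values are made up for illustration, and Apache's $1 and $2 are written as \1 and \2.

```python
import re

# Same pattern and replacement as the RequestHeader edit line above,
# with Apache's $1/$2 written as Python's \1/\2.
GZIP_ETAG_RE = re.compile(r'^"((.*)-gzip)"$')


def widen_if_none_match(value):
    """Offer both the suffixed and the original ETag to the matcher."""
    return GZIP_ETAG_RE.sub(r'"\1", "\2"', value, count=1)


# Hypothetical ETag values, not taken from the real server.
print(widen_if_none_match('"5f3-6159c-gzip"'))  # "5f3-6159c-gzip", "5f3-6159c"
print(widen_if_none_match('"5f3-6159c"'))       # "5f3-6159c" (unchanged)
```

The client echoes back the suffixed tag that it was given, and the rewrite adds the unsuffixed form alongside it so that the server's comparison against the file's own ETag can succeed. Since the pattern is anchored on an opening quote, a weak validator of the form W/"...-gzip" would pass through unchanged.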

(Where it is supported, DeflateAlterETag Remove should be used rather than Header unset ETag, to avoid losing ETags where they may still be helpful, such as on audio and image files.)

This seems to have increased the number of 304s, and the variety of clients getting them, judging from a trailing sample:

[14/Apr/2024:05:16:40 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 93 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[14/Apr/2024:05:19:59 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 222 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0"
[14/Apr/2024:05:20:54 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3565 "-" "NRCAudioIndexer/1.1"
[14/Apr/2024:05:46:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[14/Apr/2024:05:54:31 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3377 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=2522513; +http://overcast.fm/)"
[14/Apr/2024:06:19:11 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 93 "-" "Wget/1.21.3"
[14/Apr/2024:07:01:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[14/Apr/2024:07:04:05 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3377 "-" "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=2522513; +http://overcast.fm/)"
[14/Apr/2024:07:04:58 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 167 "-" "Aggrivator (PodcastIndex.org)/v0.1.7"
[14/Apr/2024:07:07:49 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3565 "-" "NRCAudioIndexer/1.1"
...
[14/Apr/2024:08:42:11 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"

That last is possibly the first-ever 304 for SpaceCowboys / Feeder, which uses OkHttp.

2024-04-11: Moar Transcripts

I am making my way through the remaining missing WebVTT transcripts!

(The last three were hammered out the following morning, first thing...)

2024-04-09: Like and Subscribe Boilerplate

I have added standard like-and-subscribe (and "here are some podcast players") links to each normal desktop podcast page (as an aside). The same information is added to the main podcast section page also.

2024-04-04: Podcast Episode SQTNs

Since the Feeder podcast app seems as if it will show them, I have begun adding some square 'thumbnail' images to selected podcast episodes. They will be added to the RSS podcast feed as item (ie episode) itunes:image elements. They are probably not big enough to technically meet Apple's spec. I have made sure that there is at least a lo-fi .jpgL / .pngL version of each such image so that non-smart readers presenting no Referer will eat less bandwidth.

These will not be visible on the podcast pages.
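For reference, a per-episode image sits inside the RSS item as an itunes:image element with an href attribute. The sketch below builds one with the standard library; the episode title and image URL are placeholders, and this is not how the EOU feed itself is generated.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by the iTunes podcast RSS extensions.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"
ET.register_namespace("itunes", ITUNES_NS)


def episode_item(title, image_url):
    """Build an RSS <item> carrying a per-episode itunes:image."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    # The image URL goes in the href attribute, not in element text.
    ET.SubElement(item, f"{{{ITUNES_NS}}}image", {"href": image_url})
    return item


# Placeholder title and URL, for illustration only.
item = episode_item("Example episode", "https://example.org/img/example-400x400.png")
print(ET.tostring(item, encoding="unicode"))
```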

Podcast episode text icons

I am creating a set of standard cover 'art' icons with a text-to-PNG converter: 400x400 pixels, text horizontally and vertically centred, Helvetica 96px, black on white.

Transcripts on Apple Podcasts

The WebVTT transcripts that I have provided are visible in the macOS Podcasts application on my MacBook Air now.

They do not seem to do anything very useful, eg highlight the current text, but they are there.

"Automatically generated" transcripts seem to work too, though are completely blank for pure music, eg not even a [MUSIC]!

I see that in one case the automated transcription cleverly linked up a spoken domain name, EOU in this case.

Update: finished the last of the 60 podcast transcripts.

2024-04-02: ORCID Byline

For those articles that I have flagged as 'research', an ORCID logo linked to my record is now being added to the by-line.

I have copied the appropriate small logo to the EOU site so as not to add load (or inadvertent tracking) to the main ORCID site.

The original does not seem to be efficiently compressed, though my copy now is, so there is a bunch of bandwidth being wasted wherever the original is served...

% zopflipng -m -m ~/Downloads/5008697/ORCID-iD_icon-16x16.png img/3rdParty/ORCID-iD_icon-16x16.png
Optimizing /Users/dhd/Downloads/5008697/ORCID-iD_icon-16x16.png
Input size: 1261 (1K)
Result size: 218 (0K). Percentage of original: 17.288%
Result is smaller

RSS work storage

I have adjusted the makefile to avoid rebuilding the RSS feed files if the 24h GB grid intensity is high/red, because updated files may result in more Internet traffic (200s, not 304s). Parts of the Internet near me that carry this traffic use that GB grid power.

Also the local power status has to be HIGH for most RSS feeds to be rebuilt, and must not be LOW for the podcast RSS feed file to be rebuilt.

% ls -al _gridCarbonIntensityGB.red.flag
0 Apr  2 05:31 _gridCarbonIntensityGB.red.flag
% make rss/*.built
make: Nothing to be done for 'rss/note-on-site-technicals.rss.built'.
make: Nothing to be done for 'rss/podcast.rss.built'.
make: Nothing to be done for 'rss/saving-electricity.rss.built'.
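The gating rules described above can be summarised as a small decision function. This is a sketch of the logic only, not the makefile itself: the function names are invented, and the intermediate power status value "OK" is an assumption, since only HIGH and LOW are named here.

```python
import os

RED_FLAG = "_gridCarbonIntensityGB.red.flag"  # flag file name from the listing above


def red_flag_present(directory="."):
    """True if the red grid-intensity flag file exists in the directory."""
    return os.path.exists(os.path.join(directory, RED_FLAG))


def may_rebuild(feed, grid_is_red, power_status):
    """Decide whether a feed file may be rebuilt now.

    power_status is assumed to be one of "LOW", "OK" or "HIGH";
    only HIGH and LOW are named in the text, the middle value is a guess.
    """
    if grid_is_red:
        # High grid intensity: defer, since fresh files mean 200s not 304s.
        return False
    if feed == "podcast":
        # The podcast feed only needs local power not to be LOW.
        return power_status != "LOW"
    # Every other feed waits for plentiful local power.
    return power_status == "HIGH"
```

So with the flag present nothing is rebuilt, matching the "Nothing to be done" output above, while with the flag absent the podcast feed is rebuilt more readily than the others.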

Today it has been red since 05:31Z (~6:30am), up until ~9pm so far. So this may need to be relaxed a little. The feed can easily be manually built with the script if need be.

I have applied similar build restrictions to other 'feed' files.

This is a form of work storage or deferral until better times.

(See previous work storage note.)
