Earth Notes: On Website Technicals (2024-05)

Updated 2024-05-28 07:58 GMT.
By Damon Hart-Davis.
Tech updates: RED 406 and 429, mod_dumpio, time hash, mod_log_forensic, Sec-CH-UA-Mobile, rate limit, NTP, feed diet, time series, rogue Googlebot...
tools
The RSS efficiency side-quest continues...

2024-05-26: CORS Redux

I was not getting the right CORS-friendly header set for RSS feed files. I guessed that was because the Apache FilesMatch is based on the literal name of the file served, not the request URL, so if we serve precompressed file its extension would not match. Fixed, maybe:

# Allow CORS to work for RSS and Atom feeds and transcripts.
# This allows for precompressed versions also.
# This allows browsers to access them from non-EOU pages.
<IfModule mod_headers.c>
  <FilesMatch "\.(rss|vtt|atom|xml)(gz|br)?$">
    Header set access-control-allow-origin *
    # Remove unhelpful header to save bandwidth.
    Header unset Accept-Ranges
  </FilesMatch>
</IfModule>

2024-05-24: Googlebot Behaving Badly

...
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:49:40 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:51:07 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:52:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:53:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:54:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:56:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:57:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.164 - - [24/May/2024:06:58:28 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.65.166 - - [24/May/2024:06:59:29 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
www.earth.org.uk:443 66.249.70.1 - - [24/May/2024:10:12:48 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.8 - - [24/May/2024:10:13:49 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13322 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.8 - - [24/May/2024:10:17:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.2 - - [24/May/2024:10:18:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.70.2 - - [24/May/2024:10:19:23 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13322 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...

More than 200 such requests before 08:00Z today! (~270 by 10:20Z, ~400 by 15:20Z, 814 by end of day, and only 439 other fetches (194 from EOU) in the day (bye-bye crawl budget!); still grinding away and over 300 by 06:30Z the next morning...) That will get it into the greedy bots list for maybe a week or two!

Yep, there it is at number 7:

# Built: 2024-05-24T09:40+00:00
# MAXHITSPERUAPERDAY: 25
# MAXUAS: 10
#----------------
# request-count User-Agent
# MD5hash approx-hits-per-day
# 201 iTMS
97f76eb7e02c5ff923e1198ff1c288cd 201
# 190 Spotify/1.0
4582d9bdbcef42af27d89da91c6eb804 190
...
# 34 Mozilla/5.0 (comp
af26a291b02611b619f29585bbb19ac9 34
...

I have filed feedback via the GSC. I have also DMed a couple of Google people via Mastodon, and had a long chat in DMs!

By the , Googlebot was in 4th place, above Amazon!

# Greedy podcast feed pullers: keys are MD5 hashed User-Agent.
# A non-empty txt map lookup of the %{md5:%{HTTP:User-Agent}} means bad!
# Built: 2024-05-25T09:49+00:00
# MAXHITSPERUAPERDAY: 25
# MAXUAS: 10
#----------------
# request-count User-Agent
# MD5hash approx-hits-per-day
# 201 iTMS
97f76eb7e02c5ff923e1198ff1c288cd 201
# 191 Spotify/1.0
4582d9bdbcef42af27d89da91c6eb804 191
# 180 Podbean/FeedUpda
54e0e9df937b06cc83fab29f44c02b7f 180
# 111 Mozilla/5.0 (com
af26a291b02611b619f29585bbb19ac9 111
# 93 Amazon Music Podc
d69be2563c9f1929edf2906d41809aea 93
# 51 -
d41d8cd98f00b204e9800998ecf8427e 51
# 47 Mozilla/5.0 (Linu
e30b15ba56c2a3b272646f0a386feb95 47
# 29 Mozilla/5.0 (Wind
6b9a00393fb1607b0ada13520f814ab5 29

2024-05-26: still flailing!

Podcast RSS feed poll hits today/recent (26/May/2024) at 2024-05-26T08:42Z.

CountPartial User-Agent
422Mozilla/5.0 (comp
88iTMS
75Spotify/1.0
37Gofeed/1.0
34-
28Podbean/FeedUpdat
19Amazon Music Podc
14Mozilla/5.0 (Wind
9Overcast/1.0 Podc
9fyyd-poll-1/0.5

Googlebot is now in top spot and continuing to retry roughly every minute...

2024-05-23: RSS Time Series HTML

I have started writing scripts to generate HTML time-series analysis of the RSS data.

I added a new simple measure of the interval in days covered by each block of data, which tends to flip between 6 and 8 days, even though nominally 7. This makes normalisation easier. I also hand-crafted some of the subsidiary data extract files for older data blocks that are generated by newer runs of the extraction code, to keep the time-series analysis scripts simple.

Here is a sample table:

Summary of RSS feed fetch stats over time.
IntervalFeed/day% of EOU site trafficNote
EndingDaysHitsMBytesHitsBytes
2024-04-01T06:25Z 8 1077 16.8 7.5 1.5
2024-04-15T06:25Z 8 1205 21.9 8.3 2.3
2024-04-21T06:25Z 6 1027 11.6 4.2 0.9
2024-04-29T06:25Z 8 1204 12.8 7.6 0.7 2024-04-23: added Spotify with lite feed
2024-05-05T06:25Z 6 1285 12.6 8.9 1.4
2024-05-13T06:24Z 8 1070 7.1 6.4 0.8
2024-05-19T06:25Z 6 1144 6.1 7.5 0.6

About 1 hit per day per distinct client, thus ~20hit/d, with ~10kB/hit on a day when the feed changes (else much less), so ~20 hits and 0.2MB per day is what should be here with efficient clients.

2024-05-20: Faster Rate Limiting

I have sped up the 'slow' rate limiting to potentially update each day (when the sun is out), with a nominal limit for each User-Agent of one hit per hour. A bit tighter than before, thus catching a couple of extra User-Agents already. Still loose enough to allow the conventional 1-poll-per-hour mob through. Or what should correspond to several humans behind any one client User-Agent that respects Cache-Control max-age and/or skipHours, etc...

2024-05-19: Feed Diet

The logs rolled today, so a from this 6-day block:

% sh ./prepareStats.sh
INFO: /tmp/stats.out/interval.txt: 2024-05-13T06:25:12 to 2024-05-19T06:25:10 inclusive log data
INFO: hits: all 222476, site 92016, feed 6864
INFO: bytes: all 17563828813, site 6051922187, feed 36625901
INFO: /tmp/stats.out/allHitsByHour.log: all hits by hour (UTC)...
7541 705596235 00
7797 601362129 01
9212 848530101 02
8891 693462632 03
7931 1161997126 04
INFO: /tmp/stats.out/siteHitsByHour.log: site hits by hour (UTC)...
3019 247595118 00
2730 97503469 01
3802 213266786 02
4797 187066576 03
3305 586177279 04
INFO: /tmp/stats.out/feedHits.log: RSS feed hits...
www.earth.org.uk:80 52.39.X.X - - [13/May/2024:06:27:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11588 "-" "Amazon Music Podcast"
www.earth.org.uk:80 139.177.X.X - - [13/May/2024:06:27:42 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 183 "-" "Aggrivator (PodcastIndex.org)/v0.1.7"
www.earth.org.uk:80 107.23.X.X - - [13/May/2024:06:29:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11588 "-" "Amazon Music Podcast"
www.earth.org.uk:443 34.78.X.X - - [13/May/2024:06:31:30 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 200 5573 "-" "Spotify/1.0"
www.earth.org.uk:80 18.185.X.X - - [13/May/2024:06:32:00 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 9561 "-" "axios/1.5.1"
INFO: /tmp/stats.out/feedHitsByUA.log: feed hits by UA...
6864 36625901 ALL
1351 7312529 "iTMS"
1234 5467259 "Spotify/1.0"
1086 2844616 "Podbean/FeedUpdate 2.1"
722 6124269 "Amazon Music Podcast"
INFO: /tmp/stats.out/feedHitsByHour.log: feed hits by hour (UTC)...
270 1145460 00
276 1089997 01
235 1024117 02
239 1199946 03
224 1067031 04
INFO: /tmp/stats.out/feedStatusByUA.log: feed hits and status by UA...
6864 36625901 200:304:406:429:SH 2078 1028 328 3391 2538 ALL
1351 7312529 200:304:406:429:SH 144 0 216 991 557 "iTMS"
1234 5467259 200:304:406:429:SH 206 202 0 826 513 "Spotify/1.0"
1086 2844616 200:304:406:429:SH 191 0 0 895 231 "Podbean/FeedUpdate 2.1"
722 6124269 200:304:406:429:SH 511 0 2 209 245 "Amazon Music Podcast"
INFO: /tmp/stats.out/feedStatusByHour.log: feed hits and status by hour (UTC)...
270 1145460 200:304:406:429:SH 67 41 0 162 270 00
276 1089997 200:304:406:429:SH 56 45 0 175 276 01
235 1024117 200:304:406:429:SH 54 33 0 146 235 02
239 1199946 200:304:406:429:SH 62 37 0 132 239 03
224 1067031 200:304:406:429:SH 62 32 0 128 224 04

A fragment of live site stats (including RSS) from yesterday morning:

Stats updated: 2024-05-18T11:32Z

StatValue
Fraction bot hits (target <0.5)0.575
Fraction GET 200s0.715
Fraction GET 206s0.001
Fraction GET 301s0.026
Fraction GET 302s0.061
Fraction GET 304s0.067
Fraction GET 400s0.000
Fraction GET 403s0.000
Fraction GET 404s0.021
Fraction GET 406s0.001
Fraction GET 410s0.000
Fraction GET 416s0.000
Fraction GET 429s0.017
Fraction GETs0.909
Fraction HEADs0.073
Fraction HTTP hits0.205
Fraction HTTPS hits0.795
Fraction human main-page GET 200s0.043
Fraction human m/(m+www) main-page GET 200s0.147
Fraction RSS podcast feed bytes RSS/www (target <0.01)0.007
Fraction RSS podcast feed hits RSS/www (target <0.045)0.068
Fraction site hits amp/(amp+www)0.019
Fraction site hits m/(m+www)0.068
Mean human main-page transfer bytes m21436
Mean human main-page transfer bytes www27088
Mean transfer bytes amp474
Mean transfer bytes m31961
Mean transfer bytes www59668
Unique days seen14
Unique hours seen318
Unique human IPs/day equivalent152

Podcast RSS feed poll hits today/recent (18/May/2024) at 2024-05-18T11:32Z.

CountPartial User-Agent
112iTMS
99Spotify/1.0
51Podbean/FeedUpdat
23Amazon Music Podc
22-
20Mozilla/5.0 (Linu
16Mozilla/5.0 (Wind
16itms
12fyyd-poll-1/0.5
11iVoox Global Podc

Amazon pays some attention to 429 rejections and so has moved down the list. The rejection of almost all iTunes (iTMS) requests to compensate for the lack of compression trims bandwidth as planned. But at first blush, overall hit and bytes totals are not as low as hoped. Possibly the addition of Spotify in particular has made feed dieting harder.

20240417/siteHitsByHour.log:115578 7636896047 ALL
20240421/siteHitsByHour.log:145238 7919447459 ALL
20240429/siteHitsByHour.log:125957 14722083002 ALL
20240505/siteHitsByHour.log:86713 5429515419 ALL
20240513/siteHitsByHour.log:134388 7239370999 ALL
20240519/siteHitsByHour.log:92016 6051922187 ALL

From (8d):

% cat 20240417/interval.txt
2024-04-07T06:25:14 to 2024-04-15T06:25:10 inclusive log data
% head 20240417/feedHitsByUA.log
9643 175368483 ALL
2806 34128111 "Amazon Music Podcast"
2401 73456937 "iTMS"
542 6380012 "Podbean/FeedUpdate 2.1"
483 8501504 "-"
360 4300947 "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio"
250 3799126 "itms"
242 2815836 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
196 192996 "FeedBurner/1.0 (http://www.FeedBurner.com)"
192 2258465 "fyyd-poll-1/0.5"

To (6d):

% cat 20240519/interval.txt
2024-05-13T06:25:12 to 2024-05-19T06:25:10 inclusive log data
% head 20240519/feedHitsByUA.log
6864 36625901 ALL
1351 7312529 "iTMS"
1234 5467259 "Spotify/1.0"
1086 2844616 "Podbean/FeedUpdate 2.1"
722 6124269 "Amazon Music Podcast"
317 444963 "-"
243 2528919 "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio"
188 359888 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
144 1401495 "fyyd-poll-1/0.5"
128 34775 "Aggrivator (PodcastIndex.org)/v0.1.7"

Updated stats for the first hour or so since the log roll this morning:

Stats updated: 2024-05-19T07:52Z

StatValue
Fraction bot hits (target <0.5)0.651
Fraction GET 200s0.652
Fraction GET 206s0.001
Fraction GET 301s0.051
Fraction GET 302s0.080
Fraction GET 304s0.091
Fraction GET 400s0.000
Fraction GET 403s0.000
Fraction GET 404s0.044
Fraction GET 406s0.001
Fraction GET 410s0.000
Fraction GET 429s0.022
Fraction GETs0.943
Fraction HEADs0.030
Fraction HTTP hits0.238
Fraction HTTPS hits0.762
Fraction human main-page GET 200s0.044
Fraction human m/(m+www) main-page GET 200s0.087
Fraction RSS podcast feed bytes RSS/www (target <0.01)0.006
Fraction RSS podcast feed hits RSS/www (target <0.045)0.074
Fraction site hits amp/(amp+www)0.029
Fraction site hits m/(m+www)0.073
Mean human main-page transfer bytes m15754
Mean human main-page transfer bytes www32294
Mean transfer bytes amp776
Mean transfer bytes m29038
Mean transfer bytes www64480
Unique days seen7
Unique hours seen146
Unique human IPs/day equivalent131

Podcast RSS feed poll hits today/recent (19/May/2024) at 2024-05-19T07:52Z.

CountPartial User-Agent
13Spotify/1.0
12iTMS
6Mozilla/5.0 (Linux
6-
4Podbean/FeedUpdate
3Amazon Music Podca
2Mozilla/5.0 (Macin
2itms
2gPodder/3.11.1 (+h
2Aggrivator (Podcas

Note that there have been no new podcast episodes since , and no substantive metadata updates recently either.

These heavy defences, erected as some feed users will not do the right thing, result in bystander casualties. For example, about 10% of distinct User-Agents failed to make any successful (200 or 304) fetches at all, such as these (top 3):

29 47125 200:304:406:429:SH 0 0 0 0 16 "Mozilla/5.0 (compatible; YaK/1.0; http://linkfluence.com/; bot@linkfluence.com)"
8 9886 200:304:406:429:SH 0 0 3 5 4 "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0"
8 18732 200:304:406:429:SH 0 0 1 7 6 "Akkoma 3.11.0-0-ga03f3a9; https://social.wedistribute.org <hello@wedistribute.org>, Akkoma 3.11.0-0-ga03f3a9; https://social.wedistribute.org <hello@wedistribute.org>; Bot"

Most of these rejections are accesses in skipHours.

A new episode was published around 4pm.

2024-05-17: Nice I/O

The main server (sencha) is somewhat I/O constrained for its filesystem on microSD card. Some actions, such as a sweeping check/fix of permissions in the Web tree are a little intense and slow other activity down. Until now! The new command as run from cron is in part:

nice ionice -c idle make perms

which in theory gets out of the way of most other CPU and I/O demand! It at least does not seem to break anything...

Given that I am using the mq-deadline scheduler then ionice probably is not doing anything:

% mount
/dev/mmcblk0p2 on / type ext4 (rw,noatime,commit=300)
...
% grep "" /sys/block/*/queue/scheduler
...
/sys/block/loop7/queue/scheduler:[mq-deadline] kyber none
/sys/block/mmcblk0/queue/scheduler:[mq-deadline] kyber none
/sys/block/ram0/queue/scheduler:none
...

2024-05-16: ntp1 is Dead, Long Live ntp2!

I continue to get a huge amount of NTP traffic, ie many hits per second, though my server has not been stratum-1 for years. I suspect that my server name has been baked into software in routers, etc. So I have removed ntp1.exnet.com from DNS to see if that reduces the flow a little. I may rename ntp.exnet.com too, if the deluge continues!

A tiny snippet from tcpdump:

09:25:06.712266 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48
09:25:06.727073 IP XXXX > 79.135.97.78.123: NTPv4, Client, length 48
09:25:06.729648 IP XXXX > 79.135.97.78.123: NTPv3, Client, length 48
09:25:06.730250 IP 79.135.97.78.123 > XXXX NTPv3, Server, length 48
09:25:06.734552 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48
09:25:06.750494 IP XXXX > 79.135.97.78.123: NTPv4, Client, length 48
09:25:06.750879 IP 79.135.97.78.123 > XXXX NTPv4, Server, length 48
09:25:06.797435 IP XXXX > 79.135.97.79.123: NTPv4, Client, length 48

2024-05-13: Slow Rate Limiting

For a slow-update podcast feed such as mine, polling maybe once or twice per day is reasonable. Polling once per hour is customary, though still far too much. Polling much more than once per hour is foolish. One poll per hour for a week is 168 polls (8 days' worth is 192). It would be relatively easy to automate an analysis of the previous week's feed polls, and block with 429s outside the magic noon free-for-all window, regardless of (say) battery level, all or nearly all requests from bots that made more than ~200 requests and/or the top handful by request count or bytes. (Rejections could still be restricted to unconditional requests...)

For this last week that would have included:

1711 11259905 200:304:406:429:SH 242 0 567 902 696 "iTMS"
1647 8372482 200:304:406:429:SH 835 271 0 541 686 "Spotify/1.0"
1232 11778050 200:304:406:429:SH 997 0 8 227 393 "Amazon Music Podcast"
752 6162639 200:304:406:429:SH 524 0 0 228 165 "Podbean/FeedUpdate 2.1"
474 612922 200:304:406:429:SH 20 0 225 229 176 "-"
237 369734 200:304:406:429:SH 27 149 0 61 100 "Mozilla/5.0 (Windows NT 10.0; Win
64; x64) ..."
6"
219 1846087 200:304:406:429:SH 154 0 0 65 88 "Mozilla/5.0 (Linux;) AppleWebKit/
Chrome/ Safari - iHeartRadio"

A txt RewriteMap keyed on an escaped or hashed User-Agent might be enough, with only the very greedy bots being listed in the map at all.

I suspect that iTunes for one might throw a fit if reined in that hard.

There are complications around multiple independent users of one feed reader, polling of more than one feed (eg the podcast feed has two variants), polling of a feed in more than one step (eg HEAD then GET), spoofing (I do not believe that the first Mozilla User-Agent I have truncated above is an authentic browser), and single User-Agents with many random-ish cloud IP addresses. So maybe I should manage the map manually initially. Or set the threshold significantly higher and keep an eye on behaviour.

(My default cache expiry is a little over 4h, so any one client simply obeying that would only be making ~42 polls per week.)

When rejecting one of these with 429 it may be better to have any random rejection element varying slowly, eg by hour, so that naughty quick retries will not work. Then no goes on meaning no.

2024-05-14: First pass

Here is a first pass map file, with UAs shown truncated to reduce the chance of disclosing any PII:

# Greedy podcast feed pullers: keys are MD5 hashed User-Agent.
# A non-empty txt map lookup of the %{md5:%{HTTP:User-Agent}} means bad!
# Built: 2024-05-14T08:53+00:00
# MAXHITS: 400
# MAXUAS: 10
#----------------
# 1711 iTMS
97f76eb7e02c5ff923e1198ff1c288cd 4
# 1647 Spotify/1.0
4582d9bdbcef42af27d89da91c6eb804 4
# 1232 Amazon Music Po
d69be2563c9f1929edf2906d41809aea 3
# 752 Podbean/FeedUpda
54e0e9df937b06cc83fab29f44c02b7f 1
# 474 -
d41d8cd98f00b204e9800998ecf8427e 1

And here is sample Apache configuration to enforce it:

# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteMap greedyRSSBotMap "txt:/path/to/rss/greedy-podcast-bot-map.txt"
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
# Reject almost all for random entire hours, to defeat quick retries.
RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# Have any interaction with the filesystem as late as possible.
# Bot's UA hash is in the greedyRSSBotMap.
RewriteCond "${greedyRSSBotMap:%{md5:%{HTTP:User-Agent}}}" !^$ [NV]
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]

This does not work because it seems that only constants and some variables (eg %{HTTP:User-Agent}) but not environment variables can appear as the key for the map. So it is all commented out again for now. I could possibly do something based on masked IP, which should be less easily spoofable, but would not catch cloud bots.

Here is a rather more brutal manual version:

# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
## Reject almost all for random entire hours, to defeat quick retries.
#RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# Have any interaction with the filesystem as late as possible.
# DHD20240514: Bad bot manually selected by UA.
RewriteCond "%{HTTP:User-Agent}" "^iTMS$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Spotify/1.0$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Amazon Music Podcast$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" "^Podbean/FeedUpdate 2.1$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" ^$ [NV]
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:2]

This could be adjusted to allow all requests when battery is VHIGH, and/or a small fraction of requests as commented out in the above.

Here is a better dynamic version that only needs hash.flag files set to reject access, and is nice when VHIGH:

# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP_REFERER} ^$
# DHD20240512: using bogus very old IMS date is rude!
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
## Reject almost all random entire hours, to defeat naughty quick retries.
#RewriteCond expr "'%{md5:%{TIME_DAY}%{TIME_HOUR}}' =~ /^[^01]/"
# DHD20240514: greedy/bad bots manually selected by UA, or no UA.
#RewriteCond "%{HTTP:User-Agent}" "^iTMS$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Spotify/1.0$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Amazon Music Podcast$" [NV,OR]
#RewriteCond "%{HTTP:User-Agent}" "^Podbean/FeedUpdate 2.1$" [NV,OR]
RewriteCond "%{HTTP:User-Agent}" ^$ [NV,OR]
# Have any interaction with the filesystem as late as possible.
# Bot's hashed UA appears in flags dir?
RewriteCond expr "-f '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'" [NV]
# Relent and allow request if battery SoC is (very) high.
RewriteCond /run/EXTERNAL_BATTERY_VHIGH.flag !-f
RewriteRule "^/rss/podcast.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:2]

Note that this only blocks requests for the podcast RSS feeds, and the request count used to drive this only looks at requests for them.

2024-05-12: If-Modified-Since Regex Error

I think that my regex to try to validate If-Modified-Since is broken, so I am reverting to check that it is just non-empty for now. I have seen essentially none that have been badly syntactically broken, when looking at the forensics logs, so I should cut the complexity.

Also it seems that Apache 2.4 is using If-Modified-Since for an inequality check, not a validator or exact match only as one might hope, which is good.

Instead as a temporary measure I am treating bogus very old If-Modified-Since dates as if they were in fact absent for the purposes of random 429 rejections. This is temporary since it might be hard to keep when/if ETags are back, and would also need manual updating to match too-old-to-be-real. This mainly affects one not-too-greedy client, though is drafted more generally.

More bugs from complexity

Whoops! My attempt to serve lite HTML pages to clients that would not accept compression went wrong and was preventing me sending precompressed HTML content to some specialised clients, eg that only offer br (this tester for example). A fix to make it more robust is:

-  # No precompression usable, or client does not support even gzip compression.
-  RewriteCond %{HTTP:Accept-Encoding} !gzip [OR]
+  # No precompression usable, or client does not support gzip or br compression.
+  RewriteCond %{HTTP:Accept-Encoding} !((gzip)|(br)) [OR]
   RewriteCond %{HTTP:Sec-CH-UA-Mobile} "[?]1" [OR]
   RewriteCond %{HTTP:Save-Data} on [NC]
   RewriteCond %{DOCUMENT_ROOT}/m%{REQUEST_FILENAME} -s

The tester still gives misleading results from being sent a lite page where it offers no compression model at all for a desktop page, but this is a very special case. Almost no real client will offer just br, or indeed any single method other than gzip or identity.

GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:gzip|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:br|Host:www.earth.org.uk
GET / HTTP/2.0|Accept:*/*|User-Agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.3497.100 Safari/537.36|Accept-Encoding:zstd|Host:www.earth.org.uk

This tester also suggests that there is no point in using zstd for static compression, and little for dynamic over gzip -6, at least in terms of bandwidth and CPU at my end.

Compressible extensions

I have also being extending the list of file/URL extensions that should get on-the-fly gzip if not already covered by something else:

AddOutputFilter DEFLATE bib
AddOutputFilter DEFLATE csv
AddOutputFilter DEFLATE dat
AddOutputFilter DEFLATE log
AddOutputFilter DEFLATE md
AddOutputFilter DEFLATE mid
AddOutputFilter DEFLATE rss
AddOutputFilter DEFLATE sh
AddOutputFilter DEFLATE svg
AddOutputFilter DEFLATE tsv
AddOutputFilter DEFLATE txt
AddOutputFilter DEFLATE vtt
AddOutputFilter DEFLATE xml

This should catch a few more stragglers.

Apple, what are you doing?

This sort of nonsense in my logs annoys me, suggesting that some Apple engineer wants to pretend that HTTP is a generic RPC mechanism and ignore most of HTTP's conventions and etiquette thus wasting lots of CPU and bandwidth (edited Apache mod_log_forensic extract):

HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /img/wordcloud/podcast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
HEAD /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
HEAD /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:60000m
...

Mastodon not allowing Brotli

One small factor that I reported as a issue that could reduce the impact of Mastodon preview stampedes would be allowing Brotli compression of the HTML which would reduce data volume a little. I was told it did. Looking at a couple of forensics log entries suggests not, for fairly recent versions:

GET / HTTP/1.1|User-Agent:http.rb/5.1.1 (Mastodon/4.2.8; +https%3a//0rb.it/)|Host:www.earth.org.uk|Date:Sun, 12 May 2024 13%3a34%3a08 GMT|Accept-Encoding:gzip|Accept:text/html|Connection:close
GET / HTTP/1.1|User-Agent:http.rb/5.1.1 (Mastodon/4.2.7; +https%3a//m.trisweb.com/)|Host:www.earth.org.uk|Date:Sun, 12 May 2024 13%3a34%3a17 GMT|Accept-Encoding:gzip|Accept:text/html|Connection:close

Non-skipHours 429s

The non-skipHours random-fraction 429 rejections will now only happen during some sort of resource constraint, such as high-grid intensity or low battery. So now:

  • during skipHours, no-Referer unconditional requests, when grid intensity is high or battery is low, anything other than squeaky clean, may be rejected with 429
  • other than during noon UTC, no-Referer unconditional requests, when grid intensity is high or battery is low, will have a small fraction randomly rejected with 429
  • other than during noon UTC, no-Referer unconditional incompressible requests, will almost all be rejected with 406 to reduce bandwidth to closer to what allowing compression would have yielded

The success Cache-Control max-age and failure 429 Retry-After vary by time of day to try to steer hits away from skipHours, and are always many hours.

2024-05-11: Liter Podcast Feed

Given that the result of me pointing out to Spotify that its If-Modified-Since header was slightly broken seems to have been for it to be removed entirely, preventing 304 results, I have reduced the maximum number of entries in the podcast 'lite' feed to 5. That reduces wasted bandwidth a little.

100547 rss/podcast.rss
 11035 rss/podcast.rssgz
  9097 rss/podcast.rssbr
  5491 rss/podcast-lite.rss
  1745 rss/podcast-lite.rssgz
  1461 rss/podcast-lite.rssbr

Spotify seems to have reinstated conditional GETs (it is getting 304 responses again) but the deed is done.

I am capturing a little more forensics to see what is happening. Yes, it seems to be back, though the date is now double-digit anyway.

GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive

2024-05-12: If-Modified-Since stopped again

For no obvious reason, this afternoon:

GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Sat, 11 May 2024 18%3a06%3a34 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive

2024-05-10: Date Regex Relax

I have relaxed the regexes checking If-Modified-Since to accept 7 May as well as 07 May (see earlier) as a benign deviation from the spec that will allow a few more 304 responses.

# The official HTTP/1.1 (and newer) date format is fixed length of the form:
#     Tue, 07 May 2024 09:46:11 GMT
# and can be matched by the following:
#RewriteCond %{HTTP:If-Modified-Since} !"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\ ([0-3][0-9])\ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\ (20[0-9][0-9])\ ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\ GMT$" [NV]
# In practice a few otherwise reasonably well-behaved clients drop
# any leading zero on the date, which Apache seems to be able to parse, eg:
#     Tue, 7 May 2024 09:46:11 GMT
# This should be able to be matched by the following:
#RewriteCond %{HTTP:If-Modified-Since} !"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),\ ([0-3]?[0-9])\ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\ (20[0-9][0-9])\ ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\ GMT$" [NV]

2024-05-08: Sec-CH-UA-Mobile: ?1

For automatically selecting the lite/mobile version of main pages (but not necessarily of images for example) I now treat Sec-CH-UA-Mobile: ?1 as equivalent to Save-Data: on.

Note that Sec-CH-UA-Mobile is 'experimental', but adding/removing it is easy in this case (in three places):

+  RewriteCond %{HTTP:Sec-CH-UA-Mobile} "[?]1" [OR]
   RewriteCond %{HTTP:Save-Data} on [NC]

(I saw this header in the forensics, though it only was on 6 requests out of ~1100, and none of those were HTML pages.)

This mechanism seems to be forcing lite EOU pages for my Fairphone 3 running Android 13 and Chrome 124.

Googlebot's mobile / smartphone fetch does not seem to set this flag.

2024-05-07: mod_log_forensic

This evening's fun is to try Apache 2.4 module mod_log_forensic, enabling it with a2enmod log_forensic and in the EOU config something like:

<IfModule log_forensic_module>
ForensicLog ${APACHE_LOG_DIR}/forensic.log
</IfModule>

Immediately I see a snippet such as this:

|GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Tue, 7 May 2024 09%3a46%3a22 GMT|

The 7 May rather than 07 May seems to be technically malformed for example, compared to:

|GET /rss/note-on-site-technicals.rss HTTP/1.1|Host:www.earth.org.uk|Connection:Keep-Alive|Accept-Encoding:gzip|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a26 GMT|User-Agent:SpaceCowboys Android RSS Reader / 2.6.24(309)

Should I really get into editing malformed headers? I think that Apache probably accepts both forms. But I think that I can treat this variant as malformed to reject some requests randomly. Nominally three formats are acceptable, and extra whitespace should be tolerated.

Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format

Apache itself generates the 07 format for Date and Expires and Last-Modified.

% wget --compression=auto --debug -S -O out https://www.earth.org.uk/sitemap.atom
...
---response begin---
HTTP/1.1 200 OK
Date: Tue, 07 May 2024 18:15:00 GMT
Server: Apache
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Tue, 07 May 2024 15:38:44 GMT
Vary: Accept-Encoding,Referer
Content-Encoding: gzip
Cache-Control: max-age=3600
Expires: Tue, 07 May 2024 19:15:00 GMT
X-Frame-Options: DENY
access-control-allow-origin: *
Content-Length: 1384
Keep-Alive: timeout=5, max=100
Content-Type: application/atom+xml

---response end---

Some Spotify requests get 304 responses, while others are rejected by the bad-unconditional-requests logic:

[07/May/2024:18:03:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4348 "-" "Spotify/1.0"
[07/May/2024:18:10:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:17:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:24:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[07/May/2024:18:31:46 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4348 "-" "Spotify/1.0"

gPodder

Interestingly this poking around seems to have found a minor issue in gPodder where it was retaining a stale ETag that had not been present in responses for several days.

Hurrah for forensics!

2024-05-06: 406/429 Noon Hole

There has just been more than 24h with high grid carbon intensity, and all iTMS (iTunes) poll attempts rejected with 429 or 406. I have added a noon hole, for an hour, when there is most likely to be some solar power off-grid and on, in the 406 defence. Even some of the semi-bad bots should be able to make a successful poll then. iTunes made it through the hole today!

...
[06/May/2024:11:21:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[06/May/2024:11:21:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[06/May/2024:12:10:38 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[06/May/2024:13:04:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[06/May/2024:13:04:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
...

I have also reinstated an all-day 429 rule for unconditional GETs, and given it a matching noon hole. So the common brute-force, every-hour, but compression-enabled bots should be able to get through it at least daily.

ETag paranoia

I have also added to the top of the EOU config unsetting any If-None-Match request header to avoid interfering with If-Modified-Since, so now the block is:

# When upgrading to a new-enough (eg 2.5) Apache, enable below
# and re-allow ETags elsewhere eg for large static image / audio / video.
#DeflateAlterETag Remove
# Until Apache 2.5, disable ETag entirely for this site.
FileETag none
Header unset ETag
RequestHeader unset If-None-Match

If-Modified-Since paranoia

I am testing validating If-Modified-Since with a regex, for this and the 429 case, rather than not just being empty or missing, something like:

^(Mon|Tue|Wed|Thu|Fri|Sat|Sun), ([0-3][0-9]) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (20[0-9][0-9]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9] GMT$

mod_dumpio

I am trying out Apache's mod_dumpio which seems to be available in my server's build. I expect to need to issue a sudo a2enmod dump_io to enable it, or the equivalent in my ansible setup if I would like the module permanently available (I probably do not). (And a2dismod to disable when I have finished...)

In any case, I am making my use conditional on the module being present, which should avoid breaking things either way.

<IfModule mod_dumpio.c>
DumpIOInput On
LogLevel alert dumpio:trace7
</IfModule>

Seemingly this has to be at server level, not virtual-host level...

I made all that work, but the output does not seem very helpful! So I have turned it all off again for now.

[Mon May 06 12:29:24.857553 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [init-blocking] 0 readbytes
[Mon May 06 12:29:24.912990 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.913114 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(151): [client 79.135.97.65:63570] mod_dumpio: dumpio_in - 11
[Mon May 06 12:29:24.913302 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.913359 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(151): [client 79.135.97.65:63570] mod_dumpio: dumpio_in - 11
[Mon May 06 12:29:24.913444 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-blocking] 65536 readbytes
[Mon May 06 12:29:24.922025 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): 148 bytes
[Mon May 06 12:29:24.922107 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(103): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n
[Mon May 06 12:29:24.922193 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.922461 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): 1521 bytes
[Mon May 06 12:29:24.922498 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(103): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT):
[Mon May 06 12:29:24.922905 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(140): [client 79.135.97.65:63570] mod_dumpio: dumpio_in [readbytes-nonblocking] 65536 readbytes
[Mon May 06 12:29:24.923008 2024] [dumpio:trace7] [pid 20553:tid 1962402864] mod_dumpio.c(63): [client 79.135.97.65:63570] mod_dumpio:  dumpio_in (data-TRANSIENT): 9 bytes
...

Time hash

I have adjusted the %{md5:%{TIME}} 406 randomiser to instead use TIME_MIN for its fastest-changing part so that a very quick (naughty) retry is more likely to get an unchanged result. (Eg from a bot that does a HEAD before each GET.) Thus bad behaviour should be less likely to get an undeserved 200. I could fold into the hash the IP address and/or User-Agent at a tiny extra CPU cost in order to reduce the risk of stampedes, but maybe also reducing (OS) cache effectiveness a tad.

This would make things harder for real humans attempting to sign up to the feed, or attempting a manual forced refresh, so is not used for 429 rejections, including daytime, of unconditional GETs.

Podbean in particular with its bad fast retry after 429 will evade this defence for now.

2024-05-05: Glossary Lite

I added a heat-pump entry to the glossary.

I also now have a lighter-weight version of the glossary built for non-desktop versions, showing at most one image per entry. Other trimming might make sense too.

2024-05-02: Random Early Drop 429

I now also randomly reject about half of unconditional GETs with 429s during skipHours even when everything else (eg battery) is OK. Half so that a human being manually signing up for the first time, or refreshing their feed, need typically retry once only. But even so, about halving the rest of the useless overnight bytes, and signalling more unhappiness clues for any engineer that looks. This should trim traffic from Amazon in particular.

RewriteCond expr "'%{md5:%{TIME}}' =~ /^[0-7]/" [OR]

I further extended this by just rearranging the clauses so that for any Referer-less unconditional fetch of an RSS feed file during skipHours, and for about 50% out of skipHours, the remainder of the 429 rejection rules come into play. So this should trim some daytime waste also.

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
#RewriteCond %{HTTP:If-None-Match} ^$ [NV]
# Potentially reject ~50% at random to trim bandwidth, but allow manual retries.
RewriteCond expr "'%{md5:%{TIME}}' =~ /^[0-7]/" [OR]
# During skipHours.
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
# Not allowing compression is rude.
RewriteCond %{HTTP:Accept-Encoding} !gzip [OR]
# Not saying who you are (no User-Agent) is rude.
RewriteCond %{HTTP:User-Agent} ^$ [OR]
# Have any interaction with the filesystem as late as possible.
# Battery is low.
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f [OR]
# Grid is high-intensity.
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]

I adjusted the 406 random rejection percentage down to compensate.

I also now have the Header always set Retry-After instruction set alongside (and to the same duration as) the ExpiresByType application/rss+xml dependent on the hour of day, in case the client actually reacts properly to 429! (Note the REDIRECT_ prefix on the variable name.)

<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # This should be long enough to jump out of skipHours in one go.
    ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
    Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT
</If>
<ElseIf "%{TIME_HOUR} -gt 17">
    # Jump expiry right over coming skipHours block.
    ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
    Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<Else>
    # Give podcast RSS and similar feeds a default expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
    Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT
</Else>

This is what 429 rejection headers for wget look like:

HTTP/1.1 429 Too Many Requests
Date: Thu, 02 May 2024 12:25:37 GMT
Server: Apache
Retry-After: 14720
Vary: Accept-Encoding,Referer
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Sun, 28 Apr 2024 11:05:28 GMT
Accept-Ranges: bytes
Content-Length: 925
X-Frame-Options: DENY
Keep-Alive: timeout=5, max=100
Content-Type: text/html

And others (406):

HTTP/1.1 406 Not Acceptable
Date: Thu, 02 May 2024 12:32:36 GMT
Server: Apache
Vary: Accept-Encoding,Referer
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Sun, 28 Apr 2024 11:05:22 GMT
Accept-Ranges: bytes
Content-Length: 928
X-Frame-Options: DENY
Keep-Alive: timeout=5, max=100
Content-Type: text/html

And success (200):

HTTP/1.1 200 OK
Date: Thu, 02 May 2024 12:34:15 GMT
Server: Apache
Upgrade: h2
Connection: Upgrade, Keep-Alive
Last-Modified: Wed, 01 May 2024 12:04:22 GMT
Accept-Ranges: bytes
Content-Length: 100547
Vary: Accept-Encoding,Referer
Cache-Control: max-age=14820
Expires: Thu, 02 May 2024 16:41:15 GMT
X-Frame-Options: DENY
access-control-allow-origin: *
Keep-Alive: timeout=5, max=100
Content-Type: application/rss+xml

This is not so far trimming Amazon daytime traffic, however, because the the grid and battery SoC are fine!

It also seems (though I have yet to take detailed header dumps) that some clients may be faking or messing up conditional requests, thus evading some of my controls...

I have now separated out another 429 block that rejects 50% of Refer-less unconditional requests, and removed the 50% rejection from the main 429 rule. This more aggressively rejects traffic from such as Amazon and fyyd at all times.

iTunes panic

iTunes (iTMS) really dislikes being asked to cool it! After being shut out for a while it then unconditionally demands copies of all the (immutable) cover art, often duplicates in quick succession. Apple should be embarrassed about this poor implementation, I think.

[02/May/2024:16:32:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[02/May/2024:16:32:26 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[02/May/2024:16:32:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[02/May/2024:16:32:27 +0000] "HEAD /img/wordcloud/podcast-1.png HTTP/1.1" 200 339 "-" "iTMS"
[02/May/2024:16:32:28 +0000] "HEAD /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:28 +0000] "GET /img/meet/RCK/people-posterised-sq-800w.png HTTP/1.1" 200 105606 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "HEAD /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "HEAD /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:29 +0000] "GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 30479 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "HEAD /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "GET /img/OpenTRV/DORM1-wall-scene-sq-800w.jpg HTTP/1.1" 200 30479 "-" "iTMS"
[02/May/2024:16:32:30 +0000] "HEAD /img/16WW/mkaudio/PV-generation-63M-Audacity-screenshot-2-sq-300w.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/16WW/mkaudio/PV-generation-63M-Audacity-screenshot-2-sq-300w.png HTTP/1.1" 200 1963 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "HEAD /img/16WW/measuring-mains-water-temperature-at-kitchen-sink-with-N84FR-1800w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:31 +0000] "GET /img/audio/podcast-furniture/title/fieldrecording-1.png HTTP/1.1" 200 5207 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "HEAD /img/3rdParty/TTK-logo-sq-300w.jpg HTTP/1.1" 200 340 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "GET /img/3rdParty/TTK-logo-sq-300w.jpg HTTP/1.1" 200 54330 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "HEAD /img/storage/storage-sq-640w.png HTTP/1.1" 200 338 "-" "iTMS"
[02/May/2024:16:32:32 +0000] "GET /img/storage/storage-sq-640w.png HTTP/1.1" 200 3103 "-" "iTMS"
...

I have Amazon Music Podcast pull down (equally badly) a full set of cover art images...

Cold feet

After an overnight run I got cold feet about this 429 rule and I disabled it, bumping up the 406 rejection percentage a little instead.

I reinstated a milder version of this random rejection rule, so that it only happens in skipHours to unconditional no-Referer requests, and it is now one of the OR-ed subconditions that allows a 429:

  • no User-Agent
  • no gzip compression allowed by Accept-Encoding
  • ~50% random chance
  • battery low
  • grid intensity high

I cranked up a separate streamlined version of this 429 rule to apply at any time other than noon (noon UTC is a hole in the defences to match 406), and so the skipHours-only version can lose its random part.

The mixed 4xx status messages to clients, and occasional wrong delivery to potential new listeners is sub-optimal. Though it does correctly hint that there is more than one thing wrong with the effected requests.

The 429s should slow Amazon down a bit, immediately.

2024-05-01: Random Early Drop 406

For those RSS feed clients that do not allow compression, outside skipHours times I now aim to (kinda) Random Early Drop (RED) a high enough fraction of requests to roughly compensate for the wasted bandwidth in those allowed, ie about 75% or above (typically over 4x compression with gzip).

Effective randomness is achieved by matching a single digit of the MD5 hash of the full request date/time. This fairly cheap single line in the 406 rule can be tweaked in various ways if needed.

RewriteCond expr "'%{md5:%{TIME}}' =~ /^[^abc]/" [OR]

For example, successively running wget --debug -S -O out https://www.earth.org.uk/rss/podcast.rss, which does not offer/allow gzip compression, gives:

[02/May/2024:08:03:34 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 104560 "-" "Wget/1.24.5"
[02/May/2024:08:03:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:03:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:03:58 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:00 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:01 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:03 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 104560 "-" "Wget/1.24.5"
[02/May/2024:08:04:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"
[02/May/2024:08:04:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 406 4635 "-" "Wget/1.24.5"

More pertinently on a sample of Apple iTunes requests:

[02/May/2024:08:22:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:22:38 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:08:37:47 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:37:48 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:08:51:40 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:08:51:41 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3571 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 384 "-" "iTMS"
[02/May/2024:09:04:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 101134 "-" "iTMS"
[02/May/2024:09:15:59 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3452 "-" "iTMS"
[02/May/2024:09:15:59 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
~4273 words.