Earth Notes: On Website Technicals (2024-06)
Updated 2024-07-07 14:33 GMT. By Damon Hart-Davis.
2024-06-26: skipDays
So I have added the following to my non-podcast RSS feeds, given that I am least likely to update them on Sunday (or at least the world can wait until Monday anyhow):
<skipDays><day>Sunday</day></skipDays>
I may extend skipDays to my podcast feed in due course.
2024-06-28: added to podcast feed
Nothing has complained yet, so I am adding skipDays to the podcast feed this afternoon.
2024-06-30: facts
Actual data suggests that Sunday is the most popular day to publish a podcast, and Tuesday and Thursday the least, so I will adjust skipDays to match:
% sed -n < rss/podcast.rss -e 's/^.*<pubDate>\([A-Z][a-z][a-z]\),.*$/\1/gp' | sort | uniq -c | sort -n
   4 Tue
   5 Thu
   6 Fri
   7 Wed
   9 Sat
  11 Mon
  24 Sun
I could automate the skipDays generation, but there are subtleties: for example, I would probably want to avoid adjacent skip days, so as to cap publishing latency at one day.
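A minimal sketch of just the adjacency check, assuming the same feed path as above and a hand-picked candidate day list (this is illustrative, not a real generation script):

#!/bin/sh
# Sketch only: warn if a proposed skipDays set contains two adjacent days,
# which would let publishing latency exceed one day.
FEED=rss/podcast.rss
SKIP="Tue Thu Sat"
# Per-day publication counts, as in the one-liner above.
sed -n < "$FEED" -e 's/^.*<pubDate>\([A-Z][a-z][a-z]\),.*$/\1/gp' | sort | uniq -c | sort -n
# Walk the week (Monday repeated to catch the Sunday/Monday wrap) and
# complain if two consecutive days are both in the skip set.
prev=0
for day in Mon Tue Wed Thu Fri Sat Sun Mon; do
    case " $SKIP " in
        *" $day "*) cur=1 ;;
        *)          cur=0 ;;
    esac
    [ "$prev" = 1 ] && [ "$cur" = 1 ] && echo "WARNING: adjacent skip days at $day" >&2
    prev=$cur
done

Picking the least-popular non-adjacent days automatically from those counts would be the next step; for now the choice stays manual.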
2024-07-07: Saturday
I am now adding Saturday to the skipDays, alongside Tuesday and Thursday, so nearly halving the days on which good bots (and other clients) should poll, at the risk of delaying the odd episode reaching its listeners by a day.
2024-06-24: DDoS
Just before 10pm a smallish number of globally distributed IP addresses started making about 60 nonsense HTTP requests per second against one of the domains hosted on the same server as EOU. That crowded out EOU requests, denying service to legitimate clients, and caused Apache to burn about a full CPU's worth of processing power; it may also have partly saturated the Internet connection. It also generated a huge amount of log traffic, which partly contributed to the DoS since the microSD card is not that fast, and which was potentially causing significant write wear on the card.
With a number of changes over a couple of hours, including withdrawing some key DNS information, adjusting logging, and enabling some router firewall rules, the attack became invisible from the EOU server. (It subsequently set off a number of alarms in Google Search Console, as did the planned temporary shut-out of Googlebot in robots.txt.) There may still be nuisance traffic upstream, wasting my ISP's resources.
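I am not reproducing my router's actual rules here, but on a Linux-based router the general shape of one such defence is per-source rate-limiting of new connections, eg with iptables' hashlimit module (illustrative values only):

# Sketch, not the real rules: drop new HTTP/HTTPS connections from any single
# source that opens more than ~10 per second (burst 20), before they reach Apache.
iptables -A FORWARD -p tcp --syn -m multiport --dports 80,443 \
    -m hashlimit --hashlimit-name web-flood --hashlimit-mode srcip \
    --hashlimit-above 10/second --hashlimit-burst 20 -j DROP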
2024-06-22: Googlebot Still Rogue
Podcast RSS feed poll hits today/recent (22/Jun/2024) at 2024-06-22T14:02Z.
Estimated Hits per Day | Partial User-Agent
---|---
1028 | Mozilla/5.0 (comp
193 | Spotify/1.0
177 | iTMS
123 | Podbean/FeedUpdat
92 | Gofeed/1.0
62 | PocketCasts/1.0 (
60 | Amazon Music Podc
44 | -
30 | Mozilla/5.0 (Wind
24 | itms
22 | Overcast/1.0 Podc
22 | fyyd-poll-1/0.5
22 | axios/1.5.1
19 | Mozilla/5.0
16 | Mozilla/5.0 (Maci
(Top 15 pollers of the podcast feeds by hits...)
Stuff I should not have expected to have to do to robots.txt:
# DHD20240620: Googlebot had been going wild (~1000 polls/day) on podcast.rss.
#Sitemap: https://www.earth.org.uk/rss/podcast.rss
# DHD20240622: attempting to block access entirely.
User-agent: Googlebot
Disallow: /rss/podcast.rss
Via GSC I forced a reload of the updated (as above) robots.txt at ~. These should have been some of the final crawls:
[22/Jun/2024:19:16:23 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:17:23 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:18:24 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:19:24 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:21:17 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:22:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:23:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:24:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:27:02 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:19:28:02 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
[22/Jun/2024:20:39:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:20:42:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:20:43:07 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:20:44:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
Still polling nearly every minute at 20:55Z...
Oh, whoops — multiple polls per minute...
[22/Jun/2024:21:09:31 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:10:07 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:10:17 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:10:32 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:11:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:11:17 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:12:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:12:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:12:44 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:13:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:13:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:13:44 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
[22/Jun/2024:21:19:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:19:28 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:19:36 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[22/Jun/2024:21:19:51 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4133 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
2024-06-23
Still ignoring robots.txt next morning...
org.uk:443 66.249.66.43 - - [23/Jun/2024:06:29:35 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3859 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
org.uk:443 66.249.66.41 - - [23/Jun/2024:06:30:35 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3859 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
org.uk:443 66.249.66.42 - - [23/Jun/2024:06:31:36 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3859 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
org.uk:443 66.249.66.42 - - [23/Jun/2024:06:32:36 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 522 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
org.uk:443 66.249.66.43 - - [23/Jun/2024:06:35:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3859 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
In desperation at ~09:24Z I have blocked Googlebot from the whole EOU site in robots.txt, temporarily, to see if that stops the stupidity.
User-agent: Googlebot
Disallow: /
There is some sign ~3h later that Googlebot's crawling is slowing, though it is still pulling down the feed:
- [23/Jun/2024:12:15:10 +0000] "GET /img/a/b/eddi-22621e2a2b77a47c6a1bbc596b91e1d7.l5253802.720x960.jpg HTTP/1.1" 304 3619 "-" "Googlebot-Image/1.0"
[23/Jun/2024:12:19:33 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13712 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:12:20:20 +0000] "GET /sitemap.atom HTTP/1.1" 200 1680 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:12:21:48 +0000] "GET /data/consolidated/energy/std/con/M/Enphase/con-M-Enphase.csv HTTP/1.1" 200 4658 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.175 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:12:24:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13712 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:12:30:07 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13712 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
By ~14:00Z Googlebot seems to be at about one fetch every 10 minutes; there were about 10 over the last hour.
By ~15:30Z it is almost quiet, other than Googlebot-Image crashing through, messily...
At 16:10Z I have seen nothing of Googlebot for nearly an hour, so I am going to take off the full site block in robots.txt, blocking only the feed file:
User-agent: Googlebot
Disallow: /rss/podcast.rss
I am asking GSC to refresh only the https://www.earth.org.uk/robots.txt for now. The views of the same file (eg under http://) are relatively unimportant and can refresh on their normal daily cycle.
The creature stirs...
[23/Jun/2024:15:14:52 +0000] "GET /rss/datafeed.atom HTTP/1.1" 200 5036 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:15:29:45 +0000] "GET /img/solar-cells.jpg HTTP/1.1" 302 554 "-" "Googlebot-Image/1.0"
[23/Jun/2024:15:29:45 +0000] "GET /img/solar-cells.jpg HTTP/1.1" 302 554 "-" "Googlebot-Image/1.0"
[23/Jun/2024:15:29:46 +0000] "GET /img/solar-cells.jpg HTTP/1.1" 200 332003 "-" "Googlebot-Image/1.0"
[23/Jun/2024:15:29:46 +0000] "GET /img/solar-cells.jpg HTTP/1.1" 200 332003 "-" "Googlebot-Image/1.0"
[23/Jun/2024:16:11:36 +0000] "GET /robots.txt HTTP/1.1" 200 4422 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:16:11:36 +0000] "GET /robots.txt HTTP/1.1" 200 1062 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.175 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:16:11:37 +0000] "GET /robots.txt HTTP/1.1" 200 4422 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:17:07:26 +0000] "GET /live-grid-tie-stats.html HTTP/1.1" 301 597 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.175 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:17:07:27 +0000] "GET /_live-grid-tie-stats.html HTTP/1.1" 200 1762 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.175 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:19:23:54 +0000] "GET /rss/datafeed.atom HTTP/1.1" 200 5036 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[23/Jun/2024:20:42:01 +0000] "GET /img/LocalBytes-monitoring-plug/2022-10-29-Cold-Wash/LBplug-cold-wash.svg HTTP/1.1" 200 2093 "https://m.earth.org.uk/" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.6422.154 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
(The /_live-grid-tie-stats.html requests were over http:, so may have been a final twitch from the before times... The next two are also http:.)
As of ~21:00Z, as Googlebot is still a little shy, I am forcing reloads of some of the aliases now.
www.earth.org.uk:80 66.249.66.40 - - [23/Jun/2024:21:28:46 +0000] "GET /SECTION_microgen.html HTTP/1.1" 200 4049 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.66.169 - - [23/Jun/2024:21:33:45 +0000] "GET / HTTP/1.1" 301 3915 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:443 66.249.66.160 - - [23/Jun/2024:21:36:14 +0000] "GET / HTTP/1.1" 301 3915 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:80 66.249.66.42 - - [23/Jun/2024:21:51:07 +0000] "GET /sitemap.html HTTP/1.1" 200 12887 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:80 66.249.66.40 - - [23/Jun/2024:21:53:36 +0000] "GET /sitemap.html HTTP/1.1" 304 239 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:80 66.249.66.41 - - [23/Jun/2024:22:05:46 +0000] "GET /robots.txt HTTP/1.1" 200 1034 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
www.earth.org.uk:80 66.249.66.41 - - [23/Jun/2024:22:05:46 +0000] "GET /note-on-solar-DHW-for-16WW.html HTTP/1.1" 200 21699 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I have reinstated the feed file as a sitemap in robots.txt. Googlebot will not be able to use it for now, but other things might.
Attempting in GSC at ~23:00Z a TEST LIVE URL on an EOU desktop https: URL results in:
Page cannot be indexed: Blocked by robots.txt.
2024-06-24
At ~06:30Z a little https: crawling is happening, but an attempt to test another EOU desktop https: URL still results in Blocked. (By 13:30Z I can issue a live test and request indexing again, and by ~14:15Z Googlebot is crawling a page roughly every minute.)
No sign of restarting the monstrous bashing away on the RSS podcast feed file.
Amusingly, Google-Podcast is showing up with hourly polls (thus also ignoring Cache-Control etc), even though the Google Podcasts service is dead as of about today, worldwide.
Podcast RSS feed poll hits today/recent (24/Jun/2024) at 2024-06-24T08:23Z.
Estimated Hits per Day | Partial User-Agent
---|---
208 | iTMS
192 | Spotify/1.0
93 | Gofeed/1.0
85 | -
56 | Podbean/FeedUpdat
56 | PocketCasts/1.0 (
53 | Mozilla/5.0 (Linu
45 | Amazon Music Podc
40 | Mozilla/5.0 (Wind
26 | axios/1.6.8
24 | itms
24 | Google-Podcast
24 | fyyd-poll-1/0.5
21 | Overcast/1.0 Podc
21 | deezer/curl-3.0
2024-06-20: Tightening Up On Very Bad Bots
I am getting more annoyed by the worst-behaved bots, so they are going to get more 429s in cases where a merely somewhat-bad bot would be allowed a pass. I am tightening up the rules a bit, by degrees.
I also removed the podcast feed as a sitemap in GSC (Google Search Console), and removed it as a sitemap in robots.txt, in the hope that those changes may help temper Googlebot's misbehaviour, ie ~1000 hits a day on the feed file alone, which is about twice its hits on the whole of the rest of EOU.
2024-06-18: RSS Winter Slower Poll
I am meant to be working through a huge UK government consultation, so I am instead being creative with displacing my attention elsewhere.
I have added a clause (4) that is only active in winter when energy is short; the default cache/retry time becomes just under 1 day. This and clause (2) try to drift clients towards early afternoon for polling.
# Set RSS feed cache time, ie minimum poll interval.
# Give podcast RSS and similar feeds longer expiry out of work hours.
# Usually have any Retry-After match what the expiry would have been.
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # In skipHours.
    # Long enough to jump out of skipHours in one go.
    ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
    Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT
    # Reduce Atom feed processing at night too.
    ExpiresByType application/atom+xml "access plus 10 hours 7 minutes"
</If>
<ElseIf "%{TIME_HOUR} -lt 12">
    # Coming up to the open hour at 12XXZ.
    # Normal general expiry time of ~4h (>>1h).
    # But try to funnel bad bots getting 429/503 to the noon 'open' slot.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
    Header always set Retry-After "3420" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<ElseIf "%{TIME_HOUR} -gt 17">
    # Coming up to the start of skipHours.
    # Jump expiry right over coming skipHours block.
    ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
    Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<ElseIf "%{TIME_MON} -lt 3 || %{TIME_MON} -gt 10">
    # Winter: defer next poll just under 1d: creep back to ~noon.
    ExpiresByType application/rss+xml "access plus 23 hours 57 minutes"
    Header always set Retry-After "86220" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<Else>
    # Give podcast RSS and similar feeds a default expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
    Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT
</Else>
Note that this is also selectively applied to Atom feeds, even sitemap.atom, in night-time skipHours.
I note that plenty of Atom clients appear to be ignoring this new Retry-After.
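A quick way to see what is actually being served at any given moment, using wget much as elsewhere in these notes (Retry-After will only show up if the request really was rate-limited):

% wget -S -O /dev/null --compress=auto https://www.earth.org.uk/rss/podcast.rss 2>&1 | \
    egrep -i ' HTTP/|Cache-Control|Expires|Retry-After'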
2024-06-17: RSS Traction?
In A saving bandwidth special! Podnews has done a good write-up of the nonsense that I have been banging on about! I hope that James is OK with me lifting this big block of text, but he is right on the nose (my emphasis):
Spotted by Tom Rossi of Buzzsprout, Spotify has implemented “conditional GET requests” for RSS feeds using Etags. According to Tom Rossi of Buzzsprout, this has already resulted in 87% less RSS bandwidth being consumed for Spotify and Buzzsprout.
Many podcast hosts support an Etag in RSS headers, which only changes when the file itself has changed. In our RSS feed, it’s worked out using the last update time. Spotify sends the last Etag it saw in an if-none-match request; if it’s the same as the current Etag we see, we just give them a “304” code (“not modified”) in return, and do not generate and send the whole RSS feed. This saves processing power for both of us, as well as significant bandwidth: we get 206 checks a day from Spotify to our 32KB RSS feed, but only publish once a day. This change will result in savings for us of 99.5% of the 6.5MB RSS bandwidth we use to Spotify alone. (That’s just for one show!)
Not every podcast directory supports if-none-match. AntennaPod, PodcastRepublic, Gaana, Amazon, Slack, PodcastAddict and many others don’t send that header, resulting in significantly more bandwidth for us (and for them). We’re thankful for Spotify for turning it on.
Another way to save bandwidth on RSS feeds is to ensure they use gzip or brotli compression. RSS feeds are highly compressible, and compressing our podcast feeds should save 89% of our RSS bandwidth bill. Surprisingly, not every podcast host supports compression; but most do. If you’re a podcast host, you can test over here with the URL of one of your RSS feeds.
Podnews wasn’t supporting Gzip for our RSS feed (a mistake); it was accounting for 3.86GB of data per day. We turned it on last week; our bandwidth bill for our RSS feed has more than halved, and is now 1.22GB. We also now fetch feeds with gzip where we can.
Brotli is better than Gzip. As of today, Podnews supports both Brotli and Gzip compression. Cloudfront supports both.
A third way to save bandwidth is to use the alternateEnclosure in Podcasting 2.0 RSS feeds to offer a lower-bitrate version of audio to podcast players that support it. PodLP does; by default, it uses the Opus version of our audio (762KB) rather than our mp3 (7.1MB). That means we (and listeners in developing countries) get an 89% saving in audio bandwidth, unless they choose to get the MP3 version.
You’ll need to generate the additional low-bitrate version, of course. We do that automatically on publishing.
Hurrah!
Note that with Apache 2.4 as the server, ETag basically does not work without huge care and/or messy workarounds, so Last-Modified / If-Modified-Since is the thing to use.
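For anyone wanting to check that from the client side, this is roughly the conditional GET that a well-behaved poller should be making (the LM shell variable is purely illustrative): capture Last-Modified from one fetch, send it back as If-Modified-Since, and expect a body-free 304 if the feed has not changed.

% LM=$(wget -S -O /dev/null https://www.earth.org.uk/rss/podcast.rss 2>&1 | \
    sed -n 's/^ *Last-Modified: //p' | tr -d '\r')
% wget -S -O /dev/null --header="If-Modified-Since: $LM" \
    https://www.earth.org.uk/rss/podcast.rss 2>&1 | egrep ' HTTP/1'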
2024-06-15: Continuous Attack
My tiny host looking after the Thermino is under noticeable strain this morning. Nearly 200k entries in the auth.log over the last week or so:
...
Jun 15 07:29:05 pekoe sshd[2660]: Disconnected from authenticating user root 106.14.149.85 port 54710 [preauth]
Jun 15 07:29:05 pekoe sshd[2662]: Disconnected from authenticating user root 43.159.145.80 port 55926 [preauth]
Jun 15 07:29:05 pekoe sshd[2658]: Received disconnect from 111.70.13.212 port 35544:11: Bye Bye [preauth]
Jun 15 07:29:05 pekoe sshd[2658]: Disconnected from authenticating user root 111.70.13.212 port 35544 [preauth]
Jun 15 07:29:06 pekoe sshd[2673]: Invalid user guest from 167.71.229.36 port 48808
Jun 15 07:29:06 pekoe sshd[2672]: Invalid user debian from 138.204.127.54 port 44399
Jun 15 07:29:06 pekoe sshd[2673]: Received disconnect from 167.71.229.36 port 48808:11: Bye Bye [preauth]
Jun 15 07:29:06 pekoe sshd[2673]: Disconnected from invalid user guest 167.71.229.36 port 48808 [preauth]
Jun 15 07:29:06 pekoe sshd[2672]: Received disconnect from 138.204.127.54 port 44399:11: Bye Bye [preauth]
Jun 15 07:29:06 pekoe sshd[2672]: Disconnected from invalid user debian 138.204.127.54 port 44399 [preauth]
Jun 15 07:29:07 pekoe sshd[2677]: Invalid user lixiang from 43.135.134.197 port 56530
Jun 15 07:29:07 pekoe sshd[2659]: Invalid user nokia from 119.28.118.4 port 46672
Jun 15 07:29:07 pekoe sshd[2676]: Invalid user moderator from 138.97.64.134 port 38692
Jun 15 07:29:07 pekoe sshd[2677]: Received disconnect from 43.135.134.197 port 56530:11: Bye Bye [preauth]
Jun 15 07:29:07 pekoe sshd[2677]: Disconnected from invalid user lixiang 43.135.134.197 port 56530 [preauth]
Jun 15 07:29:07 pekoe sshd[2676]: Received disconnect from 138.97.64.134 port 38692:11: Bye Bye [preauth]
Jun 15 07:29:07 pekoe sshd[2676]: Disconnected from invalid user moderator 138.97.64.134 port 38692 [preauth]
...
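A rough way to see the scale of it, and which sources are most persistent (the second command counts every IP-like string in the log, so it is only approximate):

% grep -c sshd /var/log/auth.log
% grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' /var/log/auth.log | sort | uniq -c | sort -rn | head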
Outbound bandwidth
I asked someone with a ~1Gbps connection to try pulling down a large (~400MB) file from EOU, not expecting to hit my 20Mbps outbound bandwidth limit. He saw a touch over 2MBps (~3 mins), so it seems that the RPi3B can pretty much saturate the outbound link: 2MBps is roughly 16Mbps of the 20Mbps uplink (16WW has an 80/20 FTTC connection).
2024-06-17: max Mbps
Testing locally, ie without leaving the LAN, best of a couple of tries:
sencha% wget -O /dev/null http://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null          100%[===================>] 396.06M  10.6MB/s    in 38s
sencha% wget -O /dev/null https://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null          100%[===================>] 396.06M  10.6MB/s    in 38s
macbook2G% wget -O /dev/null https://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null          100%[===================>] 396.06M  1.37MB/s    in 4m 47s
The MacBook Air's connection above is over 2.4GHz WiFi, via the router; not even reaching 20Mbps over the LAN!
Trying again over 5GHz WiFi ... much better:
macbook5G% wget -O /dev/null https://www.earth.org.uk/out/monthly/public-data-files.tar.xz
/dev/null          100%[===================>] 396.06M  10.8MB/s    in 39s
It seems as if the RPi3B could maybe support ~10MBps (~80Mbps) outbound, with no visible encryption penalty.
2024-06-11: Feed Defence Trimming
To make it more reliable to sign up as a new podcast user/listener, or to make an occasional unconditional poll of the feed, I have tweaked things so that you should only get a 429 (or 503) outside skipHours if you are seen to be a bad bot from recent days' traffic.
You may still get randomly pushed back at any time with a 406 for not allowing compression to save bandwidth, whether known bad bot or not, but that is easy to avoid (allow gzip!) and is a good thing to fix.
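Roughly how a client can check which side of that it is on (a sketch: the first, uncompressed, request may occasionally draw the 406; the second advertises gzip and should not):

% wget -S -O /dev/null https://www.earth.org.uk/rss/podcast.rss 2>&1 | egrep ' HTTP/1'
% wget -S -O /dev/null --compress=auto https://www.earth.org.uk/rss/podcast.rss 2>&1 | \
    egrep -i ' HTTP/1|Content-Encoding'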
During skipHours you may still get a 429, bad bot or not, if the battery is low or the grid is high carbon intensity.
This may warrant reducing the threshold to allow less than one hit per hour at some point, to try to steer those bots to be more climate friendly.
2024-06-12: 5 per hour
As of early (~06:00Z) this morning, so still within skipHours, and filtering out the 429 and 503 responses (leaving almost only 200 and 304 it seems), I was seeing about five requests per hour go through from a variety of sources. That 'good' residue may be acceptable for an audience of the ~50 listens that op3.dev claimed (for May).
% egrep rss/podcast /var/log/apache2/other_vhosts_access.log | egrep -v '" ((503)|(429)) ' | tail
[12/Jun/2024:05:51:18 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11924 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[12/Jun/2024:05:56:19 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11924 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[12/Jun/2024:05:56:33 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 250 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
[12/Jun/2024:06:00:07 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 113 "-" "PodcastAddict/v5 (+https://podcastaddict.com/; Android podcast app)"
[12/Jun/2024:06:07:33 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12024 "-" "Mozilla/5.0 (compatible; FlipboardRSS/1.2; +http://flipboard.com/browserproxy)"
[12/Jun/2024:06:46:14 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11924 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[12/Jun/2024:06:52:01 +0000] "GET /rss/podcast.rss HTTP/1.1" 301 3956 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[12/Jun/2024:07:00:36 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 194 "-" "Aggrivator (PodcastIndex.org)/v0.1.7"
[12/Jun/2024:07:15:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3456 "-" "TPA/1.0.0"
[12/Jun/2024:07:25:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 250 "-" "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)"
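That 'about five per hour' figure can be sanity-checked with a variant of the same filter, bucketing the surviving requests by day and hour (a sketch):

% egrep rss/podcast /var/log/apache2/other_vhosts_access.log | egrep -v '" ((503)|(429)) ' | \
    sed -n 's/^.*\[\([0-9]*\/[A-Za-z]*\/[0-9]*:[0-9]*\):.*$/\1/p' | uniq -c | tail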
2024-06-10: Feed Bot Funnel
It looks as if Podbean may pay some attention to 503 responses. It made no requests after about 02:00Z until about 11:00Z — bliss! (After solid 503 responses for 4h from 22:00Z.) I might force any client with a significant number of skipHours requests to 503s then, as presumably they are ignoring gentler rejections such as 429s.
The updated config gives the same Retry-After hint to both 429 and 503 recipients that pay attention to it:
<ElseIf "%{TIME_HOUR} -lt 12"> # Normal general expiry time (>>1h). # But try to funnel bad bots getting 429/503 to the noon 'open' slot. ExpiresByType application/rss+xml "access plus 4 hours 7 minutes" # For 429. Header always set Retry-After "3420" env=REDIRECT_RSS_RATE_LIMIT # For 503. Header always set Retry-After "3420" env=RSS_RATE_LIMIT </ElseIf>
It occurs to me that, rather than the extra processing to hand 503s to especially bad bots, especially in skipHours, a stateless mechanism could probably switch randomly between returning 503 and 429 in hourly blocks. Hourly is coarse enough that a retry after a minute, as is typical for these bots, will usually get the same response code. If the bad bot pays attention to the Retry-After in either, or just backs off more anyway, then the result would probably be similar. This could be done in the main 429 skipHours block, with careful use of Skip, and would reduce the amount of processing potentially done for every request.
Other 304s
I note from GSC that nearly half of the requests to the m-dot EOU site are receiving 304 responses. In part this may be down to understanding the whole Apache ETag bug better. Also, I have resisted site-wide updates recently, with most such pages untouched since .
2024-06-08: Googlebot Still Rogue
Googlebot is still fetching the RSS podcast feed file nearly 1000 times per day on average. Next is iTunes, previously villain of the piece, at ~200.
I have tweaked the greedy-bot MD5 flag files so that those for the top three by hits are non-zero size, which potentially allows stiffer action against them in the Apache configuration.
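For reference, those flag files are keyed by the MD5 of the User-Agent string, as tested by the RewriteCond shown further down; creating a non-empty flag by hand would look something like this (with $DOCROOT standing in for the real document root):

% UA='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
% H=$(printf '%s' "$UA" | md5sum | cut -d' ' -f1)   # should match Apache's %{md5:...} of the UA
% printf '%s\n' "$UA" > "$DOCROOT/rss/greedybot/$H.flag"   # non-empty, so the -s test is true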
I intend to return 503s rather than 429s (with the same Retry-After value) to those bad bots during at least skipHours. If this means that Googlebot also spiders the rest of the EOU site less at such times, that might be a good thing. I have already explicitly configured Bing to ease up crawling overnight.
Google Search Console (GSC) does not seem to show 429s, so maybe they are (maybe badly) handled at a lower level in the code.
Initial (daylight) testing suggests that both Podbean and Googlebot come back within about 60 seconds of a 503 also, so this extra complexity and tier of push-back may not achieve anything, other than slowing down all requests a little... Note that the Retry-After header was not being set for the 503 responses.
[08/Jun/2024:17:04:51 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:05:04 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:05:08 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3335 "-" "iTMS"
[08/Jun/2024:17:05:09 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3335 "-" "iTMS"
[08/Jun/2024:17:05:22 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 15450 "-" "itms"
[08/Jun/2024:17:05:45 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 860 "-" "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio"
[08/Jun/2024:17:05:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:06:07 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:07:09 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:07:34 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4259 "-" "Spotify/1.0"
[08/Jun/2024:17:07:40 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:07:48 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:08:40 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:08:51 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:09:41 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:09:54 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:10:41 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:10:57 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 382 "-" "Podbean/FeedUpdate 2.1"
[08/Jun/2024:17:12:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3850 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[08/Jun/2024:17:12:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3599 "-" "Mozilla/5.0"
The new config is in part:
RewriteCond "%{TIME_HOUR}" "<08" [OR] RewriteCond "%{TIME_HOUR}" ">21" RewriteCond %{HTTP_REFERER} ^$ # Bot's hashed UA appears in flags dir and is non-zero size? RewriteCond expr "-s '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'" RewriteRule "^/rss/.*\.rss$" - [L,R=503,E=RSS_RATE_LIMIT:1] #... <If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21"> # This should be long enough to jump out of skipHours in one go. ExpiresByType application/rss+xml "access plus 10 hours 7 minutes" # For 429. Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT # For 503. Header always set Retry-After "36420" env=RSS_RATE_LIMIT </If> <ElseIf "%{TIME_HOUR} -gt 17"> # Jump expiry right over coming skipHours block. ExpiresByType application/rss+xml "access plus 14 hours 7 minutes" # For 429. Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT # For 503. Header always set Retry-After "50620" env=RSS_RATE_LIMIT </ElseIf> <Else> # Give podcast RSS and similar feeds a default expiry time of ~4h. ExpiresByType application/rss+xml "access plus 4 hours 7 minutes" # For 429. Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT # For 503. Header always set Retry-After "14720" env=RSS_RATE_LIMIT </Else>
A quick verification (with the time limits temporarily adjusted):
% wget -U "iTMS" -S -O /dev/null --compress=auto https://www.earth.org.uk/rss/podcast.rss --2024-06-08 21:30:53-- https://www.earth.org.uk/rss/podcast.rss Resolving www.earth.org.uk (www.earth.org.uk)... 79.135.97.78 Connecting to www.earth.org.uk (www.earth.org.uk)|79.135.97.78|:443... connected. HTTP request sent, awaiting response... HTTP/1.1 503 Service Unavailable Date: Sat, 08 Jun 2024 20:30:54 GMT Server: Apache Retry-After: 50620 Content-Length: 299 Connection: close Content-Type: text/html; charset=iso-8859-1 2024-06-08 21:30:54 ERROR 503: Service Unavailable.
I have this evening directly reported the issue (as well as via the GSC console and to developer/SEO liaison folks):
Googlebot is generally sensible, but is pulling rss/podcast.rss every minute, even if given a 429 or 503 response code with a long Retry-After delta-seconds header.
2024-06-09: continued
Even with a Retry-After header of several hours on the 503s, Googlebot is just as insistent:
[09/Jun/2024:05:36:38 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:36:42 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4259 "-" "Spotify/1.0"
[09/Jun/2024:05:37:38 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:38:39 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:39:05 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3456 "-" "gPodder/3.11.4 (+http://gpodder.org/) Windows"
[09/Jun/2024:05:39:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 860 "-" "Amazon Music Podcast"
[09/Jun/2024:05:39:39 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:41:11 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 113 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
[09/Jun/2024:05:41:51 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[09/Jun/2024:05:42:52 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3870 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
And Podbean is differently broken, though maybe less badly than with 429s (each block of four fetches is from a different IP address):
[09/Jun/2024:02:03:51 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:02:04:53 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:02:05:56 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:02:06:59 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
...
[09/Jun/2024:04:04:51 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:04:05:54 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 389 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:04:06:57 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 389 "-" "Podbean/FeedUpdate 2.1"
[09/Jun/2024:04:07:59 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 389 "-" "Podbean/FeedUpdate 2.1"
Note that this is distinct from the somewhat better-behaved FeedFetcher-Google:
[06/Jun/2024:02:55:36 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13424 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:07:23:08 +0000] "GET /rss/podcast.rss HTTP/1.1" 301 3956 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:07:23:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4144 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:08:23:50 +0000] "GET /rss/podcast.rss HTTP/1.1" 301 3956 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[06/Jun/2024:08:23:51 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 13424 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
[09/Jun/2024:09:34:44 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 10005 "-" "FeedFetcher-Google; (+http://www.google.com/feedfetcher.html)"
2024-06-03: Bytes Trimmed
While I would like the error page for 404 to be human friendly, it is a shame to waste bandwidth on errors that humans will almost never see, and which may not even be compressed on the wire! (Though I would still like any techies who do see them to be helped by them.)
I just realised that the mobile versions are still more or less styled right even for the desktop site, and are smaller:
% ls -alS {,m/}429.html{,gz,br}
1025 429.html
 874 m/429.html
 578 429.htmlgz
 527 m/429.htmlgz
 416 429.htmlbr
 385 m/429.htmlbr
So for minimal extra effort I have told Apache to use the mobile versions for 406 and 429:
# Serve the 'mobile' versions of some error pages to save a few bytes.
# (Often no human will look at them anyway.)
# Custom 404 page.
ErrorDocument 404 /404.html
# Custom 406 page.
ErrorDocument 406 /m/406.html
# Custom 429 page.
ErrorDocument 429 /m/429.html
At least in some cases this will be defeated by anti-traffic-analysis padding in TLS.
Which would all be lovely, except that it does not work: the error response gets turned into a 301 permanent redirect to the specified HTML pages. Close, but no cigar!
So I have done the boring thing instead, and trimmed the messages to a minimum!
% ls -alS {,m/}429.html{,gz,br}
881 429.html
730 m/429.html
505 429.htmlgz
452 m/429.htmlgz
392 429.htmlbr
327 m/429.htmlbr
Here is a before and after for Googlebot; 24 bytes of Brotli-compressed response saved, nominally:
[03/Jun/2024:16:19:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4168 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
[03/Jun/2024:16:22:19 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 4144 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"