Earth Notes: RSS Podcast Feed Inefficiency
Updated 2024-11-11 12:39 GMT. By Damon Hart-Davis.
TL;DR: Feed Efficiency Suggestions for Aggregators and Other RSS Clients
If you are implementing something that pulls an RSS (eg podcast) or Atom feed, please follow as many of the suggestions below as you reasonably can, in roughly this priority order, to easily help save a huge amount of bandwidth, CPU, money and the climate:
- use Cache-Control max-age HTTP headers [IETFRFC9111] for a "do not poll again before" time: savings of 10x or much more are likely if the feed server is set up well: an unnecessary feed poll avoided entirely is the cheapest kind!
- use a local cache and conditional GET (eg send If-Modified-Since and/or If-None-Match (ETag) HTTP headers [IETFRFC9110]): savings of 10x or more are likely
- allow compression of the feed that you pull down (set Accept-Encoding HTTP headers [IETFRFC9110]) with at least gzip: savings of 2x to 10x are likely
- avoid fetching the feed during skipHours (and/or on skipDays [RAB2009RSS]) given in an RSS feed: savings of 2x are plausible, and can be especially renewables/climate friendly
- honour the Retry-After header of 429 (Too Many Requests) and 503 (Service Unavailable) error responses as a "do not poll again before" time (like Cache-Control max-age above) when present: do NOT retry immediately/faster/repeatedly!
These are aimed first at reducing the number of fetches made, with their overheads including establishing connections and waking CPUs, and then at reducing the bytes per transfer, thus reducing the encryption and parsing and traffic-handling effort.
There are other signals available for extra marks, such as update frequency hints in the RSS file, and the pattern of published episodes! A minimal client sketch implementing the key suggestions follows.
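To make those suggestions concrete, here is a minimal sketch of a polite poller using only the Python standard library. The FEED_URL value, the cache layout, and the helper names are illustrative assumptions, not any real client's code; a real client would persist its cache between runs.

```python
import gzip
import re
import time
import urllib.error
import urllib.request

FEED_URL = "https://www.earth.org.uk/rss/podcast.rss"  # example feed

# Tiny in-memory cache; a real client should persist this between runs.
cache = {"body": b"", "etag": None, "last_modified": None, "next_poll": 0.0}

def _defer_seconds(headers, default=3600):
    """Seconds to wait before the next poll, from Cache-Control/Retry-After."""
    m = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if m:
        return int(m.group(1))
    ra = headers.get("Retry-After", "")
    if ra.isdigit():
        return int(ra)
    return default

def poll(url=FEED_URL):
    """One polite poll; returns the current feed body (possibly from cache)."""
    if time.time() < cache["next_poll"]:
        return cache["body"]                # poll avoided entirely: the cheapest kind
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    if cache["etag"]:                       # conditional GET with stored validators
        req.add_header("If-None-Match", cache["etag"])
    elif cache["last_modified"]:
        req.add_header("If-Modified-Since", cache["last_modified"])
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = resp.read()
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            cache.update(body=body,
                         etag=resp.headers.get("ETag"),
                         last_modified=resp.headers.get("Last-Modified"))
            cache["next_poll"] = time.time() + _defer_seconds(resp.headers)
    except urllib.error.HTTPError as e:
        if e.code == 304:                   # unchanged: keep the cached copy
            cache["next_poll"] = time.time() + _defer_seconds(e.headers)
        elif e.code in (429, 503):          # back off as told; never retry faster
            cache["next_poll"] = time.time() + _defer_seconds(e.headers, 4 * 3600)
        else:
            raise
    return cache["body"]
```

(skipHours handling is sketched later, in the Feeder discussion.)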
(Those serving RSS feeds should make sure that at least gzip compression is available and that conditional GETs work properly (most Apache-2.4-hosted RSS feeds should turn off ETags, for example), and should set a feed expiry to match when new episodes are published. A few rules in your HTTP server configuration can reduce feed bandwidth demand ~3x for slow-changing feeds, without making many waves.)
Abstract
Keywords
RSS, podcasting, efficiency, carbon-aware, frugality, sufficiency, climate, skipHours, Cache-Control, rate limiting
Dataset
Data is available under a CC0 licence (effectively public domain). A copy of the key data (and code) is filed with a DOI in a public repository.
(While work is in progress only partial data is available; some extra anonymisation is anticipated before publication of the full dataset.)
- name
- EOU podcast RSS feed usage inefficiency
- description
- Statistics on bandwidth, HTTP poll frequency, and implied waste from common clients of the Earth.Org.UK podcast RSS feed.
- version
- 1
- keywords
- podcast, RSS, feed, statistics, inefficiency, climate, skipHours, Cache-Control, rate limiting
- variable measured
- bandwidth
- variable measured
- poll frequency
- date created
- 2024-04-02T12:17:31Z
- date published
- 2024-04-02T18:00Z
- date modified
- 2024-11
- temporal coverage
- 2024-03-25T00:04:09Z/..
- spatial coverage
- UK centre 51.406696N,-0.288789E elevation 16m
- distribution
- directory hierarchy of stats files in various mainly plain-text formats
- distribution
- DOI: 10.5281/zenodo.13292718 ZIP file (including code) and manifest at Zenodo
- canonical URL
- this descriptive text with markup
- is part of
- 16WW Dataset
- licence
- this dataset is licensed under CC0, ie it is effectively public domain; if you make use of this data, attribution is welcome but not obligatory
- is accessible for free
- true
Introduction
IN PROGRESS
Working Notes
This describes work in progress.
Note that podping is not in scope for this work as it introduces a central service dependency and may simply hide poor behaviour further upstream.
Summary stats
| Interval ending | Days | Feed hits/day | Feed MBytes/day | Hits (% of EOU) | Bytes (% of EOU) | Selected events |
|---|---|---|---|---|---|---|
| 2024-04-01T06:25Z | 8 | 1077 | 16.8 | 7.5 | 1.5 | |
| 2024-04-15T06:25Z | 8 | 1205 | 21.9 | 8.3 | 2.3 | |
| 2024-04-21T06:25Z | 6 | 1027 | 11.6 | 4.2 | 0.9 | |
| 2024-04-29T06:25Z | 8 | 1204 | 12.8 | 7.6 | 0.7 | 2024-04-23: added Spotify with lite feed |
| 2024-05-05T06:25Z | 6 | 1285 | 12.6 | 8.9 | 1.4 | |
| 2024-05-13T06:24Z | 8 | 1070 | 7.1 | 6.4 | 0.8 | |
| 2024-05-19T06:25Z | 6 | 1144 | 6.1 | 7.5 | 0.6 | |
| 2024-05-27T06:25Z | 8 | 1607 | 8.9 | 11.2 | 0.7 | 2024-05-24: Googlebot goes rogue |
| 2024-06-02T06:25Z | 6 | 2393 | 10.8 | 14.2 | 0.4 | |
| 2024-06-10T06:25Z | 8 | 2345 | 11.6 | 13.6 | 0.5 | 2024-06-08: added 503 rejections for top-3 bots |
| 2024-06-16T06:25Z | 6 | 2282 | 11.5 | 11.6 | 0.5 | |
| 2024-06-23T06:25Z | 7 | 2322 | 11.4 | 11.4 | 0.4 | |
| 2024-07-01T06:25Z | 8 | 1460 | 7.0 | 8.4 | 0.2 | 2024-06-24: Googlebot reined in; blocked from feed in robots.txt. 2024-06-28: added skipDays to podcast feed |
| 2024-07-07T06:25Z | 6 | 1562 | 7.7 | 8.7 | 0.5 | 2024-07-05: now using 503 error codes instead of 429s; more clients respond in some way to 503 |
| 2024-07-14T06:25Z | 7 | 1666 | 8.0 | 8.3 | 0.4 | |
| 2024-07-22T06:25Z | 8 | 1526 | 6.4 | 9.2 | 0.4 | |
| 2024-07-28T06:25Z | 6 | 1557 | 6.1 | 8.8 | 0.2 | |
| 2024-08-05T06:25Z | 8 | 1524 | 6.6 | 9.3 | 0.6 | |
| 2024-08-11T06:24Z | 6 | 1510 | 6.5 | 8.6 | 0.4 | 2024-08-08: RSS feed lastBuildDate now timestamp of newest primary media file in feed after any filtering, so much older than previously. 2024-08-09: RSS feed timestamped today |
| 2024-08-19T06:25Z | 8 | 1274 | 5.1 | 8.2 | 0.4 | |
| 2024-08-25T06:25Z | 6 | 1308 | 5.4 | 8.2 | 0.4 | |
| 2024-09-02T06:24Z | 8 | 1314 | 4.4 | 8.6 | 0.4 | |
| 2024-09-08T06:25Z | 6 | 1416 | 4.9 | 9.8 | 0.5 | |
| 2024-09-16T06:25Z | 8 | 1340 | 4.2 | 8.3 | 0.3 | |
| 2024-09-22T06:25Z | 6 | 1381 | 4.9 | 9.0 | 0.4 | |
| 2024-09-30T06:24Z | 8 | 1375 | 4.2 | 8.4 | 0.3 | |
| 2024-10-06T06:25Z | 6 | 1378 | 4.2 | 8.0 | 0.2 | |
| 2024-10-14T06:25Z | 8 | 1360 | 4.1 | 8.2 | 0.3 | |
| 2024-10-20T06:25Z | 6 | 1287 | 4.0 | 7.2 | 0.3 | |
| 2024-10-28T06:25Z | 8 | 1283 | 4.0 | 8.4 | 0.4 | |
| 2024-11-03T06:25Z | 6 | 1270 | 3.9 | 6.7 | 0.2 | |
| 2024-11-11T06:25Z | 8 | 1129 | 3.5 | 7.8 | 0.2 | |
A very rough estimate is that the ~1000 feed fetches per day as of consume ~100J (~100Ws, ie ~0.03Wh) for my RPi alone to service, ignoring all network and other costs up to and including the requester. Almost all of that (~99%) is unnecessary. Note that my RPi server may be significantly more energy efficient (~20x) than a standard datacentre server [varghese2014greening] [everman2018GreenWeb], while managing comparable response speed.
2024-04: size of the problem
For the EOU Web (off-grid, RPi) server hosting a mixture of static sites including EOU, over the 8 days from to , 25,881,279,225 bytes (~26GB, the sum of column 11 in the logs) were served over 301,193 requests (eg GET and HEAD), ie log lines.
Filtering for requests for /rss/podcast.rss gives 134,263,853 bytes (~134MB, ~0.5%) over 8,618 requests (~2.9%).
The traffic to all of EOU in this interval is 8,927,622,485 bytes (~9GB) over 115,247 requests, so /rss/podcast.rss is ~7.5% of EOU hits, ~1.5% of EOU bytes.
Note that this podcast RSS file does not contain the body text of articles nor audio/video content, only summaries and links. Some RSS feed files (not at EOU) contain the full text for their entries.
134MB per week or ~600MB per month (and ~7.5% of all EOU server requests) to check for new entries in the RSS feed, which emerge less than once per month on average, is excessive. And this feed has a very small number of readers, including only a very small number of direct clients polling, eg from browser RSS readers or mobile phone podcast players.
This represents a waste of CPU and bandwidth, and thus energy, for all participants, and of battery life for mobile clients. Given that the system is not run on entirely zero-carbon energy, this in turn will be hurting the climate.
Ofcom's Audio listening in the UK notes that "A fifth of adults listen to podcasts each week, with reach higher among the under 35s and those in higher socioeconomic groups. Those who do listen to podcasts listen to an average of five per week." [ofcom2024listening] This implies that a daily poll/update of each podcast feed might be a good default, rather than many times per hour!
A live view of RSS podcast hits and bytes as a fraction of EOU site desktop traffic is available. One aim is to keep these values below ~4.5% of hits and ~1% of bytes as seen on after some defensive measures were put in place, even if the number of podcast listeners goes up.
As of , about one fetch of the feed per minute is seen during the day, and that is representative...
A very rough energy estimate is that the ~1000 feed fetches per day as of consume ~100J (~100Ws, ie ~0.03Wh) for my RPi alone to service, ignoring all network and other costs up to and including the requester. Almost all of that (~99%) is unnecessary.
Scaling up to ~200k podcasts updating faster than mine [PBJ2024creation] would imply ~5kWh wasted per day just on podcast feed serving. Many feed files may be larger (eg full-text) and have far more listeners. Scaling to the ~4M podcasts in the Podcast Index directory (given that the main feed pullers do not seem to slow down for quiet feeds) suggests in excess of 100kWh wasted per day, just on podcast feed serving.
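For transparency, the back-of-envelope arithmetic behind those scaled-up figures, using the rough ~0.1J-per-fetch cost measured later in this article (all inputs are the article's own estimates, nothing more precise):

```python
J_PER_FETCH = 0.1            # ~0.1 J per fetch on the RPi (rough, measured below)
FETCHES_PER_DAY = 1000       # observed for this one feed
per_feed_J = J_PER_FETCH * FETCHES_PER_DAY        # ~100 J/day, ie ~0.03 Wh/day
for n_feeds in (200_000, 4_000_000):
    kwh_per_day = n_feeds * per_feed_J / 3_600_000   # 3.6 MJ per kWh
    print(f"{n_feeds} feeds: ~{kwh_per_day:.1f} kWh/day")
    # ~5.6 and ~111 kWh/day, matching the ~5kWh and >100kWh figures above
```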
More on problem size...
Stats
(See some of the scripting tools that I am using to extract and present data.)
Count | Bytes | User-Agent ("-" means none, ALL is total) |
---|---|---|
8618 | 134263853 | ALL |
2769 | 30608880 | "Amazon Music Podcast" |
1458 | 39332327 | "iTMS" |
653 | 6895886 | "Podbean/FeedUpdate 2.1" |
437 | 8646182 | "-" |
254 | 2713382 | "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36" |
Count | Bytes | User-Agent ("-" means none, ALL is total) |
---|---|---|
8618 | 134263853 | ALL |
1458 | 39332327 | "iTMS" |
2769 | 30608880 | "Amazon Music Podcast" |
437 | 8646182 | "-" |
653 | 6895886 | "Podbean/FeedUpdate 2.1" |
100 | 4235406 | "Podchaser (https://www.podchaser.com)" |
iTMS appears to be overwhelmingly Apple (Apple also has an itms agent), with a handful of hits from a feed validator.
So Apple and Amazon are clearly dominant in terms of traffic, and probably no one wants to complain too much because of their dominance in the market.
The anonymous (no User-Agent) traffic bears examination too.
Podbean appears to make about one request a day from each of tens of instances located in data centres (ie these appear not to be end-user podcast player requests).
Podchaser appears high in the by-bytes list because, like iTMS, it does not accept compression and thus uses ~8x more bandwidth per fetch than a client that does.
Note that for this interval requests are fairly evenly spread over 24h, with a little more traffic in UK day and evening.
Count | Bytes | Hour UTC |
---|---|---|
303 | 4573230 | 00 |
340 | 5748777 | 01 |
328 | 5203708 | 02 |
336 | 5664193 | 03 |
354 | 5792703 | 04 |
349 | 6714477 | 05 |
330 | 5144024 | 06 |
338 | 5197583 | 07 |
331 | 5551755 | 08 |
316 | 4765563 | 09 |
345 | 5205566 | 10 |
348 | 5347440 | 11 |
435 | 6084557 | 12 |
345 | 5260004 | 13 |
393 | 5699269 | 14 |
395 | 5937681 | 15 |
370 | 5690353 | 16 |
404 | 7035478 | 17 |
437 | 6078302 | 18 |
444 | 6730083 | 19 |
340 | 5415327 | 20 |
389 | 5864647 | 21 |
335 | 4920618 | 22 |
313 | 4638515 | 23 |
Looking at logs (before more aggressive 406/429 defences were raised) for to inclusive, the top-35 RSS feed bad boys/bots are:
Count | Bytes | User-Agent ("-" means none, ALL is total) |
---|---|---|
9643 | 175368483 | ALL |
2806 | 34128111 | "Amazon Music Podcast" |
2401 | 73456937 | "iTMS" |
542 | 6380012 | "Podbean/FeedUpdate 2.1" |
483 | 8501504 | "-" |
360 | 4300947 | "Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio" |
250 | 3799126 | "itms" |
242 | 2815836 | "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36" |
196 | 192996 | "FeedBurner/1.0 (http://www.FeedBurner.com)" |
192 | 2258465 | "fyyd-poll-1/0.5" |
190 | 2600966 | "Overcast/1.0 Podcast Sync (3 subscribers; feed-id=XXXX; +http://overcast.fm/)" |
141 | 1471059 | "PocketCasts/1.0 (Pocket Casts Feed Parser; +http://pocketcasts.com/)" |
123 | 1385317 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0" |
110 | 1642327 | "NRCAudioIndexer/1.1" |
103 | 1566098 | "gPodder/3.11.1 (+http://gpodder.org/) Linux" |
98 | 1150676 | "CastFeedValidator/3.6.1 (https://castfeedvalidator.com)" |
97 | 1155327 | "axios/1.5.1" |
90 | 1359564 | "TPA/1.0.0" |
88 | 1043874 | "iVoox Global Podcasting Service" |
77 | 966741 | "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" |
65 | 991562 | "PodcastRepublic/18.0" |
64 | 5675347 | "deezer/curl-3.0" |
62 | 618821 | "Aggrivator (PodcastIndex.org)/v0.1.7" |
57 | 666816 | "TuneIn-Podcast-Checker" |
53 | 815776 | "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)" |
48 | 729556 | "Podcasts/1555.2.1 CFNetwork/1237 Darwin/20.4.0" |
48 | 493782 | "Wget/1.21.3" |
44 | 669960 | "ListenNotes/3.0 (id=XXXX; +https://www.listennotes.com/about/)" |
35 | 433865 | "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.66 Safari/537.36" |
32 | 1437205 | "Podchaser (https://www.podchaser.com)" |
31 | 340478 | "SpaceCowboys Android RSS Reader / 2.6.21(306)" |
28 | 209109 | "AntennaPod/3.3.2" |
20 | 211622 | "okhttp/4.9.3" |
19 | 227454 | "Mozilla/5.0 (compatible; MuckRack/1.0; +https://muckrack.com)" |
19 | 223363 | "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36" |
Note Podnews RSS Stats: for 2024-04-15: ??? Unknown 7,101 every 0 minutes; Zapier 2,283 every 1 minutes; Google Podcasts and Search 2,199 every 1 minutes; NetNewsWire 616 every 2 minutes; PodcastAddict 609 every 2 minutes; Reeder 422 every 3 minutes; Amazon Music Podcasts 345 every 4 minutes; Overcast 288 every 5 minutes; iHeartRadio 248 every 6 minutes; FreshRSS 233 every 6 minutes; AntennaPod 230 every 6 minutes; ...
2024-04-23: Spotify
A special 'lite' podcast feed was created, which is the normal feed minus video and music-only episodes, and limited to the most recent ~10 episodes. This feed was added to Spotify but not otherwise publicised.
Spotify is polling about every 7 minutes, and does seem to support at least gzip compression, and is doing conditional GETs.
2024-05-25: energy estimate
Running the following on my laptop on the same LAN as the server, to estimate the time taken to service a request (resulting in a 406 fail) for the feed file:
% curl -so /dev/null -w '%{time_total}\n' https://www.earth.org.uk/rss/podcast.rss
yields a time on average of ~0.09s; the server effort should largely fall within that window (other than TCP connections fully closing).
From an ~1W/~2W power consumption estimate of the RPi server when idle and busy, that implies roughly 0.09Ws or 0.09J cost to fail each such fetch.
A successful (compressed-response) same-LAN fetch takes on average ~0.14s, thus 0.14J.
Figures of 44.5ms / ~10.4mJ for a 10kB HTTPS request on an NGINX static Web server on a BeagleBone AI (similar to a Raspberry Pi) [steven2023solar], and of up to 180 requests served per J (so as little as ~6mJ per request) at 2500 requests per second for 30kB static files from a small ARM cluster (four PandaBoard development boards with dual-core Cortex A9 MPCore) [ou2012ARM], suggest that the above estimates are at least plausible.
Rejecting a fetch seems to save a third of the time, and thus possibly energy.
% curl --compressed -so /dev/null -w '%{time_total}\n' https://www.earth.org.uk/rss/podcast.rss
A very rough estimate is that the ~1000 feed fetches per day as of consume ~100J or ~100Ws or ~0.03Wh for my RPi alone to service, ignoring all network and other costs up to and including the requester.
(Note that my RPi server may be significantly more energy efficient (~20x) than a standard datacentre server [varghese2014greening], also [everman2018GreenWeb], while managing comparable response speed.) Looked at in other ways:
- the entire server takes ~24Wh per day to run when lightly loaded, at ~1W
- assuming other limits are not hit, the server could handle ~600k compressed feed fetches per day, which would be ~6GB at 10kB, excluding TCP and TLS overheads
- my outbound bandwidth, ~80Mbps, which the RPi probably could not fill, is ~900GB per day
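A quick sanity check of the bullet figures above from the same rough inputs (a sketch with assumed round numbers, not a benchmark):

```python
IDLE_W, BUSY_W = 1.0, 2.0        # RPi power estimates, idle vs busy
T_FAIL, T_OK = 0.09, 0.14        # seconds per rejected (406) / successful fetch
print((BUSY_W - IDLE_W) * T_FAIL, "J per rejected fetch")      # ~0.09 J
print((BUSY_W - IDLE_W) * T_OK, "J per successful fetch")      # ~0.14 J
max_fetches = int(86_400 / T_OK)                               # back-to-back limit
print(max_fetches, "fetches/day max")                          # ~617k, ie ~600k
print(max_fetches * 10_000 / 1e9, "GB/day at 10kB each")       # ~6 GB
```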
Note that every request also takes space in the logs, which is more wear on the RPi's SD card, and also more time for all analyses and protections to process.
Interactions with Technology Providers
Various providers of pieces of the technology puzzle (eg aggregators, mobile podcast app writers) were contacted to better understand behaviour of their systems, and possibly nudge them in a good direction.
Some of the interactions are summarised below.
Linux Audit
Please also see the Linux Audit parallel work RSS is cool! Some RSS feed readers are not (yet) [boelen2024cool], including:
- Slackbot
- Newsboat
- Selfoss
- Feedbin
- Tiny Tiny RSS
- Miniflux
- Nextcloud
- Feed on Feeds / SimplePie
- Feedly
LA pointed out as relevant the openrss.org issues list [ORSStracker].
See also [kroll2024practices] and [kroll2024roundup].
Email and other content has been edited to preserve confidentiality, etc, as appropriate:
The Earth Notes Podcast RSS has been registered with Amazon Music for Podcasters. Amazon serves as an aggregator and catalogue.
On I sent Amazon (UK) podcasting an email containing:
May I ask why you are polling my podcast RSS feed every few minutes when it usually updates only every few weeks? Probably more than all other users combined...
(See a sample of the log below.)
Also the skipHours in the RSS and the 3h+ Cache-Control / Expires HTTP headers that I have set seem to be ignored, and there appears to be no attempt to use If-Modified-Since or If-None-Match.
What am I doing wrong?
RSS file start:
Log sample:
Note that the Amazon requests come in from a large variety of IP addresses, with those checked being from within the compute.amazonaws.com zone. Throwing 429 (Too many requests) codes at Amazon slows it down about 10-fold. After being prodded, the US-Global support team replied:
Please note, this request goes beyond the scope of support our team offers, and therefore will take some time before we receive a response from the engineers.
As of , Amazon is not setting conditional request headers nor allowing Brotli compression (Apache mod_log_forensic extract under Amazon below).
2024-07-11: Amazon says that "All our operations now run on renewable energy": but pushing energy waste downstream onto feed hosts is in part greenwashing.
The Earth Notes Podcast RSS has been registered with Apple's iTunes podcast catalogue. Apple serves as an aggregator and catalogue, and hosts the de facto canonical podcast catalogue.
Apple says at RSS feed refresh:
Apple Podcasts checks RSS feeds frequently to detect new episodes and any other metadata or artwork changes so that listeners have access to the latest as soon as possible.
These changes usually display quickly — often within a few hours. You can view the time and date when each show was last refreshed from your show information pages in Apple Podcasts Connect.
On I contacted Apple via its Podcasts for Creators portal, including the following:
Is there any way that I can set your RSS fetcher to honour Cache-Control (or Expires) and/or SkipHours? Currently it does not seem to.
My server is off grid and I'd prefer polling to be minimised in the hours I include (23Z to 07Z).
Done right this could save a lot of bandwidth, CPU and carbon for you and the servers that you poll.
An initial response said that
I responded with:
This could be added to the other simple technical fixes that Apple already implements to reduce carbon emissions from unnecessary CPU and bandwidth use.
I note that your agent polls very frequently and often does not even use compression, ie is not compliant with even basic de facto etiquette.
Some example Apple fetches, including uncompressed GETs, are shown under Apple below.
On I was provided with links to Apple Podcasts feedback, Environment, and the contact email for environment report feedback.
There is also poor behaviour like this (all from the same IP address):
As of , Apple is not setting conditional request headers nor allowing any compression (Apache mod_log_forensic extract under Apple below). As of , at least one feed does get conditional request headers from iTMS: both If-None-Match and If-Modified-Since. But still no Accept-Encoding, so no compression is possible! For them itms makes a conditional request and allows gzip. It seems likely that the conditional headers in the first case are actually being injected by a CDN being used, and thus there is a lot of hidden wasteful chat between Apple and the CDN, putting up costs and emissions.
As of , with the RSS feed file not updated since (timestamped) August 9th, polls seem to be hourly, but with two HTTP calls per poll (see the log under Apple below).
AntennaPod uses conditional fetches for the RSS feed file. When set with a 12h refresh interval, log entries for the feed fetch are as shown under AntennaPod below (noting an underlying feed file change before the 200 entry). I also happened to notice the later (unconditional) polls from one IP address by ~19:00Z, even though the RSS file had been unchanged for more than 2 months (2024-08-09).
Feeder is an open-source feed reader and podcast player for Android mobile devices. I noticed its user agent in the Earth Notes logs.
I asked (by logging an 'idea'):
To which the author responded:
regular http cache-control is already supported.
what's skiphours?
I pointed the author at the skipHours definition in the RSS 2.0 spec [RAB2009RSS]; he noted that any implementation would result in stochastic behaviour for users, and I added a further thought (see the exchange under Feeder below).
The author noted in the exchange that in version 2.6.20 (of )
I gave 2.6.21 a sneaky test run and the log showed:
After loading the new app version, telling it the feed URL and messing around, then forcing a Sync feeds, the feed was not reloaded until I picked the phone up; see the details under Feeder below. A set of ~hourly interactions for the 2.6.20 Feeder version, by another user for a different feed during a period where it was unchanged, is also shown there.
I have seen what appears to be one other user upgrade to 2.6.21, and the RSS polling traffic is tiny, even if still unconditional. So I am recommending Feeder to my podcast page visitors.
Three different clients (the last three hits are from the same client) are all getting 304s; see the log under Feeder below. In a further update I noted:
skipHours are not exposed by gofeed (in shared model) so it can't be seen by feeder atm
and I explained:
The Earth Notes Podcast RSS has been registered with the fyyd directory. It seems to poll unconditionally for updates hourly: no 304 codes are returned even when the feed file is not changing. I emailed a suggestion:
Would it be possible to support the RSS SkipHours tag in future, and/or respect the Cache-Control/Expires/ETag headers from the fetch?
As of , Googlebot started behaving badly on the feed, retrying again a minute after each 429 response. This is a worse variant of the Podbean behaviour, which at least stops after ~3. This racked up over 800 accesses by the end of the day. Googlebot went in the greedy bots list and started receiving the most aggressive rate limiting controls. There was a conversation in Mastodon DMs with someone at Google. An interesting point they raised was:
I responded to various bits:
I guess that this was 'search' crawling. But yes, if the bot would read the skipHours data in the RSS feed file it would know not even to try between 22:00Z and 07:59Z. The bot is still banging away so hard that it's now outpacing Spotify and is in the sin bin, so I think it is just broken/confused! First request to your engineering colleagues: please respect any Retry-After in a 429 response (and/or Cache-Control max-age in a 200/304) as far as possible! That could save a lot! Today's behaviour is very abnormal for Googlebot, and normally I see it adapting quite nicely to things going on in my site. I have been coaxing it to get more 304s of late by working round what turns out to be a bug in Apache since ~2008!
My RPi idle is ~1W, and busy ~2W. I have not (yet) quantified microjoules per byte! What I'd especially like is for my server to be left longer in sleep mode, especially at night. That wakeup and the pointless flurry of packets when nothing has changed is, I suspect, the biggest cost, which is why I'd like Apple et al to pay attention to skipHours + Cache-Control max-age + Retry-After! And later...
Here is a VERY rough initial estimate of energy demand on my RPi to service a request for the feed:
A successful (compressed-response) same-LAN fetch seems to take ~0.14s, thus 0.14J, given an idle/busy power on the RPi of ~1W/~2W:
Scaling to the ~4M podcasts in the Podcast Index directory (given that the main feed pullers do not seem to slow down for quiet feeds) suggests in excess of 100kWh wasted per day, just on podcast feed serving.
It is pleasant not to just be effectively told to go away.
A successful (compressed-response) same-LAN fetch seems to take ~0.14s, thus 0.14J, given an idle/busy power on the RPi of ~1W/~2W:
(A 406 rejection takes ~0.09s.) Still going at 10:00Z, polling about once per minute in the face of many-hour runs of 429s for a file that has not changed since 2024-05-12T13:04Z!
I have this evening directly reported the issue:
As of , I have what looks like one client (one IP) downloading unconditionally every 20 minutes, which puts it on the edge of the top-10 (ie including Apple, Amazon and Spotify)!
I have not seen a 304 for over a week, which may be to do with me turning off ETags (log under gPodder below).
I have filed an issue asking if gPodder could implement If-Modified-Since (or skipHours, or Cache-Control/Expires to avoid a premature re-poll), on the grounds that this is likely to affect more than just me. I had a quick response (test URL now supplied):
Yes, we support If-Modified-Since.
Please provide a test url if possible, to debug...
We don't support Cache-Control or Expires at the moment, only Etag/If-Modified-Since.
It seems that 3.11.4 (unreleased) and the 3.11.1 should support Last-Modified, and did on a test.
But in any case the aberrant polling stopped spontaneously, for most nights...
An odd thing is visible in Apache mod_log_forensic output:
This client is sending a fine If-Modified-Since, but is also sending an If-None-Match even though I have not been generating any ETag for many days! (The latter may prevent the former from working.) Here is what seems to be a better request, possibly from the package maintainer:
The maintainer says:
yes, completely: we keep the stored value if the server returns None or empty.
... Let's make a quick fix where if either last-modified or etag is present and the other is missing we clear the stored value.
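A sketch of what that quick fix amounts to (my illustration, not gPodder's actual code; the store layout is an assumption): after each successful fetch, both stored validators are overwritten with whatever the server sent, so a stale ETag can no longer be sent alongside a fresh If-Modified-Since.

```python
def update_validators(store: dict, response_headers) -> None:
    # Overwrite BOTH validators from the latest 200 response; a header the
    # server omitted clears the stored value instead of leaving a stale one.
    store["etag"] = response_headers.get("ETag")
    store["last_modified"] = response_headers.get("Last-Modified")
```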
I sent a message to iHeartRadio support to let them know that it is sending a bogus very old If-Modified-Since header, and thus wasting lots of bandwidth (Apache mod_log_forensic extract under iHeartRadio below).
As at , with Version 6.1.4 (6120) on macOS, default behaviour is to update the feeds every hour with a conditional GET (log under NetNewsWire below).
On issue Handle rate limiting for openrss.org #4224 (see below) I have commented:
May I suggest treating it and Cache-Control max-age the same, to maintain a don't-update-before internal date?
Slow-updating feeds really don't need to be checked every hour if they tell you to cache things or go away for several hours!
And you'll help save some battery life and data charges and carbon emissions for client and server too...
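One way to implement that suggestion (a sketch under my own naming, not NetNewsWire's code) is to fold Cache-Control max-age on 200/304 responses and Retry-After on 429/503 responses into a single per-feed "do not update before" time:

```python
import re
import time
from email.utils import parsedate_to_datetime

def do_not_update_before(status: int, headers) -> float:
    """Unix time before which this feed should not be refetched (0 = no hint)."""
    if status in (200, 304):
        m = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
        if m:
            return time.time() + int(m.group(1))
    elif status in (429, 503):
        ra = headers.get("Retry-After", "")
        if ra.isdigit():                          # delta-seconds form
            return time.time() + int(ra)
        try:                                      # HTTP-date form
            return parsedate_to_datetime(ra).timestamp()
        except (TypeError, ValueError):
            pass
    return 0.0
```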
Dave of Podcast Index commented on the Fediverse about buggy servers and workarounds:
And he offered up source code to boot!
Another good thing that others should do:
At Dave's suggestion I looked at how Aggrivator is polling my feed:
He may add some logic to explicitly honour Cache-Control rather than delegating it entirely to his request library. I suggested to him: maybe treat it just as you may be treating Retry-After already, ie as a "do not try again before", maybe capped at ~24h for broken servers. And I plugged implementing (and helping revive) skipHours!
I sent an email to Podnews about its then rejection of the EOU feed (see Podnews below). I am hopeful from further emails that Captivate and RSS.com should now soon switch to GZIPped feeds. In A saving bandwidth special! there was lots of good stuff. Podnews has now added skipHours and skipDays to its feed!
As of a special 'lite' podcast feed was created, which is the normal feed minus video and music-only episodes, and limited to the most recent ~10 episodes. This feed was added to Spotify but not otherwise publicised.
Spotify is polling about every 7 minutes, and does seem to support at least gzip compression, and is doing conditional GETs. But Spotify seems to generate malformed conditional request headers, at least sometimes, eg:
If-Modified-Since: Tue, 7 May 2024 09:46:22 GMT
rather than:
If-Modified-Since: Tue, 07 May 2024 09:46:22 GMT
This is causing some of the defences to reject some requests, believing the requests to be unconditional, though it looks as if Apache may be more tolerant.
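Clients can avoid that class of bug by never hand-formatting HTTP dates: the IMF-fixdate form of RFC 9110 is fixed-length, with a zero-padded day. A minimal Python illustration:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

dt = datetime(2024, 5, 7, 9, 46, 22, tzinfo=timezone.utc)
print(format_datetime(dt, usegmt=True))   # Tue, 07 May 2024 09:46:22 GMT
```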
One request extract from Apache mod_log_forensic is shown under Spotify below. Spotify is also not accepting br (Brotli) compression. I raised both points with Spotify support on chat.
After about : Spotify has stopped sending If-Modified-Since at all, ho hum (see below).
I have been seeing intermittent use of If-Modified-Since by Spotify, and I am not the only one. So I have reported via their online support function again. I did also note that I received no response at all via email to the previous point, FWIW. Even to disagree.
TuneIn (RSS feed fetcher user agent TuneIn-Podcast-Checker) hosts a podcast directory. It seems to poll faster than hourly, not respecting HTTP cache control or RSS skipHours. It seems to be hosted on AWS (Amazon Web Services).
I used the contact form to ask:
RSS feed polling excessively
Is there any way that I can get your RSS fetcher to honour Cache-Control (or Expires) and/or SkipHours? Currently it does not seem to, and is polling far more often than makes sense. I am concerned about climate impact.
After several miscommunications, including attempting to create me an account, I sent further explanation:
I am referring to how often you poll my RSS feed at https://www.earth.org.uk/rss/podcast.rss
It updates with new content about monthly.
You poll it about every 30 minutes, and don't seem to pay any attention to Cache-Control, Expires, Last-Modified or ETag, nor the skipHours tag (or other update-hint tags) in the RSS feed itself, eg:
You are wasting a tremendous amount of your CPU time and bandwidth and feed providers' (such as me), with an accompanying hit on all our bills and climate emissions. I only have a small off-grid server which is not updating the feed overnight for example.
Is there anything we can do to make this better?
I note that some of the other services polling the same feed are making use of at least some of those fields and hints.
Less-than-monthly according to Listen Notes I received a response offering to offer to extend the polling interval on my feed from 4h to 40h because I accepted the increase to 40h, but asked:
But what do you mean by "headings are unfortunately not reliable across our directory”? HTTP cache control headers are very basic, and if you don't trust them entirely, you can limit whatever cache life you see to (say) 1 day or even 12h, vastly reducing pointless polling traffic (and climate emissions) for many (slow) feeds.
... and my ticket was closed!
I asked anyway:
Things are looking a little better. Note that each IP address (other than for ) is still not quite down to one poll every 40h (more like 6h!), but much better anyhow! I hope that TuneIn also at least thought about how wastefully it is polling everyone else too...
On I asked Podbean to implement any of skipHours, Expires, Cache-Control, If-Modified-Since, since their traffic was very visible in my logs.
This got a response a couple of days later:
I responded that their polls (from many IPs, which appear to be, probably their own, datacentre-based bots) are far more than daily, eg in one log sample I sent them, more like ~8 per hour, and then in another, 22 in under one minute...
Podbean traffic is still very visible in my logs, now showing up at #3 by hits. So I have emailed again:
I do not know if those are many separate human clients, or all your machines in datacentres (all the IPs that I have checked are datacentre-based).
But you are showing up as #3 of ALL clients polling my RSS feed. Only Amazon and iTunes' horrible implementations consume more than you.
And most of those polls are completely redundant, ie if you did If-Modified-Since conditional GETs on all but (say) at most one poll per day you'd get 304 responses consuming less of your and my and Internet resources, and heating the planet less, with no worse outcomes.
And there must be a bug lurking given a Cache-Control value of never less than ~4h on the feed [successive hits from the same IP]:
So please re-consider honouring Cache-Control, use If-None-Match, and ideally also observe skipHours in the RSS file, or even do something smarter like other clients seem to, such as looking at the interval between recent updates/episodes.
At the moment I am rejecting ~25% of all your requests.
Also this client has poor behaviour when batted away with a 429. I had a response:
Our technical team has thoroughly reviewed the requests from Podbean and confirmed that the frequency of less than 200 requests per day falls within the normal range. There are no issues with this level of activity. If you have any further questions or concerns, feel free to reach out to us anytime.
To which I replied:
200 requests per day for something that updates less than once per month is as much as 10,000 (ten thousand) times too often, and a 99.99% waste of resources. Even given a more typical weekly podcast update frequency this is maybe >1000x too often. I'd urge your team to reconsider, for the sake of the climate and your clients' bandwidth bill and battery life if nothing else.
But in any case, there seems to be a bug around 429 that should be fixed:
When asked to slow down with 429, polling *faster* is bad. There is a Retry-After header present which ideally you should be honouring, but maybe wait at least wait an hour or so if you want to keep the code simple.
I received an automated "did we fix your problem" email, to which I answered "no" on the grounds that none of the issues that I had reported had been addressed, and that wasting creator bandwidth and energy etc is not good.
As of , Podbean does (lots of) unconditional requests, unusually over HTTP/2 (per the Apache logs).
On I happened to notice an odd double request in logs, and emailed the project owner:
I noticed what may be a bug in the app looking at my logs.
First a request that worked (albeit ignoring the skipHours set in the RSS feed).
Then immediately following is a redundant request that I rejected because it will have been making a bad request in some different way:
I can see this duplicate request pattern at various times in my logs.
The author replied very quickly:
The app only accesses the RSS feed once, unless an error is returned in which case it will retry by chaining some headers parameters.
So following a 200 response code, the app will not reconnect on its own. However, if the user for some reason presses the refresh button quickly multiple refreshes will happen again
I sent him another 06:00Z example to look at, from two days earlier.
He replied that his app relies on ... Note that ... The author says that ...
More on interactions...
Misc and Open Issues
(... HEAD in favour of conditional GET)!
cleed, who already supports conditional GET and compression, and is working on my other wish-list/TL;DR items of Cache-Control, skipHours/skipDays and 429/503 Retry-After.
PodscanBot seems relatively well behaved and visits about 4 times per day, usually allowing compression, and their crawler page says Podscan practices bandwidth and connection reduction strategies.
But the bot does not seem to use Cache-Control max-age or If-Modified-Since or skipHours or skipDays, so I emailed to ask... A helpful and detailed reply arrived the next day, including a note that "... we do use both If-Modified-Since and ETag headers ... This has been the most impactful refactoring. Cut down bandwidth to under 10% of what it used to be..."
Cache-Control max-age and Retry-After; it already does gzip and If-Modified-Since apparently.
Please consider making (RSS) skipHours as easy to use as possible to help gofeed clients minimise feed energy and bandwidth impacts, especially in the light of more and more grid power coming from solar.
Please also consider somehow facilitating responding well to (200 and 304) Cache-Control max-age and (429 and 503) Retry-After in each case to generate a "do-not-refetch-before" time?
Support Cache-Control/Expires and skipHours.
Support for SkipHours and Cache-Control/Expires.
Consider supporting skipHours and/or If-Modified-Since.
Amazon
...
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:podcast="https://podcastindex.org/namespace/1.0" xml:lang="en-gb">
<channel>
<atom:link href="https://www.earth.org.uk/rss/podcast.rss" rel="self" type="application/rss+xml"/>
<title>Earth Notes Podcast</title>
<description>All things green and efficient @Home in the UK, cutting carbon and improving comfort.</description>
<link>https://www.earth.org.uk/SECTION_podcast.html</link>
<language>en-gb</language>
<itunes:author>Earth Notes / Damon Hart-Davis</itunes:author>
<itunes:owner><itunes:email>d@hd.org</itunes:email></itunes:owner>
<itunes:image href="https://www.earth.org.uk/img/wordcloud/podcast-1.png"/>
<itunes:category text="Education"/>
<itunes:category text="Technology"/>
<itunes:explicit>no</itunes:explicit>
<podcast:location geo="geo:51.406696,-0.288789,16">16WW, Kingston-upon-Thames, UK</podcast:location>
<ttl>367</ttl>
<skipHours><hour>0</hour><hour>1</hour><hour>2</hour><hour>3</hour><hour>4</hour><hour>5</hour><hour>6</hour><hour>7</hour></skipHours>
<item><title>2024-01-28 Diarycast - Year In Review (2023)</title><description>The rollercoaster thrills and spills of 2023 at EOU Towers... #podcast #yearInReview</description><link>https://www.earth.org.uk/diarycast-20240128.html</link><guid isPermaLink="false">img/audio/diary/20240128.mp3</guid><enclosure url="https://www.earth.org.uk/img/audio/diary/20240128.mp3" length="4859775" type="audio/mpeg"/><pubDate>Sun, 28 Jan 2024 13:51:53 GMT</pubDate><itunes:duration>271</itunes:duration></item>
[12/Mar/2024:05:33:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:05:35:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:05:41:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:05:46:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:05:47:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:05:53:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:05:59:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:05:59:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:05:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:11:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:12:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:17:15 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:23:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:25:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:29:28 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:35:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:38:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:41:15 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:47:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:51:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:53:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:06:59:15 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8351 "-" "Amazon Music Podcast"
[12/Mar/2024:07:04:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8350 "-" "Amazon Music Podcast"
[12/Mar/2024:07:05:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8350 "-" "Amazon Music Podcast"
[12/Mar/2024:07:11:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 8350 "-" "Amazon Music Podcast"
...
(Apache mod_log_forensic extract):
GET /rss/podcast.rss HTTP/1.1|User-Agent:Amazon Music Podcast|Host:www.earth.org.uk|Connection:Keep-Alive|Accept-Encoding:gzip,deflate
Apple
...
I've received confirmation from our internal teams that we do not provide any technical support for the implementation of your requested changes.
...
Some example Apple fetches, including uncompressed GETs:
[02/Apr/2024:19:01:14 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"
[02/Apr/2024:19:01:14 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"
[02/Apr/2024:19:01:14 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 79283 "-" "iTMS"
[02/Apr/2024:19:16:36 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"
[02/Apr/2024:19:16:36 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"
[02/Apr/2024:19:16:37 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 79283 "-" "iTMS"
[02/Apr/2024:19:32:53 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"
[02/Apr/2024:19:32:53 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"
[02/Apr/2024:19:32:53 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 79283 "-" "iTMS"
[02/Apr/2024:19:51:44 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3599 "-" "iTMS"
[02/Apr/2024:19:51:44 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 412 "-" "iTMS"
[02/Apr/2024:19:51:44 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 79283 "-" "iTMS"
[30/Apr/2024:08:03:57 +0000] "GET /img/audio/diary/20200726/20200722-Waterloo-station-ticket-barriers-keep-your-distance-signage-sq-1000w.jpg HTTP/1.1" 200 48396 "-" "iTMS"
[30/Apr/2024:08:03:57 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[30/Apr/2024:08:03:57 +0000] "GET /img/audio/podcast-furniture/title/statscast-1.png HTTP/1.1" 200 4174 "-" "iTMS"
[30/Apr/2024:08:03:57 +0000] "GET /img/audio/podcast-furniture/title/statscast-1.png HTTP/1.1" 200 4174 "-" "iTMS"
[30/Apr/2024:08:03:57 +0000] "HEAD /img/site/podcast/20200523-Ambient-haiku.png HTTP/1.1" 200 339 "-" "iTMS"
[30/Apr/2024:08:03:58 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[30/Apr/2024:08:03:58 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
"-" "iTMS"
[30/Apr/2024:08:03:58 +0000] "GET /img/audio/podcast-furniture/title/metacast-1.png HTTP/1.1" 200 4109 "-" "iTMS"
[30/Apr/2024:08:03:59 +0000] "GET /img/audio/podcast-furniture/title/diarycast-1.png HTTP/1.1" 200 4180 "-" "iTMS"
[30/Apr/2024:08:03:59 +0000] "GET /img/audio/podcast-furniture/title/metacast-1.png HTTP/1.1" 200 4109 "-" "iTMS"
iTMS is fetching and re-fetching the same cover art repeatedly, with no attempt to cache or de-duplicate even within one batch run. This may not be Apple's preferred use case (iTunes would prefer every episode's cover art to be distinct), but clearly no senior engineer that cares about efficiency has been given any time on this. Because everyone just has to put up with whatever Apple does? Handy that I made my images very compact.
(Apache mod_log_forensic extract):
|HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
With <lastBuildDate>Mon, 01 Jul 2024 18:52:00 GMT</lastBuildDate> and the feed timestamped Aug 9 00:17 rss/podcast.rss, Apple iTMS is down to hourly polling (though still double/treble and uncompressed and unconditional!):
[21/Aug/2024:09:12:34 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3536 "-" "iTMS"
[21/Aug/2024:10:15:57 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3536 "-" "iTMS"
[21/Aug/2024:10:15:58 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3536 "-" "iTMS"
[21/Aug/2024:11:06:07 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3634 "-" "iTMS"
[21/Aug/2024:11:06:07 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
[21/Aug/2024:12:04:15 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3731 "-" "iTMS"
[21/Aug/2024:12:04:16 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 362 "-" "iTMS"
[21/Aug/2024:12:04:16 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 113005 "-" "iTMS"
[21/Aug/2024:13:01:37 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3537 "-" "iTMS"
[21/Aug/2024:13:01:39 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3537 "-" "iTMS"
www.earth.org.uk:443 17.58.56.7 - - [25/Sep/2024:11:04:01 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
www.earth.org.uk:443 17.58.59.23 - - [25/Sep/2024:11:59:08 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3634 "-" "iTMS"
www.earth.org.uk:443 17.58.59.23 - - [25/Sep/2024:11:59:08 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
www.earth.org.uk:443 17.58.57.20 - - [25/Sep/2024:12:50:31 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 3731 "-" "iTMS"
www.earth.org.uk:443 17.58.57.20 - - [25/Sep/2024:12:50:31 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 200 362 "-" "iTMS"
www.earth.org.uk:443 17.58.57.20 - - [25/Sep/2024:12:50:31 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 113005 "-" "iTMS"
www.earth.org.uk:443 17.58.59.22 - - [25/Sep/2024:13:46:51 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 3634 "-" "iTMS"
www.earth.org.uk:443 17.58.59.22 - - [25/Sep/2024:13:46:51 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 406 265 "-" "iTMS"
www.earth.org.uk:443 17.58.57.102 - - [25/Sep/2024:14:44:44 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3537 "-" "iTMS"
www.earth.org.uk:443 17.58.57.104 - - [25/Sep/2024:14:44:45 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3537 "-" "iTMS"
AntennaPod
Conditional fetches, noting an underlying feed file change before each 200 entry:
[01/Apr/2024:13:30:24 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11144 "-" "AntennaPod/3.2.0"
[02/Apr/2024:06:59:47 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 14443 "-" "AntennaPod/3.2.0"
[02/Apr/2024:19:07:46 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11266 "-" "AntennaPod/3.3.2"
[03/Apr/2024:08:12:00 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3443 "-" "AntennaPod/3.3.2"
[03/Apr/2024:20:12:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 14445 "-" "AntennaPod/3.3.2"
[04/Apr/2024:08:14:20 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 280 "-" "AntennaPod/3.3.2"
[25/Oct/2024:04:56:55 +0000] "GET /rss/podcast.rss HTTP/2.0" 503 388 "-" "AntennaPod/3.5.0"
[25/Oct/2024:10:02:42 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
[25/Oct/2024:13:07:11 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
[25/Oct/2024:14:54:24 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
[25/Oct/2024:14:54:34 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
[25/Oct/2024:15:30:31 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
[25/Oct/2024:17:57:25 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
[25/Oct/2024:18:53:14 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
[25/Oct/2024:18:54:25 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 12524 "-" "AntennaPod/3.5.0"
Feeder
Have you considered support for these (RSS-feed-specified SkipHours tag, and server-supplied HTTP expiry time/date) to reduce bandwidth and CPU?
He noted that:
Regarding skipHours, any implementation would result in stochastic behavior for users. The feature was designed for servers which can pick when they sync, but Feeder is not not in control of when its background sync runs. This is determined by Android.
A thought: maybe during skipHours you could avoid actually doing a poll when woken when there have been no non-skipHours since your last poll. The source is telling you that you will not (likely) have missed any change in that time.
One quirk is that Feeder will revalidate the cache if last sync is older than 15 minutes.
And in version 2.6.21 (of ) one of the fixes is "Tweaked Cache-Control headers to respect site headers even more".
2.6.21
[03/Apr/2024:18:42:04 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 10965 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[03/Apr/2024:18:42:05 +0000] "GET /SECTION_podcast.html HTTP/2.0" 200 11259 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[03/Apr/2024:18:43:23 +0000] "GET /img/wordcloud/podcast-1.png HTTP/2.0" 200 71167 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[03/Apr/2024:18:43:30 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 10965 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[03/Apr/2024:18:51:32 +0000] "GET /img/site/podcast/20200523-Ambient-haiku.png HTTP/2.0" 200 90726 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[03/Apr/2024:19:12:31 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 10965 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[04/Apr/2024:06:38:51 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 10965 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[04/Apr/2024:13:49:26 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 10965 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[04/Apr/2024:18:40:12 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 10965 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
After forcing a Sync feeds (), the feed was not reloaded until I picked the phone up at . Feeder is set to the default nominal 1h between refreshes of the feed. (Though Feeder then did unconditional fetches (200) which should probably have been 304, given that the feed file was unchanged since , and ideally deferred until after skipHours, ie .) Good progress!
[03/Apr/2024:17:17:28 +0000] "GET /rss/note-on-site-technicals.rss HTTP/2.0" 200 7285 "-" "SpaceCowboys Android RSS Reader / 2.6.20(305)"
[03/Apr/2024:18:24:03 +0000] "GET /rss/note-on-site-technicals.rss HTTP/2.0" 200 7285 "-" "SpaceCowboys Android RSS Reader / 2.6.20(305)"
[03/Apr/2024:19:25:25 +0000] "GET /rss/note-on-site-technicals.rss HTTP/2.0" 200 7285 "-" "SpaceCowboys Android RSS Reader / 2.6.20(305)"
[03/Apr/2024:20:25:40 +0000] "GET /rss/note-on-site-technicals.rss HTTP/2.0" 200 7285 "-" "SpaceCowboys Android RSS Reader / 2.6.20(305)"
[03/Apr/2024:21:25:44 +0000] "GET /rss/note-on-site-technicals.rss HTTP/2.0" 200 7285 "-" "SpaceCowboys Android RSS Reader / 2.6.20(305)"
These are 304s, since I also turned off ETags for RSS feed files to avoid a long-standing bad interaction with mod_deflate in Apache:
[15/Apr/2024:07:02:11 +0000] "GET /rss/saving-electricity.rss HTTP/2.0" 304 93 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[15/Apr/2024:07:02:29 +0000] "GET /rss/note-on-site-technicals.rss HTTP/2.0" 304 93 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[15/Apr/2024:09:07:18 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 223 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[15/Apr/2024:09:07:18 +0000] "GET /rss/saving-electricity.rss HTTP/1.1" 304 223 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
[15/Apr/2024:09:07:18 +0000] "GET /rss/note-on-site-technicals.rss HTTP/1.1" 304 223 "-" "SpaceCowboys Android RSS Reader / 2.6.21(306)"
Feeder seems to be playing very nicely with my site!
The author responded:
...
Listen, in 10 years of developing this reader I've never encountered skipHours before. It's a dead feature
My aim is for skipHours at least to be revived to help with energy conservation and fixing the climate, particularly in the typical podcaster's night time, given the huge amounts of solar going into world power grids. Simply avoiding doing unnecessary work when energy would have to come from storage will reduce the need for storage. There is a chicken-and-egg problem that no one will want to work with that tag until enough other people already are. But clearly *you* don't have to somehow fix that single-handed, and getting RSS feeds fixed is only a tiny part of the wider energy problem. And your other fixes already put you streets ahead of the big boys, so thank you!
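The skip-poll thought above might look roughly like this (my sketch, not Feeder's code; the function names are mine), parsing the <skipHours> block shown in the feed extract earlier: when the OS wakes the app at an arbitrary time, poll only if at least one non-skipped UTC hour has begun since the previous poll.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

def skip_hours(rss_text: str) -> set:
    """Hours (UTC) the feed says not to bother polling in, eg {0, 1, ..., 7}."""
    root = ET.fromstring(rss_text)
    return {int(h.text) for h in root.iterfind("channel/skipHours/hour")}

def worth_polling(last_poll: datetime, now: datetime, skipped: set) -> bool:
    """True if any non-skipped whole hour has begun since the last poll."""
    t = last_poll.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    while t <= now:
        if t.hour not in skipped:
            return True     # the feed could have updated in a publishable hour
        t += timedelta(hours=1)
    return False            # only skipped hours so far: the feed says nothing new
```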
fyyd
...
fyyd makes unconditional fetches:
GET /rss/podcast.rss HTTP/2.0|User-Agent:fyyd-poll-1/0.5|Accept:*/*|Accept-Encoding:gzip,deflate|Host:www.earth.org.uk
Google
... one thing which might be skewing your metrics (though looking at the numbers not significantly) is that we use Googlebot for various parts of websearch, and it's not necessarily possible to differentiate externally. Eg, how many of the RSS requests were for websearch vs for podcast search?
They were also interested in the underlying thesis:
... given your use of an RPI as a server, I wonder if you'd have a chance to monitor the actual energy usage for various kinds of requests? My assumption has been that crawling is visible (log files), but overall not a big energy consumer compared to the web's energy consumption overall.
% curl --compressed -so /dev/null -w '%{time_total}\n' https://www.earth.org.uk/rss/podcast.rss
Energy estimate
% curl --compressed -so /dev/null -w '%{time_total}\n' https://www.earth.org.uk/rss/podcast.rss
2024-05-27: continues...
2024-06-08: still rogue
Googlebot is generally sensible, but is pulling rss/podcast.rss every minute, even if given a 429 or 503 response code with a long Retry-After delta-seconds header.
gPodder
[29/Apr/2024:13:24:54 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[29/Apr/2024:13:44:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[29/Apr/2024:14:05:12 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[29/Apr/2024:14:25:01 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
I have not seen a 304 for over a week, which may be to do with me turning off ETags. Earlier, with 304s:
[21/Apr/2024:08:02:49 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3473 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[21/Apr/2024:08:22:37 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3473 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[21/Apr/2024:08:42:33 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3473 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[21/Apr/2024:09:02:34 +0000] "GET /rss/podcast.rss HTTP/1.1" 304 3473 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
... the Last-Modified and Etag headers are stored in DB alongside the podcast and relevant headers are added to the query if present.
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-agent:gPodder/3.11.1 (+http%3a//gpodder.org/) Linux|Accept-Encoding:gzip, deflate, br|Accept:*/*|Connection:keep-alive|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a07 GMT|If-None-Match:"239e-6168531446d9c"
And here is what seems to be a better request, possibly from the package maintainer (a fine If-Modified-Since without the stale If-None-Match):
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-agent:gPodder/3.11.4 (+http%3a//gpodder.org/) Linux|Accept-Encoding:gzip, deflate, br|Accept:*/*|Connection:keep-alive|If-Modified-Since:Mon, 29 Apr 2024 09%3a52%3a04 GMT
iHeartRadio
iHeartRadio is sending a bogus very old If-Modified-Since header, and thus wasting lots of bandwidth (Apache mod_log_forensic extract):
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|user-agent:Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio|Accept-Encoding:gzip, deflate|Accept:*/*|Connection:keep-alive|if-modified-since:Tue, 03 Jan 2023 20%3a07%3a14 GMT
NetNewsWire
[26/May/2024:17:58:00 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 105 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
[26/May/2024:18:59:37 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 24 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
[26/May/2024:19:59:37 +0000] "GET /rss/podcast.rss HTTP/2.0" 304 106 "-" "NetNewsWire (RSS Reader; https://netnewswire.com/)"
Handle rate limiting for openrss.org #4224
states:
Pay attention to 429 status code and the Retry-After header.
Podcast Index (Aggrivator)
2024-06-20: header respect!
... some servers won't handle both at the same time. Sending If-none-match and If-modified-since both will result in returning the body no matter what the values are. So, I only send one or the other.
Also, if a feed hasn't updated in a certain time frame, it gets put on the don't check list, and goes into "lazy polling" mode, meaning it might only get checked once a month or even less. It's a "best effort" type algo.
It seems to vary between once per day (which is my target) and a few times per day but never rude. And lots of 304s. I don't think that you are quite following my Cache-control max-age overnight (I vary it dynamically to try to steer away from my skipHours). But not bad!
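Dave's "lazy polling" for quiet feeds could be approximated like this (a sketch of the general idea only; the actual Aggrivator algorithm and its thresholds are not published here, so the fraction and caps are assumptions):

```python
import time

def next_poll_interval(last_update_ts: float,
                       floor_s: float = 3600,          # never faster than hourly
                       ceil_s: float = 28 * 86400):    # never slower than ~monthly
    """Poll roughly every 1/10th of the feed's age since its last real update."""
    age = max(0.0, time.time() - last_update_ts)
    return min(ceil_s, max(floor_s, age / 10))
```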
Podnews
2024-06-11: GZIPped!
... The current reason that your client and my server are not talking (406: Not Acceptable) is because your client does not even allow gzip compression (which compresses my feed 8x). Brotli would be nicer (10x), but just please allow at least gzip! It may help tamp down your bandwidth bills as I think that most feeds are highly compressible, as well as helping save the planet!
to which the immediate positive response was, in part, Well, I've learnt a new thing - PHP's file_get_contents doesn't transparently support GZIP. ... Podnews only polls RSS feeds once every 28 days, so it won't make much of a difference to our bandwidth bill; but every little helps.
Mere hours later the bot was converted to accept gzip!
https://podnews.net/rss
just has!
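For clients whose HTTP layer does not negotiate compression for them (as Podnews discovered with PHP's file_get_contents), opting in explicitly is only a few lines. A sketch in Python using urllib, which likewise does not handle gzip by itself; the URL and User-Agent here are illustrative:

# Sketch: explicitly invite gzip and decode it if the server used it.
import gzip
import urllib.request

req = urllib.request.Request(
    "https://www.earth.org.uk/rss/podcast.rss",  # example feed URL
    headers={
        "Accept-Encoding": "gzip",
        "User-Agent": "ExampleFeedPoller/0.1 (+https://example.org/bot)",
    },
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
print(len(body), "bytes after decompression")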
2024-06-17: bandwidth bill for Podnews RSS feed has more than halved
Podnews wasn't supporting Gzip for our RSS feed (a mistake); it was accounting for 3.86GB of data per day. We turned it on last week; our bandwidth bill for our RSS feed has more than halved, and is now 1.22GB. We also now fetch feeds with gzip where we can.
... Brotli is better than Gzip. As of today, Podnews supports both Brotli and Gzip compression.
2024-06-21: skipHours and skipDays
Podnews has now added skipHours and skipDays to its feed!
Spotify
Spotify is allowing gzip compression, and is doing conditional GETs.
But its If-Modified-Since header is technically invalid: it sends
If-Modified-Since: Tue, 7 May 2024 09:46:22 GMT
where it should send
If-Modified-Since: Tue, 07 May 2024 09:46:22 GMT
(Apache mod_log_forensic extract):
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Tue, 7 May 2024 09%3a46%3a22 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
Spotify is not inviting br (Brotli) compression. And sometimes it sends no If-Modified-Since at all, ho hum:
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
[08/May/2024:09:27:26 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[08/May/2024:09:34:25 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 304 3625 "-" "Spotify/1.0"
[08/May/2024:09:41:25 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 429 4348 "-" "Spotify/1.0"
[08/May/2024:09:48:26 +0000] "GET /rss/podcast-lite.rss HTTP/1.1" 200 6101 "-" "Spotify/1.0"
Note the dropped If-Modified-Since by Spotify, and I am not the only one to see it. So I have reported it via their online support function again. I did also note that I received no response at all via email to the previous point, FWIW. Even to disagree.
TuneIn
TuneIn (via its TuneIn-Podcast-Checker bot) hosts a podcast directory. It polls the feed unconditionally at more than hourly rates, and ignores skipHours:
[03/Apr/2024:02:23:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:03:00:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:03:33:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:04:26:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:04:59:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:05:52:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:06:25:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:07:02:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:07:35:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:08:28:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:09:01:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:09:54:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:10:27:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:11:04:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:11:37:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:12:30:11 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:13:03:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:13:57:07 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:14:29:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:15:06:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11114 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:15:39:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11117 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:16:32:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11117 "-" "TuneIn-Podcast-Checker"
[03/Apr/2024:17:05:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11117 "-" "TuneIn-Podcast-Checker"
Update frequency: every 52 days. Average audio length: 9 minutes.
... we've found that the headings are unfortunately not reliable across our directory so our system doesn't take a look at them.
...
When can I expect to see the change to 40h polling? It is still more than hourly: see the log fragment below.
[07/Apr/2024:06:40:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:07:12:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:08:06:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:08:23:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:09:16:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:09:49:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:10:42:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:11:14:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:12:08:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:12:25:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11489 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:13:18:10 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11610 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:13:18:11 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11591 "-" "TuneInRssParser/1.0"
[07/Apr/2024:13:51:25 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11610 "-" "TuneIn-Podcast-Checker"
[07/Apr/2024:13:51:26 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11591 "-" "TuneInRssParser/1.0"
[07/Apr/2024:14:44:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11610 "-" "TuneIn-Podcast-Checker"
The second UA (TuneInRssParser) is unique in this log fragment:
[08/Apr/2024:20:55:12 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11715 "-" "TuneIn-Podcast-Checker"
[08/Apr/2024:20:55:17 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11696 "-" "TuneInRssParser/1.0"
[09/Apr/2024:14:38:27 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11965 "-" "TuneIn-Podcast-Checker"
[09/Apr/2024:14:38:28 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11946 "-" "TuneInRssParser/1.0"
[09/Apr/2024:21:20:12 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11965 "-" "TuneIn-Podcast-Checker"
[09/Apr/2024:21:20:13 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 11946 "-" "TuneInRssParser/1.0"
Podbean
... Podbean's daily requests for your feed are not frequent. Continuous requests occur because there are request failures, so we will retry this request. We will optimize this request. If a 429 error occurs again, we will increase the interval to reduce the crawling of your feed within a short period of time.
2024-04-20
...
[20/Apr/2024:11:04:50 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11393 "-" "Podbean/FeedUpdate 2.1"
[20/Apr/2024:11:19:10 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11389 "-" "Podbean/FeedUpdate 2.1"
I am rejecting Podbean/FeedUpdate 2.1 requests with a 429 during skipHours, because of its non-conditional requests, when it should not be asking at all anyway.
Podbean does not back off on a 429. For example, one client (one IP) below comes back every minute to retry, ignoring the several-hours Retry-After response header:
[22/Apr/2024:01:04:35 +0000] "GET /rss/podcast.rss HTTP/2.0" 429 747 "-" "Podbean/FeedUpdate 2.1"
[22/Apr/2024:01:05:38 +0000] "GET /rss/podcast.rss HTTP/2.0" 429 747 "-" "Podbean/FeedUpdate 2.1"
[22/Apr/2024:01:06:40 +0000] "GET /rss/podcast.rss HTTP/2.0" 429 747 "-" "Podbean/FeedUpdate 2.1"
[22/Apr/2024:01:07:43 +0000] "GET /rss/podcast.rss HTTP/2.0" 429 747 "-" "Podbean/FeedUpdate 2.1"
...
Podbean's unconditional requests look like this (Apache mod_log_forensic extract):
GET /rss/podcast.rss HTTP/2.0|User-Agent:Podbean/FeedUpdate 2.1|Accept:*/*|Accept-Encoding:gzip|Host:www.earth.org.uk
Podcast Addict
...
[22/Apr/2024:06:00:05 +0000] "GET /rss/podcast.rss HTTP/2.0" 200 11373 "-" "PodcastAddict/v5 (+https://podcastaddict.com/; Android podcast app)"
[22/Apr/2024:06:00:05 +0000] "GET /rss/podcast.rss HTTP/2.0" 429 674 "-" "PodcastAddict/v5 (+https://podcastaddict.com/; Android podcast app)"
...
which makes me think that you may have a debounce issue in the GUI, though I still suspect it is more likely a glitch in your HTTP backend library...
Podcast Addict uses okhttp, like most Android apps, so I pointed him at the GitHub Feeder issue where Feeder and my Apache server learnt to play together better!
okhttp 4.9.3 at least seems to generate technically invalid If-Modified-Since headers: the 7 May should be 07 May (Apache mod_log_forensic extract):
GET /rss/podcast.rss HTTP/2.0|Accept-Encoding:gzip, deflate|If-Modified-Since:Tue, 7 May 2024 09%3a46%3a11 GMT|Cache-Control:public|User-Agent:okhttp/4.9.3|Host:www.earth.org.uk
The app is already using if modified since and behaves according to the returned value (skipping the update in case if 304).
But to get a 429 from my server the request must be missing both If-None-Match and If-Modified-Since headers (among other things), and polling within the hours forbidden in the RSS feed skipHours.
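Incidentally, generating the valid two-digit-day IMF-fixdate form is best left to a standard library rather than hand-rolled formatting; for example, in Python:

# Sketch: a syntactically valid If-Modified-Since value per RFC 9110
# ("07 May", not "7 May").
import email.utils
from datetime import datetime, timezone

dt = datetime(2024, 5, 7, 9, 46, 11, tzinfo=timezone.utc)
print(email.utils.format_datetime(dt, usegmt=True))
# Tue, 07 May 2024 09:46:11 GMT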
Hints Dropped and Defences Erected
To give remote entities polling the RSS feed file every chance to avoid pointless polls, with their wasted CPU and bandwidth, I provide a suite of hints, at least some of which any poller could act on.
I also provide alternateEnclosure items, alternatives alongside the default MP3 (audio) or MP4 (video) file, that allow users to download much smaller versions if they wish, to save more bandwidth, data-charges, CPU, etc. I have not seen evidence of any client using (or able to use) these.
I also have defences to reduce pointless traffic somewhat.
More on hints and defences...
2024-04-03: snapshot
In the RSS file itself are the following lines in the channel part:
<pubDate>Wed, 03 Apr 2024 12:58:31 GMT</pubDate>
<ttl>1507</ttl>
<skipHours><hour>0</hour><hour>1</hour><hour>2</hour><hour>3</hour><hour>4</hour><hour>5</hour><hour>6</hour><hour>7</hour><hour>22</hour><hour>23</hour></skipHours>
<sy:updatePeriod>monthly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<podcast:updateFrequency rrule="FREQ=MONTHLY">monthly</podcast:updateFrequency>
This says that updates are expected roughly monthly and that updating once in that interval is OK; that this feed has a TTL (time to live) of ~25h, ie can be cached that long; and that updates will generally not be happening from 22:00Z to 07:59Z, so please do not poll then at all.
(Possibly the TTL should be higher, up to a month... I have since pushed the value up to 4327 minutes, ie a little over 3 days.)
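Acting on these in-feed hints takes very little client code. A minimal Python sketch (function names are mine, and error handling is omitted) of honouring skipHours and ttl:

# Sketch: honour channel <skipHours> (hours are GMT per the RSS spec)
# and <ttl> (minutes) from an already-fetched feed.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def may_poll_now(feed_xml):
    channel = ET.fromstring(feed_xml).find("channel")
    skip = {int(h.text) for h in channel.findall("skipHours/hour")}
    return datetime.now(timezone.utc).hour not in skip

def min_poll_interval_s(feed_xml, default_s=3600):
    ttl = ET.fromstring(feed_xml).find("channel").findtext("ttl")
    return int(ttl) * 60 if ttl and ttl.isdigit() else default_s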
In the HTTP response headers for the feed file are the following relevant lines:
Date: Wed, 03 Apr 2024 18:15:19 GMT
Last-Modified: Wed, 03 Apr 2024 15:34:48 GMT
ETag: "133ff-61532f67e0edd"
Cache-Control: max-age=14820
Expires: Wed, 03 Apr 2024 22:22:19 GMT
The Last-Modified: allows an If-Modified-Since conditional fetch. The ETag allows an If-None-Match conditional fetch. So if a conditional fetch is used and the feed file has not changed, then only a very small 304 status response is sent.
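A minimal conditional-fetch sketch in Python (names are illustrative). It sends only one validator at a time, given the earlier report that some servers mishandle receiving both:

# Sketch: conditional GET with a persisted validator cache.
import urllib.error
import urllib.request

def fetch_if_changed(url, cache):
    """cache: a dict persisted between polls, eg {'etag': ..., 'last_modified': ...}."""
    headers = {"Accept-Encoding": "gzip"}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    elif cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            cache["etag"] = resp.headers.get("ETag")
            cache["last_modified"] = resp.headers.get("Last-Modified")
            return resp.read()  # changed: body returned (possibly gzipped)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # unchanged: nothing to transfer, decompress or parse
        raise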
The Cache-Control: max-age and Expires are pushed out from this daytime poll's 4h7 to 7h7 during skipHours. Paying attention to either header would push polling frequency well below the typical default ~1h. If a conditional fetch is done, only a slow string of tiny 304s should happen almost all the time, and not even that in skipHours ideally!
Also, I defer any rebuilding of the rss/podcast.rss file during skipHours, or when the GB grid has high carbon intensity, or when the local battery is low. This should help reduce GB-grid-powered network traffic at these times.
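The deferral test itself can be tiny. A sketch under stated assumptions: the flag-file paths mirror those the Apache rules below consult, with the DOCUMENT_ROOT location a placeholder:

# Sketch: defer feed rebuilds during skipHours or resource shortage.
import os
from datetime import datetime, timezone

GRID_RED_FLAG = "/var/www/_gridCarbonIntensityGB.7d.red.flag"  # assumed DOCUMENT_ROOT
BATTERY_LOW_FLAG = "/run/EXTERNAL_BATTERY_LOW.flag"
SKIP_HOURS = set(range(0, 8)) | {22, 23}  # matches the feed's skipHours

def may_rebuild_feed_now():
    if datetime.now(timezone.utc).hour in SKIP_HOURS:
        return False
    return not (os.path.exists(GRID_RED_FLAG) or os.path.exists(BATTERY_LOW_FLAG))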
2024-04-22: defences
Current defensive measures, on top of the hints and update restrictions, in place to reduce the bandwidth/CPU wasted polling the RSS feed itself are:
During skipHours (22:00Z to 07:59Z), when ideally no polling should be happening at all, an unconditional request (eg no If-Modified-Since), in the absence of a Referer, and when GB grid carbon intensity is high (or there is not even any identifying User-Agent), will be rejected with 429 "Too Many Requests" and a long Retry-After that matches the Cache-Control max-age for an accepted poll.
This is intended to allow through manual/human updates, especially from browsers and podcast catcher mobile clients, where possible, only blocking mindless brute-force waste.
# For RSS files (which will have skipHours matching the above),
# if there is no Referer and no conditional fetching, back off
# when battery is low or the grid intensity is high or there is no UA.
# 429 Too Many Requests
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
# Not saying who you are (no User-Agent) and ignoring skipHours is rude.
RewriteCond %{HTTP:User-Agent} ^$ [NV,OR]
# Have any interaction with the filesystem as late as possible.
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f [OR]
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=429,E=RSS_RATE_LIMIT:1]
Header always set Retry-After "25620" env=RSS_RATE_LIMIT
For attempts to fetch during skipHours or when GB grid carbon intensity is high, making an unconditional request (eg no If-Modified-Since), in the absence of a Referer, and without even gzip compression invited, the request will be rejected with 406 "Not Acceptable". The missing gzip in Accept-Encoding is technically the trigger for this rejection.
# Reject (bot) attempts to unconditionally fetch without compression.
# 406 Not Acceptable.
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:If-None-Match} ^$ [NV]
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21" [OR]
# Have any interaction with the filesystem as late as possible.
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,R=406]
Again, these rules are intended to allow human-driven requests through, and only block brute-force badly-behaved bots.
These defences overlap, so a client may get either where they do, though only 406 can happen outside skipHours for now.
Bad poll defences may be extended, especially 429.
: I have extended the in-skipHours expiry time to 10h7, so as to push the next allowed poll out of skipHours.
: For EOU RSS traffic I have rearranged the defences to return a 429 in preference to a 406 if both are applicable, since a 429 does seem to slow down Amazon for example. And the Retry-After header may provide more control (than 406) with better behaved clients.
: I removed the If-None-Match guards on the 406 and 429 defences, as the EOU site is no longer generating ETags to which those would be a good response.
I also introduced something a little like networking's RED (Random Early Drop). For those RSS feed clients that do not allow compression, outside skipHours times I now aim to randomly drop a high enough fraction of requests, ie about 75% or above, to roughly compensate for the bandwidth wasted in those allowed through: the feed typically compresses over 4x with gzip, so letting through only ~25% of uncompressed fetches costs about the same bytes as serving all of them compressed.
But since bad grid/battery conditions can continue for days, and some of these bad bots may give up entirely with no access over those sorts of timescales, I now make a hole in the 406 defence for an hour at noon, when maximum solar power is likely available.
: for the skipHours times I have strengthened the 429 rejections. A client has to be on its best behaviour to avoid a rejection at these times, and a portion of even 'good' requests is randomly rejected ('RED') on the basis that a 'good' client should not be requesting at all at those times. The rejection rate is limited so as to avoid locking out human beings signing up for the first time, and/or manually refreshing the feed in the middle of the night: a repeat manual attempt remains bearable.
: for all times of day I added a new ~50% random 429 rejection for unconditional requests. All the rejection rules are now stricter, and check that any If-Modified-Since conditional header has well-formed syntax. The new 429 rule and the 406 rule now have a carve-out, ie they are disabled, at noon GMT for an hour, to help let some of the more brutish once-an-hour semi-bad bots through. This is when there is more likely to be some solar generation around!
: reduced the ~50% random 429 rejection to ~25%, which seems less harsh on those clients that are playing entirely nicely.
: for the first full 24h () after the last adjustment to the 406 and 429 rules, the feedStatusByHour.log summary is:
36 173066 200:304:406:429:SH 11 12 0 13 36 00
36 193682 200:304:406:429:SH 12 9 0 15 36 01
35 156699 200:304:406:429:SH 9 9 0 16 35 02
39 185276 200:304:406:429:SH 10 9 0 20 39 03
34 78026 200:304:406:429:SH 0 0 0 33 34 04
42 115296 200:304:406:429:SH 1 0 0 41 42 05
55 112662 200:304:406:429:SH 0 0 0 55 55 06
43 100411 200:304:406:429:SH 0 0 0 43 43 07
38 189488 200:304:406:429:SH 12 7 9 10 0 08
32 220092 200:304:406:429:SH 10 7 6 9 0 09
34 250525 200:304:406:429:SH 17 4 7 6 0 10
39 319287 200:304:406:429:SH 20 2 6 11 0 11
54 794273 200:304:406:429:SH 48 6 0 0 0 12
56 464185 200:304:406:429:SH 34 4 8 5 0 13
55 445791 200:304:406:429:SH 30 3 6 16 0 14
48 371215 200:304:406:429:SH 24 6 6 12 0 15
44 391576 200:304:406:429:SH 18 5 7 13 0 16
47 277909 200:304:406:429:SH 23 5 6 13 0 17
46 276134 200:304:406:429:SH 25 3 10 8 0 18
38 200453 200:304:406:429:SH 17 3 6 12 0 19
48 263608 200:304:406:429:SH 23 2 10 13 0 20
41 528001 200:304:406:429:SH 25 3 1 12 0 21
40 223109 200:304:406:429:SH 19 4 0 17 40 22
35 203545 200:304:406:429:SH 16 1 0 18 35 23
1015 6534309 200:304:406:429:SH 404 104 88 411 395 ALL
Looking at the data for noon (12, ie 12:00Z to 12:59Z), it suggests that bytes transferred would have been about 3x (300%) higher without defences in place, and the request count ~27% higher. (In that noon hour there were 48 200 responses and only 6 304 responses, for 54 in total with no error responses, where almost all should have been 304s.) This still leaves 1015 interactions and ~6.5MB of RSS file pulled in a day with no episode in over a month, and no changes (even metadata) since the previous morning!
For the same day, the heaviest users from feedStatusByUA.log are:
221 1807388 200:304:406:429:SH 41 0 54 126 92 "iTMS"
206 1032689 200:304:406:429:SH 93 36 0 77 86 "Spotify/1.0"
117 983993 200:304:406:429:SH 82 0 1 34 34 "Amazon Music Podcast"
92 891920 200:304:406:429:SH 77 0 0 15 13 "Podbean/FeedUpdate 2.1"
44 76488 200:304:406:429:SH 2 0 22 20 16 "-"
(Spotify is getting the 'lite' feed at ~10% the size of the full feed.)
: I relaxed the regexes checking If-Modified-Since to accept 7 May as well as 07 May, to allow a few more 304s.
: the non-skipHours random-fraction 429 rejections will now only happen during some sort of resource constraint, such as high grid intensity or low battery. I also reverted the fancy If-Modified-Since regexes to just checking for a missing/empty header.
2024-05-14: by UA rate limiting
Today's new rate limiting defence against bots that make the highest request rates measured over the preceding week or so is to reject essentially all their no-Referer unconditional requests with 429s, except at noon (or when the battery is VHIGH).
In theory a (conditional) request or two per day might be optimal, though one per hour has been customary for RSS feeds, and both full and lite feeds might be polled by a bot, and there might be many users of the same end client. So more than ~400 per ~week is the qualifier for the greedy list, currently, along with being in the top 10 or so by request count.
I expect this to drastically reduce feed-polling bandwidth demands, without any significant loss in service to me or potential podcast listeners.
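The list-building itself can be simple. A hypothetical Python sketch (the counting source and paths are stand-ins) that writes the per-UA flag files which the Apache config shown later tests via the md5 of the User-Agent:

# Sketch: rebuild "greedy bot" flag files named by md5(User-Agent).
import hashlib
import os

GREEDY_DIR = "/var/www/rss/greedybot"  # assumed DOCUMENT_ROOT/rss/greedybot
WEEKLY_LIMIT = 400                     # more than ~400 polls/week qualifies
TOP_N = 10                             # and only the top 10 or so by count

def rebuild_greedy_flags(weekly_hits_by_ua):
    """weekly_hits_by_ua: dict of User-Agent string -> requests in ~a week."""
    ranked = sorted(weekly_hits_by_ua.items(), key=lambda kv: -kv[1])
    greedy = [ua for ua, n in ranked[:TOP_N] if n > WEEKLY_LIMIT]
    for name in os.listdir(GREEDY_DIR):  # clear the old list first
        os.remove(os.path.join(GREEDY_DIR, name))
    for ua in greedy:
        digest = hashlib.md5(ua.encode("utf-8")).hexdigest()
        # Non-empty flag file: some rules test existence (-f), harsher ones size (-s).
        with open(os.path.join(GREEDY_DIR, digest + ".flag"), "w") as f:
            f.write(ua + "\n")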
2024-05-14: defences snapshot
As of today:
- during skipHours, no-Referer unconditional requests, when grid intensity is high or battery is low, or anything other than squeaky clean, may be rejected with 429
- other than during noon UTC, no-Referer unconditional requests, when grid intensity is high or battery is low, will have a small fraction randomly rejected with 429
- other than during noon UTC, no-Referer unconditional incompressible requests will almost all be rejected with 406, to trim bandwidth towards what allowing compression would have used
- other than during noon UTC (or when the battery is VHIGH), no-Referer unconditional requests from the most hungry bots, with access rates well over hourly for the last week or so, will be rejected with 429
The last of these depends on the User-Agent being set consistently and honestly, and is also potentially unfair to popular podcast clients with many instances/users, so should be considered the most fragile and least favoured defence.
The success Cache-Control max-age and failure 429 Retry-After vary by time of day to try to steer hits away from skipHours, and are always many hours.
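A sketch of that steering calculation (the real rules use fixed offsets, 4h7 by day and more in skipHours, rather than this search, but the idea is the same):

# Sketch: pick a max-age / Retry-After that is always many hours and
# whose expiry lands outside skipHours (22:00Z to 07:59Z).
from datetime import datetime, timedelta, timezone

SKIP_HOURS = set(range(0, 8)) | {22, 23}
MIN_AGE_H = 4  # always at least a few hours

def steering_max_age_s(now=None):
    now = now or datetime.now(timezone.utc)
    expiry = now + timedelta(hours=MIN_AGE_H)
    while expiry.hour in SKIP_HOURS:  # push the next allowed poll into daytime
        expiry += timedelta(hours=1)
    return int((expiry - now).total_seconds())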
2024-06-08: 503 and rogue Googlebot
Googlebot seems to regard 429 as 'hurry back please', and is polling nearly 1000 times per day, more than 4 times next-place iTunes. So I am testing a defence against the top 3 bots of rejecting their polls during skipHours with 503. I am hoping that Googlebot does eventually respond sensibly to 503, though there is a risk that that will hurt crawling of the entire site.
2024-06-24: Googlebot forcefully stopped
503s worked no better than 429s for Googlebot. Eventually, blocking Googlebot entirely in robots.txt, waiting for all its crawling to stop, and then carefully unblocking everything except the feed file, brought things back to normal.
Amusingly, Google-Podcast is showing up with hourly polls (thus also ignoring Cache-Control etc), even though the Google Podcasts service is dead as of about today worldwide.
2024-07-05: 503s instead of 429s
Both Podbean and Amazon respond in some way to 503s. Amazon also responds to 429s. Spotify and iTunes do not respond to anything. So I switched all 429 responses to 503.
I have also taken the opportunity, at least for now, to remove the special-case initial very harsh 503 treatment for very bad bots, and to slightly extend one of the ex-429 cases, for some simplification.
2024-08-08: RSS lastBuildDate
I modified the podcast RSS feed lastBuildDate to reflect the timestamp of the newest primary media file (eg .mp3) in the feed, after filtering for the 'lite' version.
The lastBuildDate was previously the latest timestamp of any of the podcast episode HTML source files, with no 'lite' filtering. As I touch the HTML quite often for minor edits, that kept the timestamp newer than needed for pulling new entries, though it might have helped clients keen on metadata updates.
This instantly pushed both normal and lite feeds to look more than a month old.
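A sketch of the new calculation; the glob and the size-based 'lite' filter here are stand-ins for the real filtering:

# Sketch: lastBuildDate from the newest primary media file's mtime.
import email.utils
import glob
import os

def last_build_date(media_glob="audio/*.mp3", lite=False):
    files = glob.glob(media_glob)
    if lite:  # stand-in: the real lite feed filters differently
        files = [f for f in files if os.path.getsize(f) < 5_000_000]
    newest = max(os.path.getmtime(f) for f in files)
    return email.utils.formatdate(newest, usegmt=True)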
2024-09-02: config snapshot
This set of protective config (slightly tweaked, to hurt the misbehaving) has been unchanged since .
# Aggressively reject unconditional greedy podcast bots, other than noon.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond "%{HTTP:User-Agent}" ^$ [OR]
RewriteCond expr "-f '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'"
RewriteCond expr "-s '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'" [OR]
RewriteCond /run/EXTERNAL_BATTERY_VHIGH.flag !-f
RewriteRule "^/rss/.*\.rss$" - [L,END,R=503,E=REDIRECT_RSS_RATE_LIMIT:1]

# Aggressively trim unconditional (even non-skipHours) traffic, other than noon.
# This is enforced only during a resource shortage (grid high or battery low).
# And for the worst bots.
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond expr "'%{md5:%{TIME}blah}' =~ /^[6-9]/"
RewriteCond expr "-f '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'"
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f [OR]
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,END,R=503,E=REDIRECT_RSS_RATE_LIMIT:1]

# Trim skipHours bot unconditional traffic.
# During skipHours.
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21"
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond %{HTTP:Accept-Encoding} !gzip [OR]
RewriteCond %{HTTP:User-Agent} ^$ [OR]
RewriteCond expr "-s '%{DOCUMENT_ROOT}/rss/greedybot/%{md5:%{HTTP:User-Agent}}.flag'" [OR]
RewriteCond /run/EXTERNAL_BATTERY_LOW.flag -f [OR]
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,END,R=503,E=REDIRECT_RSS_RATE_LIMIT:1]

# Reject (bot) attempts to fetch without compression, unconditionally.
# Make a small (1h) hole at noon to allow some daily access for semi-bad bots.
RewriteCond %{HTTP:Accept-Encoding} !gzip
RewriteCond %{HTTP:If-Modified-Since} " 202[0123] " [NV,OR]
RewriteCond %{HTTP:If-Modified-Since} ^$ [NV]
RewriteCond "%{TIME_HOUR}" "!=12"
RewriteCond "%{TIME_HOUR}" "<08" [OR]
RewriteCond "%{TIME_HOUR}" ">21" [OR]
RewriteCond %{HTTP:User-Agent} ^$ [OR]
RewriteCond expr "'%{md5:%{TIME_HOUR}%{TIME_MIN}BLaH}' =~ /^[^046f]/" [OR]
RewriteCond %{DOCUMENT_ROOT}/_gridCarbonIntensityGB.7d.red.flag -f
RewriteRule "^/rss/.*\.rss$" - [L,END,R=406]
Data Snapshots
Some interesting data emerged while trying to understand interactions with particular clients.
More on data snapshots...
2024-05-07: feed request headers
Over two sampling runs during 2024-05-07/2024-05-08, request headers were collected with Apache's mod_log_forensic, and some possibly-identifying information was stripped out.
Log lines with headers are of the form (starting with '+'):
+2390:663a7938:13|GET /_dashboard.html HTTP/2.0|User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv%3a125.0) Gecko/20100101 Firefox/125.0|Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8|Accept-Language:en-GB,en;q=0.5|Accept-Encoding:gzip, deflate, br|...
36 unique request signatures out of 155 feed requests (1440 total EOU site requests) were seen.
% wc -l /var/log/apache2/forensic.log
2880 /var/log/apache2/forensic.log
% egrep '^[+]' /var/log/apache2/forensic.log | wc -l
1440
% egrep rss/podcast /var/log/apache2/forensic.log | egrep '^[+]' | wc -l
155
% egrep rss/podcast /var/log/apache2/forensic.log | wc -l
155
% egrep rss/podcast /var/log/apache2/forensic.log | sed -e 's/^[^|]*|//' | sort -u | wc -l
36
% egrep rss/podcast /var/log/apache2/forensic.log | sed -e 's/^[^|]*|//' | sort -u
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast-lite.rss HTTP/1.1|User-Agent:Spotify/1.0|If-Modified-Since:Tue, 7 May 2024 09%3a46%3a22 GMT|Accept-Encoding:gzip, x-gzip, deflate|Host:www.earth.org.uk|Connection:keep-alive
GET /rss/podcast.rss HTTP/1.0|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/1.1|Accept:application/json, text/plain, */*|User-Agent:axios/1.5.1|Accept-Encoding:gzip, compress, deflate, br|Host:www.earth.org.uk|Connection:close
GET /rss/podcast.rss HTTP/1.1|accept:*/*|user-agent:Aggrivator (PodcastIndex.org)/v0.1.7|if-modified-since:Tue, 07 May 2024 09%3a46%3a11 GMT|accept-encoding:gzip|host:www.earth.org.uk
GET /rss/podcast.rss HTTP/1.1|Accept:*/*|User-Agent:Mozilla/5.0|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a11 GMT|If-None-Match:"2b71-61675c75db55a"|Accept-Encoding:gzip,deflate|Connection:close|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|Accept-Encoding:gzip, deflate, br|Accept:*/*|Connection:keep-alive|User-Agent:HTTPie/3.2.2
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|Connection:keep-alive|Accept:*/*|From:googlebot(at)googlebot.com|User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http%3a//www.google.com/bot.html)|Accept-Encoding:gzip, deflate, br|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a07 GMT
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|sentry-trace:XXX|baggage:XXX|User-Agent:TPA/1.0.0|Accept-Encoding:gzip, deflate|Accept:*/*|Connection:keep-alive|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a11 GMT|traceparent:XXX|elastic-apm-traceparent:XXX|tracestate:es=s%3a0
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|sentry-trace:XXX|baggage:XXX|User-Agent:TPA/1.0.0|Accept-Encoding:gzip, deflate|Accept:*/*|Connection:keep-alive|If-Modified-Since:Wed, 01 May 2024 12%3a04%3a28 GMT|traceparent:XXX|elastic-apm-traceparent:XXX|tracestate:es=s%3a0
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:feedparser/6.0.8 +https%3a//github.com/kurtmckee/feedparser/|Accept-Encoding:gzip, deflate|Accept:application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1|A-Im:feed|Connection:close
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-agent:gPodder/3.11.1 (+http%3a//gpodder.org/) Linux|Accept-Encoding:gzip, deflate, br|Accept:*/*|Connection:keep-alive|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a07 GMT|If-None-Match:"239e-6168531446d9c"
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-agent:gPodder/3.11.4 (+http%3a//gpodder.org/) Linux|Accept-Encoding:gzip, deflate, br|Accept:*/*|Connection:keep-alive|If-Modified-Since:Mon, 29 Apr 2024 09%3a52%3a04 GMT
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|user-agent:Mozilla/5.0 (Linux;) AppleWebKit/ Chrome/ Safari - iHeartRadio|Accept-Encoding:gzip, deflate|Accept:*/*|Connection:keep-alive|if-modified-since:Tue, 03 Jan 2023 20%3a07%3a14 GMT
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv%3a125.0) Gecko/20100101 Firefox/125.0|Accept:*/*|Accept-Language:en-GB,en;q=0.5|Accept-Encoding:gzip, deflate|Cache-control:no-cache|DNT:1|Connection:keep-alive|Cookie:XXX
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36|Accept:*/*|Accept-Encoding:identity|Connection:keep-alive|Accept-Language:en-US;q=0.9, en;q=0.8
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:NRCAudioIndexer/1.1|sentry-trace:XXX|baggage:XXX|Accept:*/*|Accept-Encoding:gzip, deflate, br
GET /./rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:Overcast/1.0 Podcast Sync (0 subscribers; feed-id=XXX; +http%3a//overcast.fm/)|Accept-Encoding:gzip|Connection:close
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:Overcast/1.0 Podcast Sync (3 subscribers; feed-id=XXX; +http%3a//overcast.fm/)|If-Modified-Since:Wed, 08 May 2024 15%3a23%3a27 GMT|Accept-Encoding:gzip|Connection:close
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:Podcasts/1555.2.1 CFNetwork/1237 Darwin/20.4.0|Accept-Encoding:gzip|Connection:close
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:UniversalFeedParser/3.3 +http%3a//feedparser.org/|Accept-Encoding:gzip, deflate|Accept:*/*|Connection:keep-alive|If-Modified-Since:Wed, 01 May 2024 12%3a04%3a28 GMT
GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-Agent:Wget/1.24.5|Accept:*/*|Accept-Encoding:gzip|Connection:Keep-Alive
GET /rss/podcast.rss HTTP/1.1|User-Agent:Amazon Music Podcast|Host:www.earth.org.uk|Connection:Keep-Alive|Accept-Encoding:gzip,deflate
GET /rss/podcast.rss HTTP/1.1|User-Agent:CopernicusBot/1.0|If-None-Match:"15f3a-615e2c8329545-gzip"|Accept-Encoding:gzip;q=1.0,deflate;q=0.6,identity;q=0.3|Accept:*/*|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/1.1|User-Agent:PocketCasts/1.0 (Pocket Casts Feed Parser; +http%3a//pocketcasts.com/)|Accept-Encoding:gzip|Host:www.earth.org.uk|Connection:close
GET /rss/podcast.rss HTTP/1.1|User-Agent:PocketCasts/1.0 (Pocket Casts Feed Parser; +http%3a//pocketcasts.com/)|Accept-Encoding:gzip|Host:www.earth.org.uk|Connection:Keep-Alive
GET /rss/podcast.rss HTTP/2.0|Accept-Encoding:gzip, deflate, br|From:bingbot(at)microsoft.com|Accept:*/*|User-Agent:Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http%3a//www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/2.0|Accept-Encoding:gzip, deflate|If-Modified-Since:Tue, 7 May 2024 09%3a46%3a11 GMT|Cache-Control:public|User-Agent:okhttp/4.9.3|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/2.0|Accept-Encoding:gzip|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a11 GMT|User-Agent:SpaceCowboys Android RSS Reader / 2.6.24(309)|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/2.0|Accept:*/*|User-Agent:Podbean/FeedUpdate 2.1|Accept-Encoding:gzip|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/2.0|User-Agent:fyyd-poll-1/0.5|Accept:*/*|Accept-Encoding:gzip,deflate|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/2.0|User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a11 UTC|Accept-Encoding:gzip|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/2.0|User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36|If-Modified-Since:Tue, 23 Apr 2024 10%3a51%3a23 UTC|Accept-Encoding:gzip|Host:www.earth.org.uk
GET /rss/podcast.rss HTTP/2.0|User-Agent:Podbean/FeedUpdate 2.1|Accept:*/*|Accept-Encoding:gzip|Host:www.earth.org.uk
HEAD /rss/podcast.rss HTTP/1.1|host:www.earth.org.uk|connection:Close
HEAD /rss/podcast.rss HTTP/1.1|User-Agent:iTMS|Host:www.earth.org.uk|grpc-timeout:30000m
This does not include all clients seen over the course of a typical day, eg those that poll less frequently.
2024-06-13: distinct UAs
The number of distinct User-Agents does not have a clear trend (in this case a line for each plus ALL):
% wc -l 2024????/feedStatusByUA.log
130 20240420/feedStatusByUA.log
124 20240421/feedStatusByUA.log
134 20240429/feedStatusByUA.log
116 20240505/feedStatusByUA.log
126 20240513/feedStatusByUA.log
111 20240519/feedStatusByUA.log
160 20240527/feedStatusByUA.log
119 20240602/feedStatusByUA.log
139 20240610/feedStatusByUA.log
2024-06-16: continuing Googlebot rampage
The top few UAs (headed with the ALL total) by hits/bytes for the 6 days ending this morning are:
13692 69163809 200:304:406:429:503:SH 2524 811 314 5233 4777 5732 ALL
5984 27279430 200:304:406:429:503:SH 382 16 0 2568 3012 2937 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
1506 9935875 200:304:406:429:503:SH 178 3 155 535 635 616 "iTMS"
1254 2411116 200:304:406:429:503:SH 150 0 0 713 391 313 "Podbean/FeedUpdate 2.1"
1235 5070232 200:304:406:429:503:SH 145 261 0 331 498 512 "Spotify/1.0"
594 1514089 200:304:406:429:503:SH 108 0 0 245 241 247 "Gofeed/1.0"
480 3016315 200:304:406:429:503:SH 232 0 0 248 0 112 "Amazon Music Podcast"
Googlebot has still not regained its sanity over this feed file...
The 3 304s against iTMS were likely from validators, not Apple. If Spotify can do conditional GETs then Apple can too. And Podbean. And Amazon. And whatever Gofeed/1.0 is...
2024-06-23
In desperation, at ~09:24Z I have blocked Googlebot from the whole EOU site in robots.txt, temporarily, to see if that stops the stupidity.
User-agent: Googlebot
Disallow: /
That did, after a few hours, stop crawling entirely. So I re-enabled crawling for Googlebot except for /rss/podcast.rss
. (Blocking that alone the night before did not achieve anything, it seemed.)
2024-06-27
Note from me to a contact at Google:
I was only able to 'resolve' the issue in the end by banning Googlebot *entirely* from the site in robots.txt, waiting for all crawling by plain Googlebot to stop (~3h), reinstating the block for the single URL in place of the full ban, and waiting for Googlebot to start up again (>~8h).
The behaviour around 503s and 429s is horrible, out of spec, a waste of resources all round, ... Please fix!
2024-07-03
FWIW Google-Podcast now seems to have gone potty [~6 rapid-fire requests per hour even in the face of 429s/503s], even though it isn't even meant to be a thing any more. Some of your engineering colleagues should use the Retry-After header in 429 responses rather than madly retrying many times in a tight loop!
[03/Jul/2024:10:07:02 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 429 3742 "-" "Google-Podcast"
[03/Jul/2024:10:07:02 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 887 "-" "Google-Podcast"
[03/Jul/2024:10:07:07 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 429 3742 "-" "Google-Podcast"
[03/Jul/2024:10:07:07 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 887 "-" "Google-Podcast"
[03/Jul/2024:10:07:21 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 429 3742 "-" "Google-Podcast"
[03/Jul/2024:10:07:21 +0000] "GET /rss/podcast.rss HTTP/1.1" 429 887 "-" "Google-Podcast"
2024-07-05
Google-Podcast responds no better to 503s:
[05/Jul/2024:21:07:03 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3539 "-" "Google-Podcast"
[05/Jul/2024:21:07:04 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3859 "-" "Google-Podcast"
[05/Jul/2024:21:07:09 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3539 "-" "Google-Podcast"
[05/Jul/2024:21:07:09 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3859 "-" "Google-Podcast"
[05/Jul/2024:21:07:19 +0000] "HEAD /rss/podcast.rss HTTP/1.1" 503 3539 "-" "Google-Podcast"
[05/Jul/2024:21:07:20 +0000] "GET /rss/podcast.rss HTTP/1.1" 503 3859 "-" "Google-Podcast"
It is annoyingly lively for a service nominally turned off world-wide some time ago...
Sonification
I would like to turn some of the observed data into sound to understand it ... and to make a podcast episode or three!
Listen to sonifications 1 and 2.
More on sonification...
2024-05-27: initial sonification thoughts
Mapping a day (ie 24h) onto 24 notes (6 bars / 3s) or 24 bars (12s) seems a good start. Some data is available from April and May so far, thus ~60d, thus 1800s (30 minutes) or 120 minutes. A bit long, but not impossible.
One idea is to map each User-Agent, at least the significant ones, to a separate and distinctive instrument (eg sax, flute, synth) and/or a distinct note (on some scale to be picked). Have note intensity or duration in each slot correspond to (say) bandwidth or number of hits in the interval covered and represented by the note, or the fraction of good (200 or 304) hits vs bad/rejected (4xx) hits. Maybe an octave down for bad?
Another is for the consolidated results in each interval to map result codes to (say) a pentatonic scale, first a week at a time for 8 sets of 6 bars, ie 24s, which might make for a good overview. Data is at data/2024????/feedStatusByHour.log within the dataset.
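As a toy illustration of that mapping (the note choices are arbitrary picks from a C major pentatonic scale, not the final sonification):

# Sketch: map each hour's dominant status class to a pentatonic MIDI note.
PENTATONIC = {"200": 60, "304": 62, "406": 64, "429": 67, "SH": 69}  # C D E G A

def hour_to_note(counts):
    """counts: eg {'200': 11, '304': 12, '406': 0, '429': 13, 'SH': 36}."""
    dominant = max(counts, key=counts.get)
    return PENTATONIC[dominant]

print(hour_to_note({"200": 11, "304": 12, "406": 0, "429": 13, "SH": 36}))  # 69 (A)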
Or drum beat pairs with velocity proportional to bytes/hits per hour slot. For data up to , prepared with statsHouse-5.3.1.min.jar -feedHitsSummary type 1, and with the MP3 generated by GarageBand 10.4.11, that sounds like:
With a tweak to play low toms for hits in skipHours (ie night!), with V5.3.2:
Using V5.3.3 exported .dat output, in the spirit of [zong2024umwelt] audio-led rendering, via a gnuplot WIP script (snapshot):
That rise in hits/h at the end is mainly Googlebot going rogue!
Interestingly the .mid, .dat, and (smaller) .png are fairly close in size, which probably indicates something about information content and density.
3652 img/research/RSS-efficiency/out/audio/20240604-byHourSummary1.mid
3810 img/research/RSS-efficiency/out/audio/20240604-byHourSummary1.dat
4087 img/research/RSS-efficiency/out/audio/20240608-byHourSummary1.lb.png
As California and Prog rock (debut tracks for "1 Gig Big" with RN):
References
- [boelen2024cool] RSS is cool! Some RSS feed readers are not (yet)...
- [chun2010connections] Government 2.0: Making connections between citizens, data and government
- [everman2018GreenWeb] GreenWeb: Hosting High-Load Websites Using Low-Power Servers
- [gray2020podcast] What is a Podcast?
- [GreenSoftware] Green software
- [hinton2011internet] Power consumption and energy efficiency in the internet
- [IETFRFC9110] HTTP Semantics
- [IETFRFC9111] HTTP Caching
- [jones2021podcast] Current Challenges and Future Directions in Podcast Information Access, DOI=10.1145/3404835.3462805
- [koomey2021internet] Does not compute: Avoiding pitfalls assessing the Internet's energy and carbon impacts
- [kroll2024practices] feed reader best practices
- [kroll2024roundup] Another feed reader score roundup
- [krug2014networking] Understanding the environmental costs of fixed line networking
- [LN2024podcasts] Podcast Stats: How many podcasts are there?
- [ofcom2024listening] Audio listening in the UK
- [ORSStracker] Open RSS Issues
- [ou2012ARM] Energy- and Cost-Efficiency Analysis of ARM-Based Clusters
- [PBJ2024creation] Podcast creation [archive 2024-04-19]
- [podcastindustryinsights] Podcast Industry Insights
- [podcast-standard] Podcast de facto Standard
- [RAB2009RSS] RSS 2.0 Specification
- [rime2022podcast] What is a podcast? Considering innovations in podcasting through the six-tensions framework
- [steven2023solar] Building and Monitoring a Solar-Powered Web Server
- [sullivan2019platforms] The platforms of podcasting: Past and present
- [varghese2014greening] Greening web servers: A case for ultra low-power web servers
- [zong2024umwelt] Umwelt: Accessible Structured Editing of Multi-Modal Data Representations
(Count: 26)