Earth Notes: On Website Technicals (2022-12)

Updated 2024-01-25 20:00 GMT.
By Damon Hart-Davis.
Tech updates: citationID, Mastostorm, server down, mail system move, dataset archive, database cross-check, static Gallery, Xmas slump.
Further bibliography fun is in hand, and a long-overdue resurrection of the Multimedia Gallery...

2022-12-29: Christmas Slump

Screenshot 20221229 GSC Performance clicks Xmas slump
Slump in Google search clicks as reported by GSC (Google Search Console), starting approximately Thursday 15th December, when people may have stopped searching from work, with the nadir on the 25th!

2022-12-18: Static Gallery

cactus flowers: one of the first Gallery exhibits

I now have a minimal static Gallery site laid out on sencha (the new-ish main Raspberry Pi server).

Minimal: a home page, raw exhibit files, and an HTML landing page for each exhibit. (The accession files beside the exhibits are also available.) Lots of best practice is absent at this scale.

There is also dark mode support due to some header stuff copied from EOU!

All of this is done with fairly bare scripts and Posix utilities, eg find and rsync.

A little more patching up of missing exhibits, and removing of bogus ones, happened. A more thorough check of possible loss and damage (eg with MD5) is needed.
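As a rough illustration of what such an MD5 check could look like, here is a minimal Python sketch. It is not the tooling used for the Gallery: the function names (md5_of_file, manifest, compare) are hypothetical, and it simply builds a path-to-digest map for each of two trees and reports what is missing, added or changed.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each regular file's path (relative to root) to its MD5 digest."""
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as missing from new, added in new, or changed in content."""
    return {
        "missing": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }
```

A saved manifest from a known-good copy could later be compared against a fresh one to spot silent damage. MD5 is adequate here for detecting accidental corruption, though not deliberate tampering.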

In the middle of all of this sencha rebooted, after well over 200 days uptime. I should probably fsck soon.

Next up was to have sencha serve the Gallery's IP address, and have Apache serve the site and any aliases. Simplifying the cruft in DNS down to the obvious gallery.hd.org and www.gallery.hd.org and pointing them at the existing primary address seems good! (The Gallery need not have a unique IP address.)

Amazingly, it works!

With the very basics working, I'm doing some tidy-up at the edges. Some of this is trimming state space in the search engines' heads.

  • Redirect (301, permanent) of any URL with a hostname that is not the canonical gallery.hd.org to the canonical.
  • Kill off all (now defunct) query parameters: any URL carrying them is redirected (301, permanent) to the same URL stripped of them.
  • Exhibit cache life set to be one year (ie nominally forever); for now everything else (other than site furniture) is a little over a month.
  • Created robots.txt partly to rein in bad bots, but also to link to a sitemap in due course.
  • Created sitemap.xml, initially containing only the loc entry per URL to keep things fast and simple. Google has been having difficulty reading anything over ~5k entries (Couldn't fetch), but Bing seems fine, and various verifiers are happy with it...
  • Set up DNS and Apache to accept inbound links from many of the old aliases/mirrors, which are then redirected to the canonical URL.
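The two redirect rules above can be sketched in Python as a single function that decides whether a URL needs a 301 and, if so, where to. This models the logic only; the real work is done in Apache configuration that is not shown here. The function name redirect_target and the example paths are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "gallery.hd.org"


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301-redirect to, or None if url is already canonical.

    A URL is canonical when its host is CANONICAL_HOST and it has no query string.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == CANONICAL_HOST and not parts.query:
        return None
    # Keep scheme and path; force the canonical host; drop query and fragment.
    return urlunsplit((parts.scheme, CANONICAL_HOST, parts.path or "/", "", ""))
```

For example, redirect_target("http://www.gallery.hd.org/a/b.html?x=1") gives "http://gallery.hd.org/a/b.html", so both rules are satisfied with one redirect rather than a chain of two.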
Screenshot 20221229 GSC Sitemaps gallery indigestion
GSC visibly having difficulty loading any significant fraction of the ~40k-entry static XML sitemap for the Gallery. The nk versions are about the first n000 entries from the full map.
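A loc-only sitemap, and the truncated "nk" variants of it, are simple to generate. The following is a minimal sketch rather than the script actually used: sitemap_xml and its limit parameter are hypothetical names, and the list of URLs is assumed to have been gathered already (for example by walking the exhibit tree).

```python
from typing import Optional, Sequence
from xml.sax.saxutils import escape


def sitemap_xml(urls: Sequence[str], limit: Optional[int] = None) -> str:
    """Build a minimal XML sitemap with only a <loc> per URL.

    If limit is given, only the first `limit` URLs are included, which is
    one way to produce cut-down test maps from the full list.
    """
    chosen = urls if limit is None else urls[:limit]
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    # escape() handles &, < and > so that URLs stay valid XML text.
    lines += [f"<url><loc>{escape(u)}</loc></url>" for u in chosen]
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"
```

Calling sitemap_xml(urls, limit=5000) would give a map of about the first 5000 entries, to help find the size at which a consumer starts to struggle.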

2022-12-12: Gallery Patching

I expect a sprawling archive such as the Gallery's to accumulate bit rot.

In an attempt to detect this, I compared the new and old copies of the exhibit database file trees byte-by-byte:

% rsync -nirc /local/galleryDB/photos/ /mnt2/galleryDB/photos
skipping non-regular file "_i18n"
skipping non-regular file "locationDB.properties"
>fc.T...... clothing/_more2012/_more09/child-clothes-4-6-four-to-six-girl-and-2-3-two-to-three-boy-to-pass-on-or-give-to-school-or-charity-shops-skirts-pyjamas-trousers-slippers-cardigans-jeans-socks-shirts-32-JR.jpg
>f+++++++++ light/_more2021/_more05/.accession.LED-multicolour-WiFi-LIFX-Mini-Colour-pendant-in-lampshade-green-1-DHD.jpg.xml
>f+++++++++ light/_more2021/_more05/LED-multicolour-WiFi-LIFX-Mini-Colour-pendant-in-lampshade-green-1-DHD.jpg
>f+++++++++ mechanoids/_more2020/_more11/phone-cordless-Siemens-Gigaset-AL180-ECO DECT-1-DHD.jpg
>f+++++++++ mechanoids/_more2020/_more11/phone-cordless-Siemens-Gigaset-AL180-ECO DECT-2-DHD.jpg
>fc.T...... places-and-sights/_more2005/_more12/England-London-Kingston-Market-Place-German-Christmas-market-stalls-shopping-gifts-handmade-presents-toys-sweets-food-trinkets-back-12-DHD.jpg
>fc.T...... places-and-sights/_more2011/_more08/England-Isle-of-Wight-Sandown-Zoo-interesting-lepidopteran-black-with-white-striped-resting-on-bullet-point-on-signboard-6-DHD.jpg
>fc.T...... places-and-sights/_more2020/_more05/England-London-Kingston-Bonner-Hill-Cemetery-on-sunny-May-bank-holiday-lockdown-quiet-bright-green-flowers-grass-trees-20200525-114652-DHD.jpg

The current copy of the child-clothes image appears to be intact. The old one appears broken (truncated). (Others marked >fc.T...... seem more subtly damaged/changed.)

The LED-multicolour-WiFi image appears to be of flowers in grass. It appears to be redundant, so has been removed along with the accession file.

The ECO DECT (note the space rather than dash) files appear to be duplicates of files with dashes, so have been manually removed.

The current copy of the England-London-Kingston-Market-Place image seems to be intact.

The current copy of the England-Isle-of-Wight image seems to be intact.

The current copy of the England-London-Kingston-Bonner-Hill image seems to be intact.
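With a longer report it may help to sort the rsync output mechanically before inspecting files by hand. Here is a minimal Python sketch that parses lines in the itemized (-i) format shown above: an 11-character flag field, a space, then the path. The names Item and parse_itemized are hypothetical, and only the two cases seen in this run (new file, checksum differs) are picked out.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    flags: str              # the 11-character itemized-changes field
    path: str               # path relative to the transfer root (may contain spaces)
    is_new: bool            # True when the file does not exist on the receiving side
    checksum_differs: bool  # True when content differs (the 'c' flag)


def parse_itemized(line: str) -> Optional[Item]:
    """Parse one 'rsync -i' line; return None for other output (eg 'skipping ...')."""
    if len(line) < 13 or line[11] != " " or line[0] not in "<>ch.":
        return None
    flags, path = line[:11], line[12:]
    return Item(flags, path, flags[2:] == "+" * 9, flags[2] == "c")
```

Filtering for checksum_differs would list just the files whose old and new copies disagree, which are the ones needing a visual check.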

2022-12-11: Archive green

I have taken fresh 'solid' LZMA2 full archives with xz of some of the key data sets that were collected on the just-decommissioned RPi 2B+ host green.

These were:

  • CPU temperature records of green itself: data/RPi/cputemp/202011-RPi2.log.xz to data/RPi/cputemp/202212-RPi2.log.xz
  • Frequent-sample extracts for Enphase: data/16WWHiRes/Enphase/adhoc/Enphase-20180809-to-20200921.log.xz
  • Daily full JSON samples for Enphase: data/16WWHiRes/Enphase/adhoc/Enphase-20180817-to-20200920.production.json.tar.xz
  • Local raw old-format temperature records at the OpenTRV (REV2) receiver: data/OpenTRV/pubarchive/localtemp/20140418-to-20200821.log.xz
  • Remote old-format 16WW OpenTRV data: data/OpenTRV/pubarchive/remote/20150601-to-20160912.log.xz
  • Remote new-format 16WW OpenTRV data: data/OpenTRV/pubarchive/remote/20150601-to-20200821.json.xz
  • 10-minute logs from powermng: data/powermng/powermng-green-logs-20150101-to-20221206.xz

These mainly run until when sencha became the primary data-collection and logging server.

Some data, such as the 1-minute SunnyBeam generation logs, had already been archived.

Most of these should in principle be triply-redundant. Anything valuable should already have been captured. However, undetected corruption of data, including missed logs, may have happened, and this provides another route for data recovery.

Also the solid continuous form of some of these archives may be useful.
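For illustration, a solid archive of this sort can also be made and checked from Python's standard lzma module, which writes the same .xz container (LZMA2) as the xz tool. This is a minimal sketch, not the procedure actually followed: solid_xz and verify_xz are hypothetical names, and the input is assumed to be a set of plain log files to be concatenated in order.

```python
import lzma
from pathlib import Path


def solid_xz(sources: list[Path], dest: Path) -> int:
    """Concatenate sources in order into one xz stream; return raw bytes written."""
    total = 0
    with lzma.open(dest, "wb", preset=9) as out:
        for src in sources:
            data = src.read_bytes()
            out.write(data)
            total += len(data)
    return total


def verify_xz(dest: Path) -> int:
    """Decompress fully, which validates the stream's integrity check, and
    return the uncompressed length. Raises lzma.LZMAError on corruption."""
    total = 0
    with lzma.open(dest, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            total += len(chunk)
    return total
```

Comparing the two returned lengths is a cheap sanity check that the archive holds everything that went in.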

Card adaptor power suck

It seems that mounting the old filesystems (read-only) on the RPi 3B via my micro SD card USB adaptor chain was using as much power as keeping the entire old server running (~900mW), according to measurements with powermng. That shows how efficient the old RPi was, and how inefficient random bits of consumer tech can be.

I've now adjusted the footer on every page back to saying server consumption is ~1W, rather than the ~2W that has been there for ~2Y!

Per-CPU power

Observation suggests that the RPi 3B (sencha) uses ~1W extra per busy CPU.

2022-12-10: Nano-optimisation

The HTML 5 minimiser does not seem to be reliably doing one of the operations that it is set to: sorting the class names in a class="..." attribute.

Doing so may slightly improve compression. In any case it should not hurt.

I manually edited various scripts that are generating HTML so as to generate class attributes with the class names already sorted.
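The transformation itself is small. As a sketch of what sorting the class names means, here is a naive regex-based Python version. It is not the minimiser's code: sort_class_names is a hypothetical name, and the regex assumes double-quoted attributes and would also rewrite any matching text outside of tags.

```python
import re

# Match class="..." but not, say, data-class="...".
_CLASS_ATTR = re.compile(r'(?<![\w-])class="([^"]*)"')


def sort_class_names(html: str) -> str:
    """Rewrite each class="..." attribute with its names sorted and single-spaced."""
    def _sorted(match: re.Match) -> str:
        return 'class="' + " ".join(sorted(match.group(1).split())) + '"'
    return _CLASS_ATTR.sub(_sorted, html)
```

Having every class list in one consistent order means that repeated combinations become byte-identical strings, which is where any compression gain would come from.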

2022-12-07: You've Got Mail!

Well, I do not! The RPi 2 (B+) that was running mail (sendmail and dovecot) and other things is now sitting on my desk. I have removed its micro SD card and mounted relevant partitions (read-only!) via a USB reader on the production RPi 3B to be able to quickly access and copy over files and configuration.

% df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sda2       3.5G  3.0G  409M  88% /mnt
/dev/sda3       115G  107G  2.4G  98% /mnt2

Looking first at dovecot config differences between the old (2B+) and new as-shipped main config I have decided to preserve all the changes in /etc/dovecot/conf.d/10-master.conf to improve security and reduce resource use. The changes also turn off IMAP service, and plain unprotected POP3, and enable POP3S on port 995. I added port = 0 to both normal and secure IMAP config to try to ensure that only POP3 is activated.

In /etc/dovecot/conf.d/10-ssl.conf I am requiring SSL. I have to provide the cert and key .pem files to make this work.

I have obtained and modified dovecot-openssl.cnf and mkcert.sh to do this, see for example SSL certificate creation.

At this point dovecot will start, listening on port pop3s (995).

In DNS I changed pop3.exnet.com to point to the RPi 3B.

On my MacBook I turned WiFi off and on and restarted my mail client to flush DNS caches.

My MacBook mail client was then able to connect to the POP3 (RPi 3B) server and collect the dross that had been accumulating in the mailbox for me there!

(This process was actually more messy, and had help from the MBA client's "Connection Doctor" and telnet and various logs, etc!)

Note that these changes feel too intricate/delicate to enforce via Ansible, at least for now. So I will have to re-do by hand these bits of config when moving mail next time, as things currently stand.

sendmail

I folded in my previous /etc/aliases to that of the RPi 3B, and aliased another old admin ID to route to me.

I ported across my sendmail configuration more-or-less as-is. There seemed to be a couple of new features, documented as 'safe', that I allowed / set up. I turned off allmasquerade however.

One gotcha that got me again is sendmail hanging, eg when trying to rebuild aliases, due to /etc/hosts not giving the fully-qualified domain name for itself. This is bad:

X.X.X.X    sencha

This is good, and lets sendmail get stuff done rather than repeatedly sleeping for 60s hoping that things will change:

X.X.X.X    sencha.exnet.com sencha
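A quick pre-flight check for this gotcha could look like the following Python sketch. It takes the text of a hosts file and reports whether the first name on the host's line is fully qualified, as in the good example above. The function name is hypothetical, and treating "first name has a dot after the short name" as sufficient is an assumption based only on the two examples here.

```python
def canonical_name_is_fqdn(hosts_text: str, short_name: str) -> bool:
    """True if the first hosts line mentioning short_name lists an FQDN first."""
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # strip comments, split on whitespace
        if len(fields) < 2:
            continue
        names = fields[1:]  # fields[0] is the address
        if short_name in names or any(n.startswith(short_name + ".") for n in names):
            return names[0].startswith(short_name + ".")
    return False
```

Run against the contents of /etc/hosts with the short hostname, this returns False for the bad line and True for the good one.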

Due to the sheer volume of SPAM attempts (at least historically), some mail-related logging should be reduced in /etc/rsyslog.conf, partly to reduce wear on the SD card.

Also, an oddity appearing regularly in the mail error logs, though apparently otherwise harmless:

Dec  8 14:13:19 sencha dovecot: log: Error: Received master input for invalid service_fd XX: XX nnnnn BYE
Dec  8 14:13:20 sencha dovecot: log: Error: Received master input for invalid service_fd YY: YY mmmm BYE
Dec  8 14:13:21 sencha dovecot: log: Error: Received master input for invalid service_fd ZZ: ZZ nnnnn BYE

In any case, mail seems to be broadly working, hurrah!

Also this seems like a good idea, to allow service startup on reboot!

% sudo systemctl enable dovecot.service
Synchronizing state of dovecot.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable dovecot
% sudo systemctl enable sendmail.service
sendmail.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable sendmail

2022-12-05: Machine Down!

Mail server (green) not responding after power cycling last night and this morning; may require TLC plugged into TV. Long overdue (more than a year) to move DNS and mail away... It seems that the SD card interface is glitchy (which may explain observed behaviour over a long time). A few errors had also accumulated, which fsck sorted. Now I am spending several hours with fsck -c -v checking for bad blocks (I don't know if this even works)...

I took the hint to move pekoe to its newly working wired connection officially. That means, amongst other things, pekoe can take over some permanent services from green, such as being a DNS secondary. I'll aim to get that working tomorrow, with the magic of virtual network interfaces meaning that pekoe can in principle take over green's IP address for that service, so all glue records at the various registrars can stay as-is.

Back Up

Even leaving the RPi running the fsck -c -v overnight did not fix things. Some more hand-holding in the morning brought the machine back to a runnable state.

I have moved DNS and NTP services from green to other hosts. I will move POP3/SMTP soon (they are off-line again for a bit). Other services may change, eg the Gallery may initially reappear as a completely static site on sencha.

2022-12-02: [[ citationID ]]

It is now possible to drop into any main page a [[citationID]] or cite tag, at most one per source line, and magic will happen.

Firstly, the reference gets rendered as [citationID], with a link back to the bibliography.html page full reference.

Secondly, a References section is created at the end of the page, with a sorted, de-duplicated list of references. Each of these is linked back to the bibliography.html page. Each also has the title shown. If there is a URL available then the title directly links to it, to help external document access be as simple and direct as possible.

Here is an example: [hart-davis202216ww]

It is also possible to create a <!-- GHOSTREF [[citationID]] --> record on a line by itself to force creation of a References entry without anything else visible in the text.
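The behaviour described above can be modelled in a few lines of Python. This is a sketch of the idea, not the site's build script: render_citations is a hypothetical name, the bibliography.html#ID anchor form is an assumption, and the References list here holds only the IDs (the real one also shows titles and external links).

```python
import re

# IDs are assumed to be letters, digits and hyphens, eg hart-davis202216ww.
_CITE = re.compile(r"\[\[\s*([A-Za-z0-9-]+)\s*\]\]")
_GHOST = re.compile(r"^\s*<!--\s*GHOSTREF\s*\[\[\s*([A-Za-z0-9-]+)\s*\]\]\s*-->\s*$")


def render_citations(lines, bib_page="bibliography.html"):
    """Return (rendered_lines, sorted de-duplicated citation IDs)."""
    ids = set()

    def _link(match):
        cid = match.group(1)
        ids.add(cid)
        # Assumed anchor scheme: one fragment per citation ID on the bibliography page.
        return f'[<a href="{bib_page}#{cid}">{cid}</a>]'

    rendered = []
    for line in lines:
        ghost = _GHOST.match(line)
        if ghost:
            ids.add(ghost.group(1))
            rendered.append(line)  # an HTML comment, so nothing visible in the text
        else:
            # At most one citation per source line is handled, as described above.
            rendered.append(_CITE.sub(_link, line, count=1))
    return rendered, sorted(ids)
```

The sorted ID list is what a References section at the end of the page would be built from.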

These citation IDs are generally all lower-case. For accessibility (a11y), where sections consist of two or more adjacent concatenated words not separated by digits or punctuation, camel case (initial capital) should be used on the second and subsequent concatenated words. [santana2020camel] This is expected (for example) to help screen-readers.

Mastostorm

It was mentioned that when a link is included in a Mastodon post, every instance that sees that post (because it has followers there) GETs the linked page, thus creating a surge of activity on the linked server.

I performed a little experiment at 17:42Z, and here is a slightly anonymised sample from my logs. (I do not believe Mastodon server instance names to be private.)

[04/Dec/2022:17:42:43 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mas.to/) Bot"
[04/Dec/2022:17:42:46 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://dju.social/) Bot"
[04/Dec/2022:17:42:53 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mastodon.green/) Bot"
[04/Dec/2022:17:42:56 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://toot.wales/) Bot"
[04/Dec/2022:17:42:57 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mastodon.me.uk/) Bot"
[04/Dec/2022:17:43:04 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://ohai.social/) Bot"
[04/Dec/2022:17:43:10 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mastodon.energy/) Bot"
[04/Dec/2022:17:43:13 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://macaw.social/) Bot"
[04/Dec/2022:17:43:13 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://toot.community/) Bot"
[04/Dec/2022:17:43:14 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mastodon.scot/) Bot"
[04/Dec/2022:17:43:15 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mastodon.art/) Bot"
[04/Dec/2022:17:43:18 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mstdn.social/) Bot"
[04/Dec/2022:17:43:19 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mastodon.org.uk/) Bot"
[04/Dec/2022:17:43:19 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.0.4 (Mastodon/3.5.5; +https://mastodonapp.uk/) Bot"
[04/Dec/2022:17:43:21 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://fediscience.org/) Bot"
[04/Dec/2022:17:43:24 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://c.im/) Bot"
[04/Dec/2022:17:43:25 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://fosstodon.org/) Bot"
[04/Dec/2022:17:43:26 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://bayes.club/) Bot"
[04/Dec/2022:17:43:27 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://dataprotection.social/) Bot"
[04/Dec/2022:17:43:27 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://nerdculture.de/) Bot"
[04/Dec/2022:17:43:32 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://mastodon.online/) Bot"
[04/Dec/2022:17:43:44 +0000] "GET /bibliography.html HTTP/1.1" "http.rb/5.1.0 (Mastodon/4.0.2; +https://chaos.social/) Bot"

More than 20 hits over a minute in this case. Not that intensive, but worth bearing in mind, especially as the count of following instances rises. Still, the server caches should be nice and warm for the later hits!
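To put numbers on a burst like this, the log sample can be summarised with a short Python sketch. It assumes exactly the trimmed line format shown above (timestamp, request, user agent), which is not a standard Apache log format, and summarise is a hypothetical name.

```python
import re
from datetime import datetime

_LINE = re.compile(
    r'^\[([^\]]+)\] "GET (\S+) [^"]*" '
    r'"[^"]*\(Mastodon/[^;]+; \+(https?://[^/)]+)/?\) Bot"$'
)


def summarise(lines):
    """Return (hits, distinct instances, seconds from first to last hit)."""
    times, instances = [], set()
    for line in lines:
        m = _LINE.match(line.strip())
        if not m:
            continue  # not a Mastodon fetch in the expected format
        times.append(datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z"))
        instances.add(m.group(3))
    if not times:
        return 0, 0, 0.0
    return len(times), len(instances), (max(times) - min(times)).total_seconds()
```

Dividing hits by the span gives a crude request rate for the surge, which could be tracked as the number of following instances grows.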

References

(Count: 2)
