Earth Notes: On Website Technicals (2019/02)

Tech updates: micro-optimisation, isBasedOn, misuse of link rel=prev/next, AMP half-indexed, Google-, soft 404...
This month may prove to be all about getting indexed the ~50% of AMP pages that were still not indexed at the end of January... (Up to ~60% indexed as of the 18th...)

2019/02/18: Soft 404

I am puzzled by Google reporting (in GSC) files such as, with a MIME type in the HTTP header of text/csv, as "Soft 404". There's nothing '404' about it: it's clearly a data file, and present, and behaving as expected, not a missing HTML document for example.

2019/02/13: Google-

Since Google+ is going away in March/April I have removed the social media button for it from desktop/lite pages. (AMP uses a different mechanism.)

While I am having fun, and to save more page weight, I removed the RSS button, since I saw no evidence of it being used.

Thank you again to Share42 for the script and buttons, to TinyPNG for minifying the icons, and to zopfli for minimising the pre-compressed JavaScript!

Page weight (on first load) should now have dropped by more than 180 bytes.

I will probably tidy up the appearance of the float box that includes the now-shorter button bar, in due course...

(The old and new versions of the button bar have distinct paths so that they can coexist as pages are gradually rebuilt and/or old ones live in various caches. At some point I may remove the older files for tidiness. Note also that the desktop and lite JavaScript files, though under different paths, each on their own site to avoid security snafus, are the same object in the repository.)

2019/02/10: AMP 50% Indexed

20190210 AMP 50pc pages indexed from GSC Enhancements view

The number of AMP pages marked as valid/indexed has been wobbling around the 100 (ie ~50%) mark for many days. Note that only one residual AMP error is being reported. (This one is apparently still from Google's internal "crawl issue" bug.) All main canonical pages as listed in sitemap.xml are reported as indexed. So it puzzles me why half of the AMP versions aren't.

2019/02/09: Holding it Wrong: link rel= prev/next

I've been linking sets of pages together, such as in this sequence of tech notes, with manual links in the page body and link rel prev and next in the head. It's slightly tiresome and error-prone work.

Also, the link rel part seems simply to be wrong, eg from "Indicating paginated content to Google":

Note: You should not use this technique merely to indicate a reading list of an article series; you should use this to indicate a single long piece of content that is broken into multiple pages.

I've read various things on this topic, but this seems to be the clearest statement so far.

I've manually removed a couple of manual prev/next pairs between individual article headers as a small quick test and improvement.

But I'd like to do something more systematic for the long series that I have. Eg some fixed metadata that does the right thing in the body of the page, and whatever is appropriate (but probably not prev/next) in the head.

Happily this may trim the head/CRP for all the affected pages. It should certainly save me some manual boilerplate hacking and maintenance over time!

Now for pages marked as SERIES, I automatically insert previous and next links, and breadcrumb structured data, with a link to the head/unnumbered page if extant: Breadcrumb. I'm still tweaking the appearance of the resulting early sidebar.
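As a rough illustration of the automatic insertion described above, here is a minimal Python sketch. It is not the site's actual build code: the page names, the `series_neighbours` and `breadcrumb_jsonld` helpers, and the example.org base URL are all hypothetical, and it emits JSON-LD where the real pages may well use microdata instead.

```python
import json

def series_neighbours(pages, current):
    """Return (previous, next) page names for `current`; None at either end."""
    i = pages.index(current)
    prev_page = pages[i - 1] if i > 0 else None
    next_page = pages[i + 1] if i + 1 < len(pages) else None
    return prev_page, next_page

def breadcrumb_jsonld(base_url, trail):
    """Build schema.org BreadcrumbList JSON-LD from (name, path) pairs."""
    items = [
        {"@type": "ListItem", "position": n, "name": name, "item": base_url + path}
        for n, (name, path) in enumerate(trail, start=1)
    ]
    doc = {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": items,
    }
    return json.dumps(doc, indent=2)

# Hypothetical series, oldest first.
SERIES = ["note-2018-12.html", "note-2019-01.html", "note-2019-02.html"]

if __name__ == "__main__":
    print(series_neighbours(SERIES, "note-2019-01.html"))
    print(breadcrumb_jsonld("https://example.org/",
                            [("Notes", "note.html"), ("2019/02", "note-2019-02.html")]))
```

The point of driving both the body links and the structured data from one ordered list is that a new page in the series needs only one metadata change, rather than hand edits to its neighbours.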

2019/02/03: ImageObject isBasedOn

For hero images used in EOU and derived from external sources, and for which I have a credit/discussion .txt file, I have made two enhancements.

The .txt link now gets an itemprop=discussionUrl. I'm not sure if the semantics are quite right, but it's close.

If the .txt file contains a line of the form isBasedOn: URL then a 'src' link is made after the 'i' link to the given URL with an itemprop=isBasedOn.

isBasedOn Example

Here is a snippet from the foot of the desktop/canonical version of this page as of writing, with some whitespace added for readability:

<strong id=pgMedia>Page Media</strong>:
<span itemprop=image itemscope itemtype=><meta itemprop=width content=1280><meta itemprop=height content=1192>
<a href=img/tools-1280w.png itemprop=url>image</a>
(<a href=img/tools-1280w.png.txt itemprop=discussionUrl>i</a>/
<a href= itemprop=isBasedOn>src</a>)</span>.
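The generation of the 'i' and 'src' links might be sketched in Python roughly as follows. This is only a guess at the shape of the logic, not the site's real generator: the function names are hypothetical, the parsing assumes one `isBasedOn: URL` line in the .txt file with the first such line winning, and the output quotes its href values for simplicity where the real page does not.

```python
import html

def find_is_based_on(txt):
    """Return the URL from the first 'isBasedOn: URL' line, or None if absent."""
    for line in txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "isBasedOn" and value.strip():
            return value.strip()
    return None

def media_links(img_path, txt):
    """Render the '(i/src)' links for an image and its credit .txt content."""
    links = ['<a href="%s" itemprop=discussionUrl>i</a>'
             % html.escape(img_path + ".txt")]
    src = find_is_based_on(txt)
    if src is not None:
        # Only add the 'src' link when the credit file names a source.
        links.append('<a href="%s" itemprop=isBasedOn>src</a>' % html.escape(src))
    return "(" + "/\n".join(links) + ")"

if __name__ == "__main__":
    credit = "Credit: someone\nisBasedOn: https://example.org/original.png\n"
    print(media_links("img/tools-1280w.png", credit))
```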

2019/02/02: Micro-optimisation

Last month I managed to squeak the head/CRP for a particular page under the limit to retain its Twitter video player card, etc.

This was in part through assuming that the embedded player video URL, eg, would not need quoting when used as an attribute value. For this it must not contain spaces nor quotes nor a '>' closing angle bracket.

At the time I could not be sure that the URL would never end in a '/' (slash). If one did, it would not be safe to use unquoted in an attribute at the end of an HTML tag ie ... attrname=value>.
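A check along these lines can be sketched in a few lines of Python. The function name and the `last_in_tag` parameter are hypothetical. The forbidden-character set follows the HTML syntax rules for unquoted attribute values, which are slightly stricter than the characters named above (they also exclude `=`, `<`, `'` and the backtick). The trailing-slash rule is the extra precaution described in this note.

```python
# Characters not allowed in an unquoted HTML attribute value:
# ASCII whitespace, both quote characters, '=', '<', '>' and backtick.
_FORBIDDEN = set(" \t\n\f\r\"'=<>`")

def safe_unquoted(value, last_in_tag=True):
    """Return True if `value` can be emitted as attrname=value without quotes."""
    if not value:
        return False  # an empty value always needs quotes
    if any(c in _FORBIDDEN for c in value):
        return False
    if last_in_tag and value.endswith("/"):
        # Precaution from the text: avoid '...attrname=value/>' at tag end.
        return False
    return True

if __name__ == "__main__":
    for url in ("https://example.org/player.html", "https://example.org/dir/"):
        print(url, safe_unquoted(url))
```

Note that any URL with a query string containing `=` fails this stricter test, so a build step using it would have to fall back to quoting in that case.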

I rearranged the attributes so as to have the URL-containing one not last. But that inconsistency in attribute ordering reduces compressibility.

Today I added checks for raw and Twitter player URL safety, and put the attributes back in the same order that I use elsewhere. The uncompressed form of the page preamble/head/CRP is exactly the same size and semantic content, but the gzip -8 and zopfli output is slightly smaller. The pre-compressed version is made with zopfli, but the CRP size is tested with gzip -8, and the desktop page threshold is currently 1260, aiming to allow some meaningful body text into the first TCP frame sent, after HTTP/1.1 headers.
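For illustration, a size test of this kind might look like the Python sketch below. It is an approximation rather than the actual build check: the function names are hypothetical, and Python's `gzip.compress` at level 8 will not necessarily produce byte-for-byte the same output as the `gzip -8` command line (the header fields can differ, for example), so the numbers should be treated as close rather than identical. The `mtime` argument needs Python 3.8 or later.

```python
import gzip

CRP_LIMIT_BYTES = 1260  # desktop threshold quoted in the text

def gzip8_size(data):
    """Size in bytes of `data` gzipped at level 8, with a fixed mtime
    so that repeated runs give the same answer."""
    return len(gzip.compress(data, compresslevel=8, mtime=0))

def fits_crp_budget(head, limit=CRP_LIMIT_BYTES):
    """True if the compressed preamble/head is within the byte budget."""
    return gzip8_size(head) <= limit

if __name__ == "__main__":
    # Hypothetical head fragment; a real check would read the generated page.
    head = b"<!DOCTYPE html><html lang=en><head><meta charset=utf-8>" \
           b"<title>Example</title></head>"
    print(gzip8_size(head), fits_crp_budget(head))
```

Running the same helper over two variants of a head that differ only in attribute order is a quick way to see whether a reordering helps or hurts compressibility.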

Version | Uncompressed bytes | gzip -8 bytes | zopfli bytes