To be updated intermittently...
Having been egged on by Nibbler (a free tool for testing websites), I now have simple print CSS support; basically it hides anything not appropriate for a printed copy, such as site navigation, ads and search. Simples.
The Apache configuration had been trimmed right down to conserve memory; I have beefed it up a bit just in case the site suddenly becomes popular. Also, I discovered a config error in passing (trying to serve a site for which the IP address is not even local) which I cleaned up!
I have been wondering whether the XML sitemap lastmod element should reflect (significant) updates to page content, or the actual timestamp of the page, which may change for purely stylistic updates or even just to keep make happy, in effect.
I'd prefer the former, like the HTTP ETag 'weak' validation semantics, and happily Google's Webmaster Central Blog (Oct 2014) Best practices for XML sitemaps & RSS/Atom feeds says that the lastmod value should reflect "the last time the content of the page changed meaningfully."
So I have updated my sitemap generator to use the source file date rather than the output file date, which also means that it can depend on the input rather than output files in the makefile (ahoy extra parallelism).
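As a rough illustration of the difference (a Python sketch rather than the makefile actually used, with hypothetical file names), the lastmod value can be taken from the modification time of the source file, which moves only when the content is edited, rather than from the generated output file, which moves on every rebuild:

```python
import os
import tempfile
from datetime import datetime, timezone


def lastmod_date(path):
    """Return a file's modification time as a UTC YYYY-MM-DD string."""
    mtime = os.stat(path).st_mtime
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")


def demo():
    """Show source date vs output date for one page (names are hypothetical)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "page.src.html")  # hand-edited source (assumed name)
        out = os.path.join(d, "page.html")      # generated output (assumed name)
        for p in (src, out):
            with open(p, "w") as f:
                f.write("placeholder")
        # Content last edited on 1 June; page regenerated on 17 June.
        t_src = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc).timestamp()
        t_out = datetime(2017, 6, 17, 17, 35, tzinfo=timezone.utc).timestamp()
        os.utime(src, (t_src, t_src))
        os.utime(out, (t_out, t_out))
        return lastmod_date(src), lastmod_date(out)


if __name__ == "__main__":
    source_date, output_date = demo()
    print("lastmod from source:", source_date)
    print("lastmod from output:", output_date)
```

Only the source date tracks meaningful content changes here; the output date would follow any rebuild, stylistic or otherwise.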
Even when sticking with dates (no timestamps, since intra-day changes are not hugely meaningful for this site), the size of the compressed data (ie gzipped over the wire) can be expected to go up as there will usually be more date variation now.
% ls -al sitemap.xml
15813 Jun 17 17:19 sitemap.xml
% gzip -v6 < sitemap.xml | wc -c
85.7%
2285

% ls -al sitemap.xml
15813 Jun 17 17:35 sitemap.xml
% gzip -v6 < sitemap.xml | wc -c
84.6%
2457
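The effect can be reproduced in miniature. The Python sketch below uses made-up URLs and dates (none of them from the real sitemap) to build two sitemap bodies of identical raw size, one where every page shares a date and one where the dates vary, and compares their gzip level 6 sizes:

```python
import gzip


def sitemap(dates):
    """Build a minimal sitemap body with one <url> entry per date."""
    rows = [
        "<url><loc>https://example.org/page-%d.html</loc>"
        "<lastmod>%s</lastmod></url>" % (i, d)
        for i, d in enumerate(dates)
    ]
    return ("<urlset>\n" + "\n".join(rows) + "\n</urlset>\n").encode("utf-8")


def gzipped_size(data, level=6):
    """Return the number of bytes after gzip compression at the given level."""
    return len(gzip.compress(data, compresslevel=level))


N = 130  # roughly the page count mentioned in the text; otherwise arbitrary
uniform = ["2017-06-17"] * N
# Deterministic but varied dates, purely illustrative.
varied = ["2017-%02d-%02d" % (1 + (i * 7) % 12, 1 + (i * 13) % 28) for i in range(N)]

raw_uniform = sitemap(uniform)
raw_varied = sitemap(varied)

if __name__ == "__main__":
    print("raw bytes:    ", len(raw_uniform), len(raw_varied))
    print("gzipped bytes:", gzipped_size(raw_uniform), gzipped_size(raw_varied))
```

The raw sizes match because every date has the same length, while the varied dates give the compressor less repetition to exploit.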
I note that the best practices document suggests pinging Google (and presumably other search engines too) after updating the sitemap. That could be automated to be done (say) overnight at most once per day, to avoid multiple pings as I do a stream of micro-updates, though I think that Google typically does recheck the sitemap daily anyway, from recent observations.
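The "at most once per day" rule might be sketched as below. This is only an outline in Python: the function name and the idea of remembering the last ping time are assumptions, and the ping request itself is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def should_ping(sitemap_changed_at, last_ping_at, now,
                min_interval=timedelta(days=1)):
    """Decide whether to notify search engines about the sitemap.

    Ping only if the sitemap has changed since the last ping, and the last
    ping was at least min_interval ago (or there has been no ping yet).
    """
    if last_ping_at is None:
        return True
    if sitemap_changed_at <= last_ping_at:
        return False  # nothing new to announce
    return now - last_ping_at >= min_interval


if __name__ == "__main__":
    last = datetime(2017, 6, 17, 2, 0, tzinfo=timezone.utc)
    changed = last + timedelta(hours=1)
    # A stream of micro-updates an hour after the last ping: hold off.
    print(should_ping(changed, last, now=last + timedelta(hours=2)))
    # The next overnight run, a day later: go ahead.
    print(should_ping(changed, last, now=last + timedelta(days=1)))
```

Run from an overnight job, a check like this collapses any number of micro-updates during the day into a single ping.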
I've also been working on improving the semantic structure of the
generated HTML pages, eg with
and trying to ensure that 'outliner' output looks sensible too.
That should help both search engines and anyone with a screen reader.
Today's displacement activity has been extending the makefile to create/update an XML sitemap whenever one of the main HTML pages is updated.
At the moment because this is in the main edit-generate-edit cycle while I am hacking a page, and is not instant, and because Google seems to be refusing explicitly to index my mobile/alternate pages anyway, I'm only doing this for the desktop/canonical pages for now.
# XML sitemap with update times (for generated HTML files).
# Main site; core pages + auto-updated.
sitemap.xml: makefile $(PAGES)
	@echo "Rebuilding $@"
	@lockfile -r 1 -l 120 $@.lock
	@/bin/rm -f $@.tmp
	@echo>$@.tmp '<?xml version="1.0" encoding="utf-8"?>'
	@echo>>$@.tmp '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
	@for f in $(URLLISTEXT); do \
	    echo '<url><loc>'$(URLLISTPREFIX)$$f'</loc><changefreq>hourly</changefreq></url>'; \
	    done >>$@.tmp
	@for f in $(PAGES); do \
	    echo '<url><loc>'$(URLLISTPREFIX)$$f'</loc><lastmod>'`date -r$$f -u +'%Y-%m-%d'`'</lastmod></url>'; \
	    done | (export LC_ALL=C; sort) >>$@.tmp
	@echo>>$@.tmp '</urlset>'
	@-chmod -f u+w $@
	@chmod -f 644 $@.tmp
	@/bin/mv $@.tmp $@
	@chmod a+r,a-wx $@
	@/bin/rm -f $@.lock $@.tmp
all:: sitemap.xml
The main intent is to use the lastmod flag to quickly and efficiently signal to crawlers and search engines that particular pages have been updated and should be recrawled soon, rather than them having to guess when to respider and check.
For the auto-updating pages (GB grid carbon intensity) I am using changefreq (at 'hourly', even though it's really every 10 minutes) instead of lastmod, in part so that the XML sitemap does not need to be updated whenever one of them is changed.
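The split between the two kinds of entry can be shown with a small Python sketch. It mirrors the shape of the makefile rule but is not a translation of it; the URL prefix and page names are placeholders, and the dates are passed in rather than read from files.

```python
from xml.sax.saxutils import escape


def url_entry(loc, lastmod=None, changefreq=None):
    """Format one <url> element with optional lastmod and changefreq."""
    parts = ["<loc>%s</loc>" % escape(loc)]
    if lastmod:
        parts.append("<lastmod>%s</lastmod>" % lastmod)
    if changefreq:
        parts.append("<changefreq>%s</changefreq>" % changefreq)
    return "<url>" + "".join(parts) + "</url>"


def build_sitemap(prefix, auto_updated, pages):
    """Build sitemap text.

    auto_updated: names of frequently regenerated pages (changefreq only).
    pages: mapping of page name to its YYYY-MM-DD lastmod date.
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    # Auto-updating pages: no lastmod, so the sitemap need not be rebuilt
    # every time one of them is regenerated.
    lines += [url_entry(prefix + n, changefreq="hourly") for n in auto_updated]
    # Ordinary pages: lastmod signals when a recrawl is worthwhile.
    lines += sorted(url_entry(prefix + n, lastmod=d) for n, d in pages.items())
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_sitemap(
        "https://example.org/",                      # placeholder prefix
        ["auto.html"],                               # placeholder names
        {"b.html": "2017-06-15", "a.html": "2017-06-11"},
    ))
```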
Although the raw XML file is much larger than the simple URL list, after compression the difference is much less marked.
% ls urllist.txt sitemap.xml
15612 Jun 15 09:10 sitemap.xml
8285 Jun 15 09:10 urllist.txt
% gzip -v6 < urllist.txt | wc -c
75.7%
2032
% gzip -v6 < sitemap.xml | wc -c
85.4%
2298
The XML file should be updated after every page refresh to capture the lastmod signal for crawlers, whereas the urllist.txt file only needs updating when the set of HTML pages changes, eg when a new article is created.
I also added a robots 'noindex' meta tag to the site guide (aka HTML site map) to try to keep it out of the search engines, since it's not very useful to a visitor direct from such an engine. Likewise for the 'other links' page.
It suddenly occurred to me in the bath that EOU / Earth.Org.UK / Earth Notes really is 10 (and a bit) years old.
And yes, the page is a bit heavy, but we can blow 120kB on then-and-now screenshots every decade; party like it's 2099!
And yes, there is a broken link in that 2007 screenshot. Here's the missing image!
The basic site structure has been kept fairly simple, with all the main pages at top level, which thus made creation of the parallel mobile (m.) site relatively easy. Once upon a time when operating systems scanned directories linearly the ~320 current entries in that master directory might have resulted in a speed penalty to serve pages, but with filesystem caching and other smarts, less so.
Almost everything other than HTML objects has now been moved out of the top directory: for example, images and other immutable stuff under img/, updating graphs and the like under out/, and data sets (growing/static) under a directory of their own.
The HTTP server provides extended expiries for objects in some of these directories, and for slower-updating objects under out/, to help caching.
Growing from a 1-page site to a more complicated 100+ page site with consistent headers, footers, and look and feel has required increasing use of CSS (currently kept very small, and inlined), and other meta-data at/near each raw page header.
<h1>Earth Notes is 10!</h1>
<div class="pgdescription">10 years of getting greener...</div>
<!-- meta itemprop="datePublished" content="2017-06-11" -->
<!-- SQTN img/EOUis10/10.png -->
<!-- EXTCSS img/css/fullw-20170606.css -->
These lines are extracted from the raw internal HTML source, stripped out, and reconstituted into:
- The page title used in various ways and a wrapped-up new H1 tag.
- A description and sub-head and other uses.
- A first-publication date in various places, including the footer.
- The page 'image' for social media and structured data / microdata.
- Some extra CSS injected into the page head.
The first four of those allow better support for various forms of page markup, social media and microdata for search engines.
The last allows me to inject a tiny bit of extra (and versioned) CSS into the page header to allow the screenshots to expand out of the normal page container to up to the full viewport for newer browsers.
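The extraction step might look something like the following Python sketch. The real site build is not shown here, so the regular expressions and the dictionary keys are assumptions based only on the five header lines quoted above; in particular, treating SQTN as the page image follows the order of the list rather than any documented meaning.

```python
import re

RAW = """<h1>Earth Notes is 10!</h1>
<div class="pgdescription">10 years of getting greener...</div>
<!-- meta itemprop="datePublished" content="2017-06-11" -->
<!-- SQTN img/EOUis10/10.png -->
<!-- EXTCSS img/css/fullw-20170606.css -->
"""

# One pattern per header line; key names are illustrative only.
PATTERNS = {
    "title": re.compile(r"<h1>(.*?)</h1>"),
    "description": re.compile(r'<div class="pgdescription">(.*?)</div>'),
    "datePublished": re.compile(
        r'<!-- meta itemprop="datePublished" content="([^"]+)" -->'),
    "image": re.compile(r"<!-- SQTN (\S+) -->"),
    "extcss": re.compile(r"<!-- EXTCSS (\S+) -->"),
}


def extract_meta(raw):
    """Pull the page meta-data out of the raw header lines."""
    meta = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(raw)
        if match:
            meta[key] = match.group(1)
    return meta


if __name__ == "__main__":
    for key, value in extract_meta(RAW).items():
        print("%-14s %s" % (key, value))
```

Each extracted value can then be reused wherever it is needed (title element, social media tags, footer date and so on) without being repeated by hand in the source.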
Ah yes, page containers.
Until recently EOU was a fully fluid layout which didn't work very
well on either very wide or very narrow devices. So first I added
the standard boilerplate
header, and then I wrapped up the body in a
div container with an eye-friendly
and I also made images responsive in a number of ways from
max-width of 100% for big images (or 50% or 33% for floats)
up to playing with
To also optimise the mobile version of the site there are directives to select bits of the HTML for only desktop (or mobile), usually to omit some of the heavier and less-important stuff for mobile. Also, for a couple of things like favicon.ico and some of the social media buttons support, to minimise round-trips during page-load for mobile eg for redirects, there are copies of a couple of key objects on the m. site.
Oh, and today's playtime is splitting up my sitemap into HTML pages
and data directory indexes so that I can track search engine indexing
better (and because it keeps my makefile simpler), which means
that I also now have two
Sitemap entries in my
(PS. The British Library is busy crawling its nominally annual copy of the site at about an object per second or a little less. That could cover the ~3000 current data files in under an hour, but somehow is taking much longer!)
With much of the page microdata markup there is a choice of adding/extending the HTML tags, or adding JSON-LD script elements.
An advantage of the HTML route is that it is potentially easier to ensure that it is kept in sync with what is being marked up, if it is on the page.
For data sets not on the page, JSON-LD may be better by allowing more detail to be provided than makes sense to display in the HTML page, and Google et al are unlikely to assume 'cloaking', ie showing the search engines something different than the user sees, which used to be a staple of "WebSPAM" and "Made-For-AdSense" ("MFA").
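For the JSON-LD route, a minimal Python sketch of generating the script element for a data set follows. The property values are placeholders, not the site's real markup, and schema.org's Dataset type is used purely as an example.

```python
import json


def jsonld_script(data):
    """Wrap a dict as a JSON-LD script element for inclusion in a page."""
    payload = json.dumps(data, indent=2, sort_keys=True)
    # "\/" is a valid JSON escape for "/", so this keeps a stray "</" in any
    # value from closing the script element early.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n%s\n</script>' % payload


# Placeholder description of a data set that is not shown on the page itself.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example data set",
    "description": "Illustrative description only.",
    "url": "https://example.org/data/example/",
}

if __name__ == "__main__":
    print(jsonld_script(dataset))
```

Because the element is self-contained, it can be emitted anywhere in the page, including right at the end.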
In all cases I want to meet the intent of the ExNet style guide which is to have the above-the-fold content/information rendered within the first ~10kB delivered to the browser so that the user perceives speed. To this end, whichever format allows me to move some of the meta-data later in the page text delivery, below the fold or at the end, is potentially better, since the meta-data is not needed at page load, but off-line in the search engines' secret lairs.
(Hmm, now I'm adding link prev/next items to the head for clear sequences of pages. Not many pages, and the extra early page weight is not high...)
Having performed a month-long experiment using Cloudflare as a CDN for this site's static content, I've redirected that traffic back to the main (www) server, as only maybe 50-70% was being cached by Cloudflare, and the rest was slower and probably taking more resources (time, energy, carbon) overall to serve. A quick WebpageTest suggests that there is no apparent performance penalty for doing so for the typical visitor. There are a few extra connections to this server to support HTTP/1.1, especially for those pages with lots of inline objects (eg images) that would otherwise be multiplexed down a single (Cloudflare) HTTP/2 connection. I may need to tweak the local Apache to support a few more concurrent connections.
I have been larding the site up with structured data, including converting all pages to instances of schema.org/Article, which should be candidates for Google's "rich cards" (an extension of "rich snippets") in search, though there is no sniff of that yet. Google is reporting the structured data in the Webmaster tools console.
I'm not convinced that any of this microdata helps in any clear way, though I do like the idea of making meta-data less ambiguous and consistently available, and data sets more discoverable. (I don't see much evidence of direct benefit for SEO/SERP, other than appearance and helping the search engines understand content.)
Nominally the site now has 133 articles, including the home page and the HTML site-map page.
I relented and added site maps for the main and mobile sites a couple of weeks ago. Google finally reports having indexed nearly all the main-site pages, creeping up slowly from the initial (~90%) level already in place when the map was added, but for mobile pages it is still stuck at 8 (out of 133). Having put all the canonical/alternate header links in before, I don't understand why the mobile site is everything or zero. Maybe it is effectively zero. Still, I'm getting ~25% of organic searches coming into the mobile pages, so...
Sources and Links
- Best practices for XML sitemaps & RSS/Atom feeds.
- JSON for Linking Data.
- Structured Web page data basics.
- The ExNet style guide which has adjusted gently since the mid '90s, and always aims to make pages work on restricted devices and slow connections, as well as full-featured desktop browsers.
- Helpful tools for testing site usability, compliance, etc, and minifying CSS, images too: