Earth Notes: A Workshop on Energy Data Best Practice (2019)
Updated 2020-06-12 19:53 GMT.
What does data best practice look like? How can we help potential users discover, search and understand data? Can metadata make data more accessible?
Energy Systems Catapult (ESC) is working with BEIS, Ofgem and Innovate UK to develop best practice guidance for energy data and we need your input!
The maximum value of data can only be realised when potential users are able to discover it, search for related datasets and understand the content of data. The Energy Data Taskforce made recommendations for the industry wide adoption of a lightweight metadata standard and the development and deployment of a Data Catalogue to help address some of these issues.
(Meeting started at ~13:30.)
(~100 attendees at the Conference.)
A lot of attendees are from the 'traditional' energy sector (maybe over half) but also maybe nearly as many on the 'digital' side, with overlap.
This is a continuation of work across Ofgem/BEIS/etc that has been running for a while.
Eg, the Innovate UK competition on Modernising Energy Data Access.
Ofgem as regulator has many interests from economics and security to data use. It also wants to promote innovation and new business models.
What should producers of data be doing to maximise value and innovation?
Presumption of open data.
If the energy sector can do this right (and it's certainly not a done deal yet) then it makes it easier to roll out best practices to other sectors.
Net Zero is going to need lots more intermittent generation and more flexibility and that will require good data sharing. And doing that bi-laterally would be a nightmare. (Cf example of ISDA master agreements in finance.)
And somehow this all has to be done cost-effectively.
Q from floor: is regulated sector and non-regulated covered by this?
A: likely both, eg interacting with regulated sector can use its best practice.
Q: what is "user"?
A: can be a person, but could be wider.
Q: is this all about only energy system data?
A: not just, eg it has to be sensitive to privacy issues too.
Q: have 'you' been following other countries, industries (eg oil and gas)?
A: yes, but do pitch in with good examples and avoid reinventing the wheel.
Q: is there need for government intervention, given good examples such as Google Maps that just work without?
A: this is not about intervention, but rather about agreeing best practice.
Q: is this primarily about primary data, or post-processing?
A: both (eg if a company adds value by processing, maybe not open?)
(I'm now co-opted as note-taker on our table!)
(Academics mainly in our group...)
On our table:
- Jack Kelly
- Jamie Taylor
- Andrew Roberts
- Myriam Neaimeh
- Grant Wilson
- Eoghan McKenna
- Joel Ang
- Damon Hart-Davis
What does discoverable mean? Really difficult to know what that is or should be, or to do it uniformly. A start would be to have an obvious point of contact within each org for access to data. Just agreeing an ontology is a nightmare. But getting it entirely automatable without a human in the loop would be magic. (Queryable like the HTTP Archive?) A place like the UK Data Archive is a good place to go to find static data sets now. A Data Provision Officer? How far does the automation need to go? Note static and updating data sets, eg static may be as simple as a CSV dump.
On metadata: can someone other than the data owner edit/fix the metadata? Eg an editable wiki that fronts the real data set. Implies space for official metadata from the owner, and the 'community' edition. Applies to both structural and descriptive metadata. May be a good idea to centralise and standardise. Or that may mean things never happen. Big orgs that have data, but no time to curate, can dump it out, and have someone else mark it up / add metadata.
Good practice point: always publish units!
(Some big data archives: UK Data Archive, London Data Archive, Administrative Data Research network, Zenodo, UK Energy Data Centre.)
Glossary of terms (ontology?) valuable? Who should own it? When does an (innovative) new term get in? How do we deploy it? What about applying to legacy datasets?
Would be great to have a set of common terms, though it seems unlikely to be universal. But it may be possible to encourage conventions: if (eg) a value is a power, end its name in W (watts); if an energy, in Wh (watt-hours). Thus it may be possible to get more standardised on a core set of data / units. (Hungarian notation for data columns!) Sort of at the dimensional level. UTC ISO 8601 date stamps.
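A minimal sketch of the unit-suffix idea, assuming a hypothetical convention where each column name ends in an underscore followed by its unit. The suffix table and function names are illustrative assumptions, not an agreed standard:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical convention: column names end in "_<unit>".
# This suffix table is an illustrative assumption, not an agreed standard.
UNIT_DIMENSIONS = {
    "W": "power",
    "kW": "power",
    "Wh": "energy",
    "kWh": "energy",
}

def column_dimension(column_name):
    """Return (unit, dimension) for a name like 'import_kWh', or None if no known unit suffix."""
    _, sep, suffix = column_name.rpartition("_")
    if not sep or suffix not in UNIT_DIMENSIONS:
        return None
    return suffix, UNIT_DIMENSIONS[suffix]

def utc_stamp(dt):
    """Format a timezone-aware datetime as an ISO 8601 UTC timestamp with a 'Z' suffix."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: timezone unknown")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Midday British Summer Time (UTC+1) is 11:00 UTC.
bst = timezone(timedelta(hours=1))
print(column_dimension("import_kWh"))
print(utc_stamp(datetime(2019, 7, 1, 12, 0, tzinfo=bst)))
```

Refusing naive datetimes is a deliberate choice here: a timestamp with no known zone cannot be safely converted to UTC.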
(Add an SI unit to schema.org Dataset 'variable measured'?)
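For what it is worth, schema.org's PropertyValue type already has unitText and unitCode properties, so one possible reading of this suggestion is sketched below as JSON-LD built from Python. The dataset name and variable are invented for illustration:

```python
import json

# Illustrative dataset description; the name and variable are made up.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example half-hourly electricity import (hypothetical)",
    "variableMeasured": [
        {
            "@type": "PropertyValue",
            "name": "import_energy",
            "unitText": "Wh",  # SI-derived unit carried alongside the variable name
        }
    ],
}

jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```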
It would be very good if there was a minimum set of data variables needed eg for EV charging without setting fire to cables, and more allows fancier stuff. What should the minimum set of minimums be? EVs, distributed microgen, distributed storage, ... ?
Existing energy operators are not withholding data wilfully, but beyond privacy, don't know what to publish or how, yet.
Privacy-sensitive data (eg home-owners unaggregated half-hour data) should NOT be available in open search. Raw data can be sensitive, then there can be successively more-open derived data sets.
As with credit cards, data should be 'anonymised' in the same agreed way everywhere, eg always show the last 4 digits, not a random part of the card number, which could allow multiple different 'anonymisations' to be stitched together and defeat the intent. Geographic-based data can be clumped to avoid singling out single contributors. This is for security but complicates search and analysis. Should try to be consistent across data sets to help with cross-system use, search, security. Even just use consistent formats (eg, ISO 8601 dates, lon/lat not eastings/northings).
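A rough sketch of the two ideas above: masking that always keeps the same part of an identifier, and snapping locations to a fixed grid. The function names and the 0.1 degree cell size are assumptions for illustration. A real scheme would also need a minimum number of contributors per cell, which this does not check:

```python
import math

def mask_card_number(card_number):
    """Mask every digit except the last 4, always in the same way."""
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) < 4:
        raise ValueError("too few digits to mask")
    return "*" * (len(digits) - 4) + "".join(digits[-4:])

def clump_location(lat, lon, cells_per_degree=10):
    """Snap a lat/lon to the south-west corner of its grid cell (0.1 degree cells by default)."""
    def snap(value):
        return math.floor(value * cells_per_degree) / cells_per_degree
    return snap(lat), snap(lon)

print(mask_card_number("4111 1111 1111 1234"))
print(clump_location(51.5074, -0.1278))
```

Because every publisher keeps the same 4 digits and the same grid, separately released data sets cannot be combined to recover more than any one of them shows.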
BEIS should not own the glossary. OpenStreetMap is an example of how to do it well at a community level.
"Data should be accurately described with industry standard metadata."
Eg including Dublin Core as possible form, capturing at least a good title. A separate file means that the metadata can be public even if the full data isn't, for example.
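As a sketch of the separate-file idea, the snippet below writes a small JSON sidecar using Dublin Core element names. The ".meta.json" naming, the helper name and all the values are assumptions for illustration only:

```python
import json
from pathlib import Path

def write_sidecar(data_path, metadata):
    """Write metadata next to the data file as '<name>.meta.json' and return that path."""
    sidecar = Path(str(data_path) + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar

# Dublin Core element names; the values are invented for illustration.
metadata = {
    "dc:title": "Example substation half-hourly demand (hypothetical)",
    "dc:creator": "Example Network Operator",
    "dc:date": "2019-11-01",
    "dc:format": "text/csv",
    "dc:rights": "Data restricted; this metadata is public",
}
```

Calling `write_sidecar("demand.csv", metadata)` would produce `demand.csv.meta.json`, which could be published even where `demand.csv` itself cannot.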
(What about schema.org, microdata vs JSON-LD?)
At least a couple of us hate XML (inefficient). Prefer YAML, JSON?
Also supply some sample data.
(DHD note: I'm not necessarily agreeing with the suggested definition of metadata, eg not just a single descriptor of the dataset, since IMHO it is a layered thing and next level down could be columns, could be geotags of subsets, etc.)
Format may not be an issue where data is presented through an API, ie is not the sole/archival form.
UK Data Archive has decent description of the data sets. Includes README file, including such delights as 'column X is rubbish, do not use'.
Version control of metadata so you know that you have the right, current stuff. Careful updates/extensions may be able to avoid breaking old code and old uses. Maybe should include the revision cycle, eg "max-age" on an HTTP-served HTML page.
(What happens if people want their parts of a dataset to be deleted, should descriptions of deletions be part of metadata? Can reproducibility be protected anyway? DOI numbers for data? Maybe ORCID IDs for contributors?)
"Data publishers should make sure that data that is made available has the appropriate descriptions and supporting information to make sure the data understandable for potential users."
Should we work harder for open data (units, methodology, sample/model code, etc)?
Is metadata (at the level suggested above) enough? What about use of ontologies, semantic Web, ... ?
Start with some basic practices for timeseries data for energy, then follow some basic rules, such as ISO 8601 timestamps, units present?
Maybe note even when datasets are or are not in particular standard formats.
Some data carriers are broken (like Excel OADate with DST and timezones). Some publishers, eg, don't care about local vs UTC, and don't realise it may matter to other users. Open or closed time intervals matter too: MIDAS weather data are measured 10 minutes off the end of the nominal hour. Heating Degree Days (HDDs) are in local time.
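To illustrate the OADate problem: the serial is just a count of days since 1899-12-30 with no timezone attached, so the reader has to guess the offset. A minimal sketch, with hypothetical helper names, and ignoring the odd behaviour of serials before 1900:

```python
from datetime import datetime, timedelta, timezone

OA_EPOCH = datetime(1899, 12, 30)  # day zero of the OLE Automation date serial

def oadate_to_naive(serial):
    """Convert an OADate serial (days since 1899-12-30) to a naive datetime."""
    return OA_EPOCH + timedelta(days=serial)

def local_to_utc(naive, utc_offset_hours):
    """Attach an explicit UTC offset to a naive local time, then convert to UTC."""
    local = naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)

# 01:30 local on 2019-10-27 happened twice in the UK (clocks went back at 02:00 BST).
ambiguous = oadate_to_naive(43765 + 1.5 / 24)
print(ambiguous.isoformat())
print(local_to_utc(ambiguous, 1).isoformat())  # if the value was recorded in BST
print(local_to_utc(ambiguous, 0).isoformat())  # if the value was recorded in GMT
```

The same serial maps to two different instants an hour apart, which is why publishing UTC (or an explicit offset) avoids the guesswork.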
Reference data may be very hard, such as weather-correcting gas consumption, anonymised using method X, calibration data. May be hard to standardise, especially for methodologies. When in the hour is an hourly measurement taken, is it weighted from multiple samples, etc... ?
What basic things could we do better in future so that new people can spend time earning new, different and more interesting scars?
Have a person point of contact. There to make the data more understandable, over time, or maybe even via a community metadata wiki? All responses should be semi-automatically published, eg like Stack Overflow. A data-skinned version? See openmod.
Have some Energy Data Taskforce standards that the Data Provision Officer above can talk to regulators, data users, etc, about. "We are working towards standard XXX?"
Cannot be too onerous as it has to work for the entire continuum from major energy company down through SMEs to academics and individuals. (Where will innovation come from?)
The Energy Data Taskforce should publish a data best-practice tutorial 101, plus follow-ons, on the Web site. All staff responsible for creating data should read this first. Also provide good examples.
Have a Good Data Guide and Good Data Awards and make doing it right a positive thing. Carrot, not just stick.
Comments From the Floor after Discussion
Why not have an independent Energy Data Trust that interacts with the regulator and industry, but is not them, and is the focus for data (and metadata) requests and standardisation, for example? Maybe it could be a repository for some of the data.
To make use of the data, Dublin Core is maybe too high level.
A "find similar data" tool would be useful, like Google image search against an existing image.
Clear use cases for developing common terms of a taxonomy. Citations and what data is used for.
"XML bad, JSON good."
Organisations will generally commercially do the minimum to be compliant for cost / time reasons. Some data sources are currently forbidden from being Data-as-a-Service providers.
Could there be an Energy Data Taskforce stamp/tag for data sets? Maybe doesn't need regulation. Making data better shouldn't be made a burden, nor be seen as one. Good news stories, shiny award ceremonies!
Quite a stretch to align the interests of regulator and industry (for example), and will need some leadership.
Good test: if I looked at this data in 10 years' time would I understand it? Data dependencies and links are important too.