Digitising health statistics

Christy Henshaw
Published in Stacks
Jun 27, 2018

Wellcome Collection has the UK’s most comprehensive set of Medical Officer of Health (MOH) reports that chart the health of Britain’s population from 1848 to 1978. Medical Officers of Health were medical doctors appointed by local councils with the aim of gathering data and sharing findings with their local government officers. The reports contain narrative text as well as tables describing and charting the health of their respective districts including birth/death statistics, disease occurrences, food inspections, sanitary conditions, and much more.

Medical Officer of Health reports, Wellcome Collection

70,000 reports have now been digitised and made freely available online, creating a rich resource for public health and related research. We estimate this collection will include between 2 and 3 million tables of data. We expect to complete digitisation by autumn of 2018.

The reports are already heavily used but digitisation greatly increases their availability. Full-text searching means you can find information about food inspections, or school health provisions, or whether parrots were considered something to worry about in the nation’s ports. The statistical data is also searchable, so you can find images with tables that show mortality data for a specific parish, or discover how many reports include statistics on occurrences of a certain disease in a particular decade.

The current digitisation effort follows on from a pilot project, part-funded by Jisc, to digitise and OCR all our London reports back in 2012/13. We outsourced the digitisation of 5,500 reports and extraction of 270,000 tables of data. The supplier carried out manual quality control and correction of the OCR and markup. For example, tables that were not identified by the software (Abbyy FineReader Engine 11), or were mis-identified or wrongly laid out, were manually corrected. This resulted in a highly accurate data set with markup we could use for rendering tables online on our dedicated site, London’s Pulse. This level of manual intervention was expensive and also prone to certain issues, such as XML validation errors, that are unlikely with fully automated processes. It is therefore not feasible to continue with this approach for the rest of the collection.

Example of a table from a London MOH report as rendered online.

Now that we’re digitising the remaining 70,000 reports, we need to consider our options for providing access to data in millions of tables. Although we can provide access to tables as downloadable XML files, CSV files and HTML text via the London’s Pulse site, this does not scale to the entire collection. Our main discovery interface for Library collections, which includes a full-text search via Encore, cannot currently support this type of access at all.

We have started to explore this challenge by commissioning three data researchers based at The University of Salford’s Pattern Recognition and Image Analysis Research Lab (PRImA) to investigate the opportunities we have to support digital humanities and medical research through retrieval, integration and dissemination of statistical data captured via digitisation and OCR.

The Brief:

  • Scope the data available in the reports and report on key characteristics
  • Provide an overview of who else is doing similar work
  • Find out what the users want
  • SWOT analysis
  • Provide some recommendations for a minimum viable product and how we might build on that in future

The final report, authored by Justin Hayes, Christian Clausner and Apostolos Antonacopoulos, was delivered in June 2018 accompanied by a sample data set and results of a survey of researchers. Their work was based on methodologies and software developed in a number of digitisation projects. This post summarises some of the work and key results, and the full report can be accessed on GitHub.

Scoping the data

In order to understand the data contained in the collection the authors used our existing XML documents from the London reports as a sample set. They ran a textual similarity analysis to identify common patterns that could be used to classify tables by topic area based on an emergent ontology. For example, demographic information can be identified by words such as “age”, “sex”, “birth”, “death”.
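As a rough illustration of classifying tables by topic from their wording, here is a minimal sketch. It is a plain keyword lookup, not the textual similarity analysis the authors ran. Only the demographic words come from the example above; the `food_inspection` vocabulary, the `classify_table` function and the `min_hits` threshold are assumptions made for illustration.

```python
import re

# Hypothetical topic vocabulary. Only the "demographics" words are taken
# from the post; the "food_inspection" words are invented for illustration.
TOPIC_KEYWORDS = {
    "demographics": {"age", "sex", "birth", "death"},
    "food_inspection": {"food", "meat", "milk", "inspection"},
}


def classify_table(text, min_hits=2):
    """Return the topic with the most distinct keyword matches, or None.

    A table is only assigned a topic if at least `min_hits` of that
    topic's keywords appear in its text.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Naive plural handling so that "births" also counts as "birth".
    words |= {w[:-1] for w in words if w.endswith("s")}
    scores = {topic: len(words & kws) for topic, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


print(classify_table("Births and deaths by age and sex"))
print(classify_table("Rainfall by month"))
```

A real ontology would need far richer vocabularies and would have to cope with OCR misspellings, which this sketch ignores.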

As part of our need to look at non-manual processing, they carried out an accuracy test of complex table markup, comparing the raw OCR with the corrected XML. They found only a small reduction in accuracy when using the raw OCR, which indicates that expensive manual correction may not be warranted across the board. However, this was a limited test, and we will need to investigate this further.
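The post does not say how accuracy was measured. One simple possibility, sketched here purely as an assumption, is the share of cells in the manually corrected table that the raw OCR reproduces exactly in the same position. The function name `cell_accuracy` and the sample values are hypothetical.

```python
def cell_accuracy(raw, corrected):
    """Fraction of cells in `corrected` matched exactly by `raw`.

    Both arguments are lists of rows, each row a list of cell strings.
    Cells are compared position by position after trimming whitespace;
    a cell missing from `raw` counts as a mismatch.
    """
    total = 0
    matches = 0
    for r, corrected_row in enumerate(corrected):
        for c, expected in enumerate(corrected_row):
            total += 1
            if (
                r < len(raw)
                and c < len(raw[r])
                and raw[r][c].strip() == expected.strip()
            ):
                matches += 1
    # An empty reference table is treated as trivially accurate.
    return matches / total if total else 1.0


# Invented example: the OCR has read "14" as "l4" in one cell.
corrected = [["Age", "Deaths"], ["0-1", "14"]]
raw = [["Age", "Deaths"], ["0-1", "l4"]]
print(cell_accuracy(raw, corrected))  # 0.75
```

A positional comparison like this is strict: a single missed column shifts every later cell, so a real evaluation would probably need some form of alignment first.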

Analysis of table layouts, example from the report, p.13.

What are other people doing?

There are a few comparable resources with varying levels of access and availability of data, including Statistics Sweden, Histpop, Digital Reich Statistics, and Digitised Collections on Stats New Zealand. London’s Pulse is comparatively feature-rich, but there is more we can learn from others’ examples.

User survey

We received 15 responses to our survey, mainly from medical data researchers. The results showed a clear preference for 4 topics that are generally covered by even the briefest MOH reports: basic demographics, mortality/causes of death, ailments, and fertility. 10 users were interested in using the quantitative data in our reports, and they thought that combining the data into an integrated resource would make this easier. Although the user pool is small, it gives us some insight into what data researchers would find valuable. You can see a PDF version of the entire survey form on GitHub.

Medical Officer of Health reports in the stacks. Credit: C. Henshaw.

Working with the data

Once the tables can be classified and defined, we can extract the data from the text files. Accuracy is paramount here: there needs to be a high degree of trust in the results, even if the images are available for comparison. There are options for automatically validating the accuracy of the data, such as within-table summations (e.g. checking that the values in a row or column add up to the stated total). The trick will be refining the recognition and automated quality control process to narrow down the scope that may require manual correction.

With the data identified, extracted and described there is potential to provide researchers with an integrated data resource. This would require a technological solution (a suitable platform), and a suitable data model that can be standardised but also describe the full range of data.

We would need to consult domain specialists to ensure we select and develop ontologies that fit their needs. Due to the variable nature of the reports over time, over distance, and between individuals’ reporting habits (there being little standardisation of the reports across the board), developing the data models is a big task. Understanding what topics are of most interest to users would allow us to prioritise our efforts, e.g. starting with cause of death rather than food inspections or sanitary conditions.

An example of what that combined data might look like as an Excel document is available on GitHub.

Recommendations

A guiding principle for the recommendations is that we want to understand what a minimum viable product (MVP) might look like, based on the data at hand and what we know so far of the user needs. From that point, we would ideally continue to develop the resource in iterative stages that allow us to release useful functionality as and when it is available.

MVP is about discovery. Users need to be able to find the tables they need from across the entire collection. Using a high-level ontology to classify the different types of tables would improve search and browse (e.g. you could do a search for tuberculosis only in cause-of-death tables, because you’re only interested in how many people died of tuberculosis). The key challenge here is analysing the raw OCR for the non-London reports to ensure we have identified all the tables in the collection.
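To make the tuberculosis example concrete, here is a minimal sketch of a search restricted to one class of table. The `Table` record, its field names, the topic labels and the sample records are all hypothetical; a real discovery interface would sit on a search index rather than filter a list in memory.

```python
from dataclasses import dataclass


@dataclass
class Table:
    report_id: str  # hypothetical identifier of the source report
    topic: str      # high-level class assigned to the table
    text: str       # OCR text of the table


def search_tables(tables, term, topic=None):
    """Return tables whose text contains `term`, optionally limited to one topic."""
    term = term.lower()
    return [
        t
        for t in tables
        if term in t.text.lower() and (topic is None or t.topic == topic)
    ]


# Invented sample records for illustration only.
tables = [
    Table("report-a", "cause_of_death", "Deaths from Tuberculosis, by age"),
    Table("report-b", "food_inspection", "Milk samples tested for tuberculosis"),
    Table("report-c", "cause_of_death", "Deaths from measles"),
]

hits = search_tables(tables, "tuberculosis", topic="cause_of_death")
print([t.report_id for t in hits])  # ['report-a']
```

Without the topic filter the same query would also return the milk inspection table, which is the kind of noise a high-level classification is meant to remove.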

Stage 2 is about providing downloadable tables. This is already possible for the London reports where data has already been extracted. For the non-London reports, we need to find a scalable and affordable process to accurately recreate the table data and structure in usable formats such as XML and CSV. This stage will require considerable experimentation and prototyping to automate the quality control process as far as possible. We will need to work out how much manual work is required, and whether we can reduce it to an affordable level. If we can achieve this, the entire collection would be on a par in terms of data access.

Stage 3 is about providing access to combined data. This can only happen once Stage 2 has been carried out for the collections, but it is possible we could start to prototype this using the London report data before extracting tables from the rest of the collection. However, unless we are sure we CAN feasibly extract the table data from the entire collection, there is a risk we would waste time working on a solution that would only be applicable to 10% of the collection.

‘Hints from the health department’
Wellcome Collection, SA/SMO/R/4/1–19

Next steps

We will complete digitisation of the UK-wide MOH collection by September/October 2018, and will digitise a small collection of reports from former British colonies by the end of 2018. Once we have catalogued and digitised every single report we own, we can start to identify gaps in our holdings and look for opportunities to fill those gaps by digitising reports held elsewhere.

Based on the work done by Justin, Christian and Apostolos we can start to think more about what further information we need to gather (such as learning from current work on research data management), what further data analysis we need to do (such as more testing around accuracy of OCR results), and how we can engage users to ensure we are able to address their needs appropriately throughout the process.

Handy links:

Hitlist of MOH reports on the Wellcome Library website.

MOH reports on the Internet Archive (does not include London reports) https://archive.org/details/medicalofficerofhealthreports

London’s Pulse http://wellcomelibrary.org/moh/.
