Digital techniques and data
The images of the relevant pages of the Northern Star were run through an Optical Character Recognition program (ABBYY FineReader 12) and the resulting text was checked manually.
We developed a set of Python scripts to extract and geo-code the places of the meetings, using a gazetteer of places, and to parse the meeting dates.
The code and data are freely available on the BL Labs GitHub page: https://github.com/BL-Labs/meetingsparser
A slide show explaining the project and its basic methods is also available:
The historic map used on this website is the 1st edition OS map from 1885, courtesy of the National Library of Scotland. It is therefore a little anachronistic, as many of the places had expanded considerably in the intervening 40 years. We have also overlaid a map of Oxford Street, London, from the British Library collection.
I've geo-coded the exact addresses for meetings in London and Manchester. For the other towns and villages, currently a central geo-location is used for all the meetings, but we hope to update these soon to the exact addresses.
Geo-parsing and dating political meetings
The first stages of coding for the Political Meetings Mapper project were relatively straightforward and didn't involve developing anything substantially new.
We extracted the place names from the texts of the ‘forthcoming meetings’ columns. I compiled a historical gazetteer to geo-parse the places where the meetings were held. The town names obviously still exist, so I just ran the list through an online geo-coder. But many of the individual pubs, halls and some streets of the 1840s no longer exist, so I had to employ my historical skills to find these manually, using historic trade directories and the geo-referenced historic town plans layered on Google Earth to find the correct geo-coordinates. We then used some more Python code to append the correct latitude and longitude to the list of places in the texts.
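The gazetteer lookup step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the CSV layout (columns `place`, `latitude`, `longitude`) and the function names are assumptions.

```python
import csv

def load_gazetteer(path):
    """Load a place -> (latitude, longitude) lookup from a CSV gazetteer.

    Assumes columns named 'place', 'latitude' and 'longitude';
    the real gazetteer's format may differ.
    """
    gazetteer = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            gazetteer[row["place"].lower()] = (
                float(row["latitude"]),
                float(row["longitude"]),
            )
    return gazetteer

def geocode(places, gazetteer):
    """Append co-ordinates to each extracted place name (None if unknown)."""
    return [(place, gazetteer.get(place.lower())) for place in places]
```

Places missing from the gazetteer come back as `None`, which is one way to surface the pubs and halls that need manual research in trade directories.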
Next, we dated the meetings using Python code that worked out the dates from the date of the newspaper (which was always published on a Saturday) and some regular expressions.
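The dating logic can be sketched like this. It is a simplified assumption of the approach described above: a notice mentions a weekday (e.g. "on Monday evening"), the issue date is a Saturday, and the meeting is taken to be the first such weekday after the issue date. The regex and function name are illustrative.

```python
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
DAY_RE = re.compile(r"\bon\s+(" + "|".join(WEEKDAYS) + r")\b", re.IGNORECASE)

def meeting_date(notice_text, issue_date):
    """Resolve 'on <weekday>' to the first such weekday after the issue date."""
    match = DAY_RE.search(notice_text)
    if match is None:
        return None
    target = WEEKDAYS.index(match.group(1).lower())
    # Days until the named weekday; a zero offset means a full week ahead,
    # since the notice announces a forthcoming meeting.
    offset = (target - issue_date.weekday()) % 7 or 7
    return issue_date + timedelta(days=offset)
```

So a notice reading "on Monday evening" in a Saturday issue resolves to the following Monday, two days later.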
Part of the bigger aim of the project was to help the British Library improve its classification of digitised newspaper collections.
So how do we find the ‘forthcoming meetings’ column in the newspapers without looking for it ‘by eye’? Our task, therefore, was to classify columns in the original XML of the newspaper pages.
We built a pilot classifier to do the following steps:
1) Classify the elements making up a meeting report (e.g. the report generally starts with a TOWN NAME in capital letters followed by a full stop and a space, and the second sentence often starts with ‘On [day]…’).
2) Run this test classifier over our ‘clean’, corrected meetings data to see whether it identifies the columns as meeting reports.
3) If successful, run the test through a sample of text that we know does not contain any reports of meetings, to ensure it doesn’t find anything.
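Step 1 can be sketched as a simple pattern test. This is only an illustration of the heuristic described above (capitalised town name, full stop, then "On [day]…"); the actual classifier is more involved and still being refined.

```python
import re

# Opening pattern of a typical meeting report, per the heuristic above:
# "MANCHESTER. On Monday ..." -- an all-capitals town name, a full stop
# and a space, then a sentence beginning "On <weekday>".
REPORT_RE = re.compile(
    r"^[A-Z][A-Z\s\-]+\.\s+"
    r"On\s+(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
)

def looks_like_meeting_report(column_text):
    """Return True if the column text opens like a meeting report."""
    return REPORT_RE.match(column_text) is not None
```

Running this over the hand-corrected meetings data (step 2) and over meeting-free text (step 3) would give a first measure of precision and recall before any refinement.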
We’ve obviously still got to refine the classifier to make it more accurate. Building a classifier requires a lot of trial and error, and patience! If you are interested in helping with the Machine Learning experiment, do get in touch.
Members of the project team:
- Dr Katrina Navickas, Senior Lecturer in History, University of Hertfordshire
- Ben O'Steen, Technical Lead, British Library Labs
- Mahendra Mahey, project manager, British Library Labs
- Further technical and coding assistance provided by John Levin
- OCR correction assistants: Samantha Walkden and Megan Dibble (graduates in History from the University of Hertfordshire)