- Introduction
- Historical Manuscript Transcription, XML Markup, and Data Entry
- Other Data Services
- Quality assurance and control
- Prices and timescales
- Examples of my work
Introduction
I offer high quality historical manuscript transcription services at volumes and prices appropriate for large academic projects. I have over 16 years' experience of professional manuscript transcription work. Some of this work has been published at British History Online. I specialize in transcription of English language documents from the 16th century onwards, and can also extract data from some formulaic Latin documents. See the headings below for more details of services and prices, and published examples of my work. I can give free advice and estimates to help with project planning and funding applications, with no obligation to use my services if the application is successful. I am based in the UK but often work for overseas clients, especially in the US. Most of my work is for organizations, but I can also work for private individuals.
Historical Manuscript Transcription, XML Markup, and Data Entry
I can deliver very accurate full text transcripts of historical manuscripts according to any transcription conventions you specify. I can usually expand abbreviations if required. Transcripts can be delivered as plain text, word processor files, or XML. Markup can include the structure of the text, named entities, and dates. I am familiar with TEI P5 and Oxygen XML Editor.
Marking up names during first-pass transcription is easy and adds very little time. Marking them up manually later is more labour-intensive, and Named Entity Recognition software is likely to be less accurate than a skilled and experienced transcriber. Even in a plain text transcript, names can be marked up with XML tags or simple markup like wikilinks.
I can also enter structured data into a spreadsheet, database, or XML file. This is easiest to do if the original document is already structured, but I can also extract structured data from unstructured or semi-structured documents.
You will have to supply digital images of the pages to be transcribed, and get copyright clearance if necessary. High quality images are easier to use, but I can deal with whatever you've got, even if it's too difficult for HTR/OCR software or unskilled double keyers.
Other Data Services
I can also offer these services:
- checking and correcting existing transcripts.
- modernizing transcripts of early-modern English documents.
- cleaning or wrangling structured data.
- editing geodata in QGIS.
- writing XSL style sheets to transform XML into other formats.
Quality assurance and control
All transcribers inevitably make errors, but I use the following methods to reduce the risk of errors during transcription and to find and correct them afterwards:
- using my experience of transcription and historical research to read letter forms, words, and abbreviations accurately, and understand the structure and purpose of documents. This helps me to avoid reading errors that unskilled transcribers would make, and to recognise possible errors during later checks.
- positioning windows and sizing text to reduce the risk of missing words or lines because of eye skip, which can be very difficult and expensive to track down afterwards. I am able to deal with documents that have very long lines, such as bills and answers in early-modern court cases.
- working carefully and not going too fast. This reduces keying errors. Shannon's communication theory shows that both speed and redundancy affect the accuracy of transmission. Advocates of double keying rely on redundancy, but I compensate for the lack of redundancy by reducing speed. Be suspicious of any transcriber who quotes a high words-per-minute speed.
- a second pass to deal with difficult words which have been flagged in the first pass. This is more efficient than spending too long on a word at the first pass. The experience of transcribing the rest of the text often makes difficult words easier.
- text mining: using a script to construct a list of unique words, which can be compared with a dictionary if modern spellings are used, or browsed manually if non-standard early-modern spellings are used. This is the most efficient way to trap words that are likely to have been mistyped. Suspect words will be checked against the document images if necessary, but some obvious typos can be corrected without checking the images.
- smooth reading: reading the whole text without skimming words, but without carefully examining the spelling of words. This is the most efficient way to find dictionary words that don't make sense in context. My experience of historical documents helps me to notice words that could be wrong. Suspect words will be checked against the document images. For very large amounts of text, this check may be reduced to a sample or dropped altogether, as it adds about 10% to the transcription time.
- structured data will be checked using facets and filters in OpenRefine.
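The unique-word check described above can be sketched in a few lines. This is an illustrative outline, not my actual script, and the small dictionary here stands in for a real word list:

```python
import re
from collections import Counter

def unique_words(text: str) -> Counter:
    """Count every distinct word form in the transcript."""
    return Counter(re.findall(r"[a-zA-Z]+", text.lower()))

def suspects(text: str, dictionary: set[str]) -> list[tuple[str, int]]:
    """Words not found in the dictionary, rarest first: the likeliest typos."""
    counts = unique_words(text)
    return sorted(
        ((w, n) for w, n in counts.items() if w not in dictionary),
        key=lambda item: item[1],
    )

sample = "The sayd partie appeared and the sayd partie answeered"
print(suspects(sample, {"the", "said", "party", "appeared", "and", "answered"}))
```

With early-modern spellings, the dictionary comparison is replaced by browsing the sorted list manually: genuine period spellings like "sayd" recur, while one-off typos stand out near the top.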
I do not use double keying, because it always suffers from at least one of these problems:
- Expensive: paying a fair wage to two experts to transcribe the same text, and a third expert to reconcile the differences, will more than double the cost for less than a 1% increase in accuracy. Few projects can justify this cost in their funding bids.
- Exploitative: reducing the cost of double keying to compete with my single keying service depends on exploiting low-paid workers in the Far East. Projects that use outsourced double keying are unethical. Many universities now have supply chain policies that would not allow this method.
- Inaccurate: low-paid workers who lack the experience and language skills to understand historical documents, and who are made to work as fast as possible to save money, are likely to make many errors. Inexperienced transcribers are likely to make the same errors as each other, and double keying has no way of trapping these errors. Genealogy paysites are notorious for their poor quality transcripts. University procurement policies usually require high quality and will not automatically favour the lowest price.
Trying to fix one of these problems will inevitably make one of the others worse. Double keying is fundamentally flawed and should be avoided.
Checking existing transcripts
If you already have transcripts, I can help to improve their quality by checking for errors and making corrections if necessary. This may be especially helpful if the transcripts are intended for academic publication but there is too much text for you to check it yourself. I can use any of these methods, depending on what is most appropriate:
- text mining: using a script to construct a list of unique words, which can be compared with a dictionary if modern spellings are used, or browsed manually if non-standard early-modern spellings are used. This is the most efficient way to trap words that are likely to have been mistyped. Suspect words will be checked against the document images if necessary, but some obvious typos can be corrected without checking the images. This method is not useful for HTR output because HTR software is not expected to make any keying errors.
- smooth reading: reading the whole text without skimming words, but without carefully examining the spelling of words. This is the most efficient way to find dictionary words that don't make sense in context. My experience of historical documents helps me to notice words that could be wrong. Suspect words will be checked against the document images. This typically takes 10% of the time of manual transcription when used on my own transcripts, which have very few errors by this stage, but may take longer if there are large numbers of errors to be corrected.
- targeted A-B checks: checking all instances of high value data, such as names or dates, against document images. If the data to be checked has not been marked up, this method may have to be combined with smooth reading.
- full A-B checks: checking every word against document images. This is very labour-intensive and is likely to take half the time of manual transcription.
Prices and timescales
I usually charge an hourly rate, which will be fixed before the work commences. The rate will be negotiated according to the size of the job, the amount of skill required, and your project's budget. I am more likely to accept a lower rate for a larger contract that guarantees work for longer. Small jobs for private individuals will be charged at higher rates. I may quote a fixed price for some very small jobs. Before commencement, we must have a project plan in place that specifies what is to be delivered by when, and a maximum number of hours to be worked. I will work until I have completed the agreed tasks or the maximum hours, whichever is sooner.
Prices are subject to change in future because of inflation and exchange rates, but once we have agreed an hourly rate in a contract, it will not change. The contract and invoice will state the price in your own currency, so you will not be at risk from changes in the exchange rate. Typical current hourly rates are shown below:
| Currency | Minimum | Maximum |
|---|---|---|
| GBP | £18 | £28 |
| USD | $25 | $38 |
| AUD | $38 | $59 |
| EUR | €21 | €33 |
For text that I have transcribed myself, text mining and smooth reading will be included free of charge because these are effectively finding and correcting errors that I have already made.
I expect to average around 1,000 words per hour including the second pass, although this will vary according to the difficulty of the handwriting and quality of images. If I find the documents particularly easy to read, I may get up to 1,200 words per hour. I can commit to up to 25 hours per week. For documents with 250-300 words per page, this means roughly 3.5-4 pages per hour, or 90-100 pages per week. I could transcribe 1 million words in 1 year or less, which would cost around £20,000.
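As a sanity check on the figures above, the arithmetic works out as follows. The hourly rate here is illustrative, taken from the middle of the GBP range in the table:

```python
# Throughput and cost figures quoted above.
words_per_hour = 1_000
hours_per_week = 25
words_per_page = 250   # lower end of the 250-300 range
rate_gbp = 20          # illustrative rate within the GBP range

pages_per_hour = words_per_hour / words_per_page
pages_per_week = pages_per_hour * hours_per_week
weeks_for_million = 1_000_000 / (words_per_hour * hours_per_week)
cost_for_million = (1_000_000 / words_per_hour) * rate_gbp

print(pages_per_hour, pages_per_week, weeks_for_million, cost_for_million)
```

At 25,000 words per week, a million words takes 40 working weeks, comfortably within a year.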
Modernizing an existing transcript can easily take as long as creating a new transcript, even if many words can be replaced automatically.
Checks on existing transcripts supplied by clients will be charged at a similar hourly rate to transcription but will take less time. Typical timescales would be:
- text mining: this is a quick and efficient method but the time taken is difficult to predict in advance as it depends on the number of unique words and the number of suspected errors. There are economies of scale, so the cost of text mining will not increase in proportion to the total word count. The error rate and variations in spelling have much greater effects on how much time is needed.
- smooth reading: typically takes 10% of the time of manual transcription when used on my own transcripts, which have very few errors by this stage, but may take longer if there are large numbers of errors to be corrected.
- targeted A-B checks: varies according to how many facts need to be checked and how easy it is to find them. If the data to be checked has not been marked up, this method may have to be combined with smooth reading and will take more than 10% of the time of manual transcription.
- full A-B checks: usually takes half the time of manual transcription.
Examples of my work
The identities of my clients and the work I do for them are kept confidential by default, but these clients have chosen to credit me on their websites or social media. These examples show that I am capable of producing high quality work suitable for academic research and publication within the budgets of AHRC and ESRC grants.
- 'The Power of Petitioning in Seventeenth-Century England' (Birkbeck University of London and University College London): transcription and basic XML markup of 2,200 pages of petitions (c. 600,000 words). The petitions that I transcribed for this project have been published at British History Online.
- Corpus Synodalium: a database of medieval church statutes compiled by Professor Rowan Dorin. I contributed around 991,000 words transcribed from printed Latin texts that were too difficult for OCR.
- '1624 Parliament project' (History of Parliament Trust): I transcribed the diary of Richard Dyott MP, which was published at British History Online. This was especially difficult work because the original manuscript is water damaged.
- 'Life in the Suburbs' (Centre for Metropolitan History and Cambridge Population Group): I transcribed St Botolph Aldgate burial registers from the 1580s to 1710s into a database which has since been published at SAS-Space and as part of London Lives. I also calendared indemnity bonds (partly in Latin) and tax records.