What is Corpus Creation?
Corpus creation is the process of building a dataset. For a digital humanities project, this often entails either finding a collection of texts or images online or digitizing physical holdings.
Tools for cleaning and processing texts
You have a digitized corpus. Now what? The answer depends on what your data looks like and what your research questions are. A few possibilities:
- For most typescript material, you could run it through optical character recognition (OCR) software. OCR turns texts from images into searchable, machine-actionable text. One of the leading tools in this area is ABBYY FineReader. While not free software, ABBYY is available on two of the computers in the Digital Humanities Lab. Depending on the quality of the scan and the type of document you’re working with, some hand cleaning after OCR may be necessary (for example, ‘l’ may be misidentified as ‘1’).
- For handwritten material, humans are still the best transcribers. Depending on the size and content of your corpus, crowdsourcing could be an option.
- For messy tabular data, you could use OpenRefine, a free tool for cleaning and transforming textual data.
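As a rough illustration of the hand-cleaning step mentioned above, the following Python sketch lists tokens that mix letters and digits, which is one common symptom of OCR confusing ‘l’ with ‘1’. The function name, the pattern, and the sample text are all hypothetical and are not part of ABBYY FineReader or OpenRefine. The sketch only flags candidates for a human to review; it will also flag legitimate tokens such as "2nd", so it should not be used to correct text automatically.

```python
import re

# Assumption: a token containing both a letter and a digit (e.g. "wor1d")
# is worth a second look. Purely numeric tokens such as years are ignored.
MIXED_TOKEN = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def flag_suspect_tokens(text):
    """Return (line_number, token) pairs that a person should check by hand."""
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in MIXED_TOKEN.finditer(line):
            suspects.append((line_number, match.group()))
    return suspects


# Hypothetical OCR output, invented for this example.
sample = "The wor1d is large.\nPublished in 1925 by he11o press."

for line_number, token in flag_suspect_tokens(sample):
    print(line_number, token)
```

Reporting line numbers rather than rewriting the text keeps the decision with the person doing the cleaning, which matters when a corpus contains real alphanumeric strings such as shelf marks or ordinals.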
How can the DHLab help?
Digital Humanities Lab staff can advise on strategies for building and cleaning your corpus during our weekly Office Hours. We also regularly offer workshops that are relevant to corpus creation. Visit our Workshops page to learn about what’s coming up and our GitHub page for tutorials from past sessions. For information on the use of our scanners for corpus creation, as well as information on databases that are already available for text and data mining, please visit our Data Resources page.
How do I get started?
If you're new to digital humanities and are interested in starting a project, stop by the Franke Family Digital Humanities Laboratory in Sterling Memorial Library during our Office Hours.
We also highly recommend looking at our Project Planning and Design Toolkit to learn about the steps involved in a typical project life cycle. In addition to projects at Yale, please check out projects at other digital humanities centers, including:
- Stanford's Literary Lab
- Northeastern's NULab for Maps, Texts, and Networks
- Maryland's Institute for Technology in the Humanities
- DHCommons Projects
In addition to on-campus support, there are also off-campus and online resources that you might try. The following programs all offer opportunities for researchers to learn different digital humanities methods and theoretical approaches: