Text mining encompasses a range of applications, from tracking textual reuse and the fluctuation of particular words or themes over time to stylometry and the modeling of literary forms. The texts under study might represent a single author’s oeuvre, a periodical’s full print run, or a collection spanning multiple centuries. Central to any textual analysis project is the curation of a dataset, which often entails digitization and optical character recognition (OCR), the process of turning words in images into searchable text.
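Tracking a word’s fluctuation over time, as described above, can be sketched in a few lines of Python using only the standard library. The corpus below is invented purely for illustration:

```python
from collections import Counter

# Hypothetical corpus mapping year -> text (toy data, not a real dataset)
corpus = {
    1850: "the whale the sea the ship",
    1900: "the city the factory steam",
    1950: "the atom the city the screen",
}

def term_frequency(text, term):
    """Relative frequency of `term` among the whitespace tokens of `text`."""
    tokens = text.lower().split()
    return Counter(tokens)[term] / len(tokens)

# Track how often "city" appears in each year's text
trend = {year: term_frequency(text, "city") for year, text in corpus.items()}
```

A real project would of course add tokenization, normalization, and a much larger corpus, but the core operation — counting and comparing term frequencies across time slices — is the same.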
Methods & Tools
Tools that require minimal command-line usage include Stanford’s CoreNLP (which can identify people, locations, and even sentiment in unstructured text) and Andrew McCallum’s MALLET (which generates topic models).
Popular packages that require more intensive programming include the Natural Language Toolkit (NLTK) and other Python libraries such as datasketch, NumPy, SciPy, and scikit-learn.
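As a rough illustration of what these libraries make possible, here is a minimal topic-modeling sketch using scikit-learn. The documents are toy examples invented for illustration, and scikit-learn’s variational LDA differs from MALLET’s Gibbs-sampling implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two documents about seafaring, two about politics
docs = [
    "whale ship sea voyage harpoon",
    "ship sea captain voyage storm",
    "parliament vote election law senate",
    "election law vote campaign senate",
]

# Build a document-term matrix, then fit a two-topic LDA model
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a document's distribution over the two topics (rows sum to 1)
doc_topics = lda.fit_transform(dtm)
```

On a real corpus, the interesting output is the list of high-probability words per topic (via `lda.components_`), which researchers then interpret as themes.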
- “Seven Ways Humanists are Using Computers to Understand Text” by Ted Underwood
How do I get started?
If you're new to digital humanities and are interested in starting a project, stop by the Digital Humanities Lab in Sterling Memorial Library, room 316, during our Tuesday or Wednesday Office Hours.
We also highly recommend looking at existing digital humanities projects to get a sense of what's possible. In addition to projects at Yale, we recommend checking out projects at other digital humanities centers, including:
- Stanford's Literary Lab
- Northeastern's NULab for Texts, Maps, and Networks
- Maryland's Institute for Technology in the Humanities
- DHCommons Projects
In addition to on-campus support, there are off-campus and online resources you might try. The following programs all offer opportunities for researchers to learn different digital humanities methods and theoretical approaches: