Text Analysis
Overview
Text mining includes a number of applications, from tracking textual reuse and the fluctuation of certain words or themes over time to stylometry and the modeling of literary forms. The texts under analysis might represent a single author’s oeuvre, a periodical’s full print run, or a collection spanning multiple centuries. Curating a dataset for text analysis often entails digitization and optical character recognition (OCR), the process of converting images of text into searchable text.
Methods & Tools
Out-of-the-box tools that don’t require any custom programming include Voyant, DataBasic, AntConc, Topic Modeling Tool, JSTOR Labs’s Text Analyzer, and Wordle.
Tools that require minimal command line usage include Stanford’s CoreNLP (which can identify people, locations, and even sentiment in unstructured text) and Andrew McCallum’s MALLET (which generates topic models).
Platforms that require some programming include Bookworm (for tracking a word’s frequency over time) and Andrew Goldstone’s dfrtopics (for visualizing topic models).
Popular packages that require more intensive levels of programming include the Natural Language Toolkit, Stanford’s Stanza, and the Python libraries datasketch, NumPy, SciPy, and scikit-learn.
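To give a sense of what the programming end of this spectrum looks like, here is a minimal sketch of one task mentioned above: tracking a word's frequency over time. It uses only the Python standard library and is not how Bookworm or any of the listed tools work internally. The function names, the simple regular-expression tokenizer, and the toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency_by_year(documents, word):
    """Return {year: occurrences of `word` / total tokens published that year}.

    `documents` is an iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)
    for year, text in documents:
        counts[year].update(tokenize(text))

    word = word.lower()
    result = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        if total > 0:  # skip years whose texts contained no tokens
            result[year] = counts[year][word] / total
    return result


# Hypothetical toy corpus, invented purely for illustration.
corpus = [
    (1850, "The railway changed the town."),
    (1850, "A quiet town by the sea."),
    (1900, "The railway and the telegraph joined every town and city."),
]

frequencies = relative_frequency_by_year(corpus, "railway")
for year, value in frequencies.items():
    print(year, round(value, 3))
```

Dividing by the total number of tokens per year, rather than reporting raw counts, keeps years with more surviving text from dominating the trend. A real project would likely need a more careful tokenizer and some handling of OCR errors.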
Datasets
Alan Liu’s curated collections
McGill .txtLab’s curated collections
Licensed data (includes a lot of historical newspapers)
Recommended Readings & Tutorials
Stanford Literary Lab Pamphlets
“Seven Ways Humanists are Using Computers to Understand Text” by Ted Underwood
“Introduction to Named Entity Recognition” tutorial by W.J.B. Mattingly
The Programming Historian lessons on web scraping through text analysis
How do I get started?
If you're new to digital humanities and are interested in starting a project, stop by the Franke Family Digital Humanities Laboratory in Sterling Memorial Library during our Office Hours.
We also highly recommend looking at our Project Planning and Design Toolkit to learn about the steps involved in a typical project life cycle. In addition to projects at Yale, please check out projects at other digital humanities centers, including:
- Stanford's Literary Lab
- Northeastern's NULab for Maps, Texts, and Networks
- Maryland's Institute for Technology in the Humanities
- DHCommons Projects
Resources
Along with providing consultations during our weekly Office Hours, the Digital Humanities Lab offers a number of awards to support digital humanities research.
In addition to on-campus support, there are also off-campus and online resources that you might try. The following programs all offer opportunities for researchers to learn different digital humanities methods and theoretical approaches:
What we offer