Data Mining Workshop Series
HathiTrust Workshops, Oct. 6-9
This virtual workshop series will introduce attendees to the tools, data, and services of the HathiTrust Research Center (HTRC). HTRC supports text and data mining projects using HathiTrust’s digital library, which includes 17.3 million items.
Spread over four days, each workshop will address a different aspect of text and data mining using HathiTrust data and HTRC services. Attendees are not required to attend all four workshops and can choose the events that best match their interests and schedules. If you are new to text mining and HathiTrust, we recommend starting with the introductory session.
Librarians who attend all four workshops will be invited to join a cohort of other librarians who are teaching with and about the HTRC. This cohort has access to additional support from HTRC, further training opportunities, and a community of peers who are interested in HTRC.
Individual Workshops
“Introduction to HTRC for Text and Data Mining,” Oct. 6
In this session, we will explore the basics of HathiTrust as a data source and demonstrate how to use HTRC as a resource for text and data mining. The workshop will address the various tools and services of the HTRC, along with options for accessing data from HathiTrust for text mining research. The session will be helpful for those who want a general overview, or who want a solid foundation for the other workshops in the series.
Prerequisites: None.
“HTRC Extracted Features Dataset,” Oct. 7
This session will introduce you to the Extracted Features data model and the kinds of research it enables. HTRC recently released an updated version of the Extracted Features dataset (v.2.0) that includes 17+ million files, with each file representing a volume in the HathiTrust Digital Library. The Extracted Features files contain metadata about the volumes, as well as tokens (words), parts of speech, and their per-page counts. The dataset can be used for text analysis projects whose algorithms take a volume’s words and word counts as input, such as topic modeling or certain kinds of machine learning projects. This session will include a hands-on activity using the dataset.
Prerequisites: either the “Introduction to HTRC for Text and Data Mining” workshop, or some previous experience with HathiTrust or HTRC.
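For a sense of what working with per-page token counts can look like, here is a minimal Python sketch. The dictionary layout and field names (`pages`, `seq`, `tokens`) are illustrative assumptions, not the actual Extracted Features file format, and the sample counts are made up. It sums per-page counts into volume-level word counts, the kind of input a topic model might take.

```python
from collections import Counter

# Hypothetical, simplified stand-in for one volume's file.
# Field names and values are assumptions for illustration only.
volume = {
    "metadata": {"title": "Example Volume"},
    "pages": [
        {"seq": 1, "tokens": {"whale": {"NN": 2}, "the": {"DT": 5}}},
        {"seq": 2, "tokens": {"whale": {"NN": 1}, "sea": {"NN": 3}, "the": {"DT": 4}}},
    ],
}


def volume_token_counts(vol):
    """Sum per-page token counts into one Counter, ignoring part of speech."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokens"].items():
            # Each token maps part-of-speech tags to counts on that page.
            totals[token] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(volume)
print(counts.most_common())
```

Because the counts are kept per page, the same loop could be restricted to a range of pages instead of the whole volume.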
“HTRC Data Capsules Environment,” Oct. 8
This session will introduce you to the HTRC’s capsule environment and how it can be used by intermediate and advanced researchers. An HTRC Data Capsule is a virtual machine with special security settings that allows researchers to access text data from HathiTrust, analyze it using the text and data mining methods of their choice, and then export only the results of their analysis. This session will include a hands-on activity using an HTRC Data Capsule.
Prerequisites: either the “Introduction to HTRC for Text and Data Mining” workshop, or some previous experience with HathiTrust or HTRC.
“Supporting Text and Data Mining from the Library,” Oct. 9
This librarian-only workshop will feature a presentation and discussion of the skills, practices, and challenges of supporting text and data mining as a librarian. It will explore the rapidly changing landscape for digital humanities and digital scholarship, and how librarians can best situate themselves to address the shifting needs of researchers.
Prerequisites: None. Librarians who complete this workshop in addition to the previous three will be invited to join a cohort of trainers who have gone through an HTRC workshop and have access to additional support for teaching with and about the HTRC.
Registration & Logistics
The workshop series is open to all Yale faculty, graduate students, postdoctoral researchers, librarians, and academic staff. To sign up, please fill out this brief registration form.
Each workshop will be held via Zoom and will include a mix of presentation, discussion, and hands-on components. We will use breakout rooms to support hands-on activities. You will not be required to install any software to participate in the workshops.
For questions, please reach out to the Yale Digital Humanities Lab, which is serving as a co-host for this workshop series, alongside the University of Maryland.