Tesseract was used to perform Optical Character Recognition (OCR), converting the original images
to searchable, mineable text files. Tesseract is an open-source OCR engine originally developed at
Hewlett-Packard Laboratories Bristol and now developed and maintained by its community on GitHub.
Tesseract was trained for this specific corpus by Quyen Ha.
Text mining processes such as topic modeling, word frequency analysis, correlation analysis, and
key-word-in-context analysis were carried out in R. The code was written by Crystal Hall, adapted
from code provided by Matt Jockers in "Text Analysis with R for Students of Literature"
(Springer, 2014), and modified by Quyen Ha.
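
To illustrate two of the analyses named above, here is a minimal sketch of word frequency counting and key-word-in-context (KWIC) lookup. It is written in Python for brevity and is not the project's R code; the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Made-up sample text, not taken from the corpus.
sample = "The star moved. The comet moved faster than the star."
tokens = tokenize(sample)
freqs = word_frequencies(tokens)
concordance = kwic(tokens, "star", window=2)
```

Here `freqs` maps each word to its count, and `concordance` lists every occurrence of "star" with up to two words of context on either side. A real analysis would typically also handle punctuation, stop words, and spelling variants more carefully.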
The interactive topic modeling visualization was created with LDAvis. LDAvis is a set of tools for
creating an interactive web-based visualization of a topic model that has been fit to a corpus of
text data using Latent Dirichlet Allocation (LDA). Given the estimated parameters of the topic
model, it computes various summary statistics as input to an interactive visualization built with
D3.js that is accessed via a browser. LDAvis was created by Carson Sievert and Kenneth E. Shirley.
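
One such summary statistic is the term "relevance" described by Sievert and Shirley, which blends a term's probability within a topic with its lift over the corpus-wide probability: relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The sketch below is a plain-Python illustration of that formula, not LDAvis's own code; the function names and the toy probabilities are assumptions made for this example.

```python
import math


def term_relevance(p_w_given_t, p_w, lam=0.6):
    """Relevance of a term for a topic: lam * log p(w|t) + (1 - lam) * log lift."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


def rank_terms(topic_term_probs, marginal_probs, lam=0.6):
    """Sort a topic's terms from most to least relevant (hypothetical helper)."""
    scores = {
        term: term_relevance(prob, marginal_probs[term], lam)
        for term, prob in topic_term_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy numbers, invented for illustration only.
topic = {"comet": 0.30, "the": 0.50, "orbit": 0.20}     # p(w | t)
marginal = {"comet": 0.05, "the": 0.60, "orbit": 0.02}  # p(w) across the corpus

ranked_by_probability = rank_terms(topic, marginal, lam=1.0)
ranked_by_lift = rank_terms(topic, marginal, lam=0.0)
```

With λ = 1 the ranking follows raw within-topic probability, so a common word such as "the" comes first; with λ = 0 it follows lift, which favors terms that are rare elsewhere in the corpus. Intermediate values of λ trade off the two.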
LDAvis builds its visualization with the D3 library (version 3.0). D3.js is a JavaScript library
for producing dynamic, interactive data visualizations in web browsers. It was developed by Mike
Bostock, Jason Davies, Jeffrey Heer, Vadim Ogievetsky, and others.
This website's theme was created using Bootstrap. Bootstrap is a free and open-source front-end
library for designing websites and web applications. It contains HTML- and CSS-based design
templates for typography, forms, buttons, navigation, and other interface components, as well as
optional JavaScript extensions.