Author Archives: Dave Williams

About Dave Williams

Web & Digital Services Librarian at Queens College, CUNY.

A Digital Collection

As part of our ongoing experiments in visualization, I recently rebuilt a personal website using Omeka (or more specifically what they’re now calling “Omeka Classic”), a lightweight web publishing platform for digital collections from George Mason University’s Roy Rosenzweig Center for History and New Media. Previous exposure suggested that, by itself, it could make a decent digital asset management add-on to an existing site due to its metadata support. In this instance I wanted to discover the benefits Omeka offers when treating our corpus of plain text files as an online collection.

As it happens, there are a number of very nice plugins extending the functionality of Omeka, and several are specifically designed for text analysis and visualization. With the file content attached as item type metadata (in this case the textual component), I was able to use the Ngram plugin to analyze term frequency and plot it over time.

Keywords in Context

One of the initial project goals was to provide a means for examining the text using keywords, or more specifically a variation on the tools designed for “keywords in context” (KWIC). This technique involves generating a concordance, a list of the words used within a text along with those words immediately surrounding them, as a means for providing context. Traditionally, generating a concordance was a labor-intensive process reserved for only the most important books, often source texts that serve as the foundation for religion, such as the Judeo-Christian Bible.
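To make the idea concrete, here is a minimal keyword-in-context sketch in Python. It is not the tool the project used: the `kwic` function, its `width` parameter, and the naive tokenizer are all assumptions made for illustration, and the sample sentence is the opening line of the first chapter of the Manifesto.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each match.

    Hypothetical helper: `width` is the number of words kept on each side.
    """
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    keyword = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

sample = ("The history of all hitherto existing society "
          "is the history of class struggles.")

# Print the concordance lines with the keyword aligned in a column.
for left, word, right in kwic(sample, "history", width=2):
    print(f"{left:>12} | {word} | {right}")
```

A real concordance would also need to handle punctuation, hyphenation, and stemming, none of which this sketch attempts.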

As a linguistic tool, examining the collocation of terms and their respective uses across a corpus can provide valuable research insights, and with the advent of digital text analytics applications, generating these annotated corpora became significantly faster and more accessible.

It was our hope that, when applied to our collection of Marxist writings, the results could also have pedagogical value: By searching for a keyword of personal interest, perhaps drawn from current events or an unrelated research interest, a scholar unfamiliar with Marxism could gain insight from how the terms were applied, resulting in something of a “Marxist perspective” on the topic.

Improvements and Interactions

In addition to experimenting with new visualization techniques, Sebastian produced an improved version of the “Top 5 Terms” chart:
[Figure: Revised Top 5 Chart]
Clearer and less visually cluttered, this version makes it easier to distinguish which terms were used most frequently by each author within each document.

Particularly engaging is the 3D interactive force network diagram he produced using the network3D R-to-JavaScript library. It features correlations of 0.4 and higher, and you can interact with the nodes, rotating them to bring a particular work into focus while highlighting the connections between texts. Based on this degree of interdependence, it’s interesting to see Clara Zetkin as an outlier, connected only to Georgi Plekhanov.

Please take a moment to visit and explore the diagram, and let us know what you think.

Network Explorations

Another possibility for visualizing relationships between texts is to consider how similar their combinations of related terms are. To explore this, we experimented with generating network diagrams.

Author Network

This diagram uses pairwise correlations to compare the term frequencies in each author’s work, via the tidygraph and ggraph R packages. The nodes represent authors, with edge density indicating correlation strength: a measure of how strongly related two variables are, where values near 1 mean they are related to the point of being almost identical and values near 0 mean they are not connected or dependent at all. To reduce visual clutter, only correlations higher than 0.4 are considered. Based on this criterion, one possible interpretation of the graph is that Kautsky used similar terms at a rate similar to Engels.
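The thresholding step can be sketched in a few lines of Python. This is only an illustration, not the R code behind the diagram: it assumes Pearson correlation (the post does not say which measure was used), the helper names `pearson` and `correlation_edges` are hypothetical, and the frequencies are invented toy values rather than project data.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / sqrt(var_x * var_y)

def correlation_edges(freqs, threshold=0.4):
    """Keep only author pairs whose correlation exceeds the threshold."""
    edges = []
    for a, b in combinations(sorted(freqs), 2):
        r = pearson(freqs[a], freqs[b])
        if r > threshold:
            edges.append((a, b, r))
    return edges

# Invented relative frequencies of the same four terms for three authors.
toy = {
    "Author A": [0.30, 0.20, 0.10, 0.05],
    "Author B": [0.28, 0.22, 0.09, 0.06],
    "Author C": [0.05, 0.10, 0.20, 0.30],
}

for a, b, r in correlation_edges(toy):
    print(f"{a} -- {b}: {r:.2f}")
```

Each surviving tuple corresponds to one edge in the graph; pairs at or below the threshold, including negatively correlated ones, are simply not drawn.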

Initial Experiments

The initial data corpus for this project is composed of twenty-five files, each available as both plain text and XHTML and encoded as UTF-8 (8-bit Unicode). These were produced from materials hosted by the Marxists Internet Archive, through a process of carefully cleaning the content via regular expressions. Extraneous presentational markup and line endings were removed, and presentational markup was replaced by semantic markup where applicable.
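As a rough illustration of this kind of cleanup, the Python sketch below normalizes line endings and swaps a few presentational tags for semantic ones. The specific patterns, the choice of tags, and the `clean_markup` name are assumptions; the expressions actually used for the project are not shown here.

```python
import re

def clean_markup(html):
    """Hypothetical cleanup pass; the rules are illustrative assumptions."""
    # Normalize Windows and old Mac line endings to a single "\n".
    text = re.sub(r"\r\n?", "\n", html)
    # Replace presentational <i> and <b> with semantic <em> and <strong>.
    # The \b keeps <img>, <br>, and <body> from matching.
    text = re.sub(r"<(/?)i\b[^>]*>", r"<\1em>", text)
    text = re.sub(r"<(/?)b\b[^>]*>", r"<\1strong>", text)
    # Drop <font> wrappers entirely, keeping their contents.
    text = re.sub(r"</?font\b[^>]*>", "", text)
    # Strip inline presentational attributes.
    text = re.sub(r'\s+(?:style|align)="[^"]*"', "", text)
    return text

raw = ('<p align="center"><font size="2"><b>Workers</b> of the '
       '<i>world</i></font></p>\r\n')
print(clean_markup(raw))
```

Regular expressions work for narrow, predictable patterns like these, but they cannot cope with arbitrary nesting, which is one reason a cleanup of this sort needs careful manual review.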

XHTML was selected because, as a markup language written according to the XML standard, it supports incorporating additional metadata. For this project we wanted to include information about the location where a document was originally written as well as the date, and this was accomplished using a set of elements defined by the Dublin Core Metadata Initiative. For example, Marx and Engels’s Manifesto of the Communist Party was created in Brussels, Belgium, in February of 1848, so the XHTML version contains the markup:

<meta name="DC.date" content="1848-02-21" />
<meta name="DC.coverage.x" scheme="Point" content="4.3333" />
<meta name="DC.coverage.y" scheme="Point" content="50.8333" />
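Because the files are well-formed XML, this metadata can be read back with a standard XML parser. The Python sketch below is one way to do that, not the project's own code: the surrounding document skeleton and the `dublin_core` helper are hypothetical, and it reads `x` as longitude and `y` as latitude, which is consistent with the location of Brussels.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical minimal document wrapped around the three <meta> elements.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Manifesto of the Communist Party</title>
<meta name="DC.date" content="1848-02-21" />
<meta name="DC.coverage.x" scheme="Point" content="4.3333" />
<meta name="DC.coverage.y" scheme="Point" content="50.8333" />
</head>
<body></body>
</html>"""

def dublin_core(xhtml):
    """Collect every <meta> whose name starts with "DC." into a dict."""
    root = ET.fromstring(xhtml)
    return {
        meta.get("name"): meta.get("content")
        for meta in root.iter(XHTML_NS + "meta")
        if meta.get("name", "").startswith("DC.")
    }

fields = dublin_core(doc)
longitude = float(fields["DC.coverage.x"])  # assumed: x is longitude
latitude = float(fields["DC.coverage.y"])   # assumed: y is latitude
print(fields["DC.date"], latitude, longitude)
```

With the date and coordinates extracted this way, each document can be placed on a timeline or a map without any further parsing of the text itself.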
