Research in the humanities remains fundamentally text-based, especially when pioneering new fields or topics. I vividly recall my initial struggle with media theory texts as an inexperienced engineer at the beginning of my Master’s. Additionally, I identify as a “visual learner” – I typically find it challenging to fully grasp content presented solely as text and rely on graphics or visual elements to structure and format it for me. (Footnote: Whether such distinct learning types exist at all is still the subject of ongoing academic debate, which I won’t delve into here. From my anecdotal and unscientific self-observation, however, I find that mind maps, infographics, and images greatly aid my text comprehension.)

The following project aims to automate this pre-formatting using Natural Language Processing (NLP) and subsequent visual processing. The explicit goal is not to replace the process of reading, but to provide readers with an accessible entry point into the text. The project focuses specifically on typical media studies texts from the German-speaking humanities, a field largely ignored by the major players in machine learning.


Important Note

I consider these notes a work in progress, akin to a digital garden that documents my experiments like a lab journal. They therefore deliberately include potential missteps, dead ends, and incorrect assumptions in order to make my research process transparent.

First proof of concept

For my first “proof of concept,” I chose a classic media theory text: Walter Benjamin’s Das Kunstwerk im Zeitalter seiner technischen Reproduzierbarkeit (“The Work of Art in the Age of Mechanical Reproduction”),1 written around 1935 and comprising approximately 62,000 characters. I analyzed it with the Python library spaCy, using the model de_core_news_lg for an initial experiment.

To see results quickly, I used spaCy to search the text for all nouns (NOUN) and catalog them in a Python dictionary. One insight: it’s not worthwhile to also output proper nouns (PROPN),2 as the model struggles with old German orthography and tends to classify words it does not recognize (e.g., “daß” or “bewußt”) as proper nouns.

# extract all nouns from the parsed document
terms = [token.text for token in doc if token.pos_ == 'NOUN']
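
The snippet above presupposes a parsed doc object and does not show the cataloging step itself. A minimal sketch of the missing setup and the counting might look like this; the names nlp, doc and noun_counts as well as the use of collections.Counter are my assumptions, not the original code:

import spacy
from collections import Counter

# load the large German model and parse the (already cleaned) text
nlp = spacy.load('de_core_news_lg')
doc = nlp(text)

# catalog every noun in a dictionary that maps the noun to its frequency
noun_counts = Counter(token.text for token in doc if token.pos_ == 'NOUN')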

Currently, I’m trying to relate these nouns to one another by creating a co-occurrence map using a Python dictionary. If the co-occurrence count of a noun pair exceeds a threshold of 1, the script assumes a relationship between the two nouns, which is later taken into account in the visualization.
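
Since the text above does not specify the co-occurrence window, the following sketch assumes sentence-level co-occurrence; the names cooccurrence and related are illustrative and not taken from the original script:

from collections import defaultdict
from itertools import combinations

# count how often two nouns appear in the same sentence (assumed window)
cooccurrence = defaultdict(int)
for sent in doc.sents:
    nouns = sorted({token.text for token in sent if token.pos_ == 'NOUN'})
    for a, b in combinations(nouns, 2):
        cooccurrence[(a, b)] += 1

# keep only pairs that co-occur more than once, i.e. above the threshold of 1
related = {pair: count for pair, count in cooccurrence.items() if count > 1}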

Cleanup

The model also assumes that Roman numerals (›VI‹), ellipses (›…‹) and square brackets (›[123]‹) are nouns. Therefore, I apply a few regular expressions (using Python’s built-in re module) to clean up the text and remove these character combinations before applying the NLP:

# get rid of roman numerals (seem to confuse spaCy)
text = re.sub(r'(?=\b[MCDXLVI]{1,6}\b)M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})', ' ', text)

# get rid of citations and converted footnotes in the style of: test.‹[125]
text = re.sub(r'\[\d*]', ' ', text)

# get rid of ellipses (…)
text = re.sub(r'…', ' ', text)

# get rid of newlines
text = re.sub(r'\n', ' ', text)

# get rid of two or more consecutive whitespaces & replace with single whitespace
text = re.sub(r'[\s\t]{2,}', ' ', text)

Visualization

For visualization, I currently use the Python library networkx to create a network diagram from the Python dictionary. Without further configuration, the graphic is not particularly useful:

However, zooming in reveals interesting results:
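
For reference, here is a minimal sketch of how such a network diagram could be generated with networkx from the filtered co-occurrence pairs; the variable related follows the sketch above, and the drawing parameters are assumptions rather than the settings used for the figures:

import networkx as nx
import matplotlib.pyplot as plt

# build an undirected graph: nouns become nodes, co-occurrences become weighted edges
G = nx.Graph()
for (a, b), count in related.items():
    G.add_edge(a, b, weight=count)

# quick preview with matplotlib
nx.draw(G, with_labels=True, node_size=20, font_size=6)
plt.show()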

HTML and Interaction

Insight: A toolset is needed for more convenient zooming and interaction. The Python preview is adequate but somewhat clunky. Therefore, I use the Python library pyvis to create an HTML/CSS/JS-compatible webpage. With some styling and manual tweaking, it looks quite sophisticated:

Those interested can experiment with a demo at https://scinotes.org/benjamin. Please be patient: Network generation takes about 15 seconds, depending on your device.
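
For those curious about the export step, here is a minimal sketch of how the networkx graph might be turned into an interactive HTML page with pyvis; the dimensions and the output filename are placeholders, not the actual configuration behind the demo:

from pyvis.network import Network

# convert the networkx graph into an interactive HTML/CSS/JS page
net = Network(height='750px', width='100%')
net.from_nx(G)
net.save_graph('benjamin.html')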

TODOs

  • Currently, the NLP runs entirely on the CPU. With a switch to a new model, it should ideally be moved to the GPU via CUDA (see the sketch after this list).
  • Currently, the process extracts ›just‹ nouns from the text, which yields surprisingly good results for Benjamin’s text but can be improved significantly. In the medium term, de_core_news_lg should be extended with a model specifically optimized for typical concepts in German media studies.
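
As a starting point for the GPU item above, spaCy can be asked to use the GPU before a model is loaded. This is only a sketch of the general mechanism (it requires a CUDA-enabled installation with CuPy), not a tested setup for this project:

import spacy

# try to run spaCy/thinc on the GPU via CuPy/CUDA;
# prefer_gpu() returns True on success and falls back to the CPU otherwise
spacy.prefer_gpu()
nlp = spacy.load('de_core_news_lg')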

Learnings

  • The standard models de_core_news_sm and de_dep_news_trf yield significantly worse results in terms of accuracy – i.e., they often stumble over unusual compounds (composita) and non-German words.
  • Although the models are trained on news rather than scholarly texts, the results look quite promising. However, I still have to look into the training process much more closely to see whether fine-tuning these models with scholarly concepts produces meaningful outputs.

Footnotes

  1. Thanks to the great work of the Wikisource contributors, I was able to use the third version of the essay (authorized by Benjamin in 1939). The source can be accessed here. Reference: Walter Benjamin: Gesammelte Schriften, Band I, Teil 2, Suhrkamp, Frankfurt am Main 1980, pp. 471–508. ↩︎
  2. A great list that explains spaCy’s POS abbreviations (POS = part of speech) can be found here. ↩︎