Projects

Corpora, models and tools — mostly around poetry, emotion and digital ecocriticism.

GitHub — @tnhaider ↗

Deutsches Lyrik Korpus (DLK) ↗ Corpus ★ 20 DLK
A large corpus of New High German poetry, built by collecting and parsing the bulk of digitized public-domain German verse, chiefly from the Deutsches Textarchiv and the TextGrid Digital Library. The current v6 holds ~65,700 poems by ~250 authors (over 12 million tokens), segmented to syllable level. It underpins much of my distant-reading work on verse.
PO-EMO — Aesthetic Emotions in Poetry ↗ Python ★ 13 poetry-emotion
A corpus of German and English poetry annotated line by line for the aesthetic emotions it evokes (beauty/joy, sadness, suspense, awe, and more), with prosody annotation and modeling code. It comprises 158 German and 64 English poems, double-annotated, plus a Chinese pilot. The dataset behind the PO-EMO paper.
Metrical Tagging in the Wild ↗ Python ★ 5 metrical-tagging-in-the-wild
German and English poetry corpora annotated with prosodic features — syllable stress, meter, measure, caesura — together with the neural taggers trained on them. Code and data for the EACL 2021 paper.
German Rhyme Corpus ↗ Corpus ★ 4 german-rhyme-corpus
A diachronically balanced sample of German poetry, manually annotated for rhyme — notably, about a third of stanzas don’t rhyme at all. Encoded in TEI P5; the resource behind the supervised rhyme-detection paper (SIGHUM 2018).
Poetry Corpus Building ↗ Python ★ 3 poetry-corpus-building
Tools for scraping, cleaning, and assembling large poetry corpora from heterogeneous sources.
English Gutenberg Poetry ↗ Corpus ★ 4 english-gutenberg-poetry
An English poetry corpus mined from Project Gutenberg (via GutenTag) and prepared for computational analysis.
Poetry Gold ↗ Corpus ★ 2 poetry-gold
A diachronically balanced, hand-selected gold corpus of English poetry for evaluating automatic analysis.
Antikörperchen Poetry Corpus ↗ Python antikoerperchen-german-annotated-poetry
Canonical German poems paired with student interpretations and several annotation layers, crawled from the Antikörperchen poetry site. (See also the related “antik” repository.)
EPG64 — Annotated English Poetry ↗ Corpus epg64-english-poetry-annotated
A small reference set of English poems with manual gold annotation.
XML Poetry Tooling ↗ Python xml-poetry-reader
Parsers for reading poetry encoded in TEI/DTA-conforming XML into analysis-ready formats. (See also “poetry-api-xml-dta”.)
Narrator Semantic Change ↗ Jupyter narrator_semantic_change
Data and code behind “The Ongoing Birth of the Narrator” (DSH 2025): annotation guidelines, the gold-labeled instances, and the model folds used to trace the emergence of the author–narrator distinction in literary criticism.
NYT Fiction Bestseller List ↗ Data nyt-fiction-bestseller-list
A dataset of titles and authors on the New York Times fiction bestseller list, 2000–2020 — the basis for the “More Social, Less Religious” trend study.
Literotica Corpus ↗ Corpus ★ 11 literotica-corpus
A corpus of over 110,000 erotic fan-fiction documents crawled from literotica.com, organized by user rating and genre — assembled for computational study of genre, style, and narrative.
Web Crawlers ↗ Python ★ 4 crawler
A set of web crawlers used to build text corpora — for the Antikörperchen poetry site, Literotica, and MGG-Online (musicology).
DHd2024 Book of Abstracts ↗ HTML DHd2024-BoA
Scripts that turn TEI-encoded XML into the printed PDF Book of Abstracts for the DHd2024 conference in Passau, which I co-organized. Adapted from a TEI-to-PDF pipeline.
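As a rough illustration of what reading TEI-encoded verse involves (see XML Poetry Tooling above), here is a minimal sketch using only the Python standard library. It is not the code from xml-poetry-reader: the sample markup, the `type="stanza"` convention and the function name `read_stanzas` are assumptions made for the example, and real DTA or TextGrid files vary in how they mark stanzas.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; the "tei" prefix is only a local alias for the queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document, invented for this sketch.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg type="poem">
      <lg type="stanza">
        <l>First line of the first stanza</l>
        <l>Second line of the first stanza</l>
      </lg>
      <lg type="stanza">
        <l>Only line of the <hi>second</hi> stanza</l>
      </lg>
    </lg>
  </body></text>
</TEI>"""


def read_stanzas(xml_text):
    """Return a list of stanzas, each a list of verse lines as plain strings."""
    root = ET.fromstring(xml_text)
    stanzas = []
    # Assumption: stanzas are <lg type="stanza"> line groups.
    for lg in root.iterfind(".//tei:lg[@type='stanza']", TEI_NS):
        lines = []
        for line in lg.findall("tei:l", TEI_NS):
            # itertext() also collects text inside inline markup such as <hi>;
            # split/join normalizes whitespace.
            lines.append(" ".join("".join(line.itertext()).split()))
        stanzas.append(lines)
    return stanzas


stanzas = read_stanzas(SAMPLE)
for number, stanza in enumerate(stanzas, start=1):
    print(number, stanza)
```

The nested list (poem, stanza, line) is just one possible analysis-ready shape; a flat table with stanza and line indices would serve equally well.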

GitLab · Universität Passau — @haider ↗

Bundestagsprotokolle ↗ teich/bundestagsprotokolle
A corpus and processing pipeline for German parliamentary proceedings (Bundestag Plenarprotokolle), underpinning our work on linking parliamentary speakers to Wikidata and on German nuclear-energy discourse. (A collaboration with M. Teich; repository internal.)
DHd Citation Network ↗ gassner/dhd_citation
Citation, co-authorship and keyword network analysis of the DHd conference abstracts (2014–2023) — the code behind “Quo Vadis Fachbereiche und Schulen der DHd.” (A collaboration with S. Gassner; repository internal.)
Multilingual Valence Scaling ↗ multiling-valence-scaling
A curated collection of psycholinguistic valence (affective pleasantness) rating datasets: 37 datasets across 13 languages, mapped to Concepticon via NoRaRe. It is used to evaluate LLM valence scaling with Best-Worst Scaling across 16 models (0.5B–671B parameters). The basis of the forthcoming valence-norms paper.
Ecocriticism Bibliography Pipeline ↗ bib_pipeline_eco
A pipeline that extracts plain text from PDFs and uses a local Mistral model to pull book metadata (and, soon, literary mentions) from a corpus of secondary literature on ecocriticism, with the aim of mapping the “ecocritical canon.”
EcoCor — Corpus Extraction ↗ ecocor_extraction
Code and data for retrieving entities and texts from the EcoCor API — part of the open EcoCor infrastructure for digital ecocriticism introduced at DH2026.
Enjambement in German & Spanish ↗ enjambement-spanish
Code and data for detecting enjambement — line breaks that run against syntax — in German and Spanish poetry. Supports the forthcoming CHR paper. (Repository is internal.)
EcoCor — Modeling ↗ ecocor_pred
Modeling and prediction over the EcoCor ecocriticism corpus. (Repository is internal.)
EcoHack Modeling ↗ ecohack_modeling
Modeling experiments around ecocriticism and the environmental humanities. (Repository is internal.)
Valence Scaling ↗ valence_scaling
Earlier experiments on automatically scaling affective (valence) norms for words. (Repository is internal.)
hopecore ↗ hopecore
(Description to come.)
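For readers unfamiliar with Best-Worst Scaling (mentioned under Multilingual Valence Scaling above), the sketch below shows the common counting procedure: each item's score is the number of times it was picked as best minus the number of times it was picked as worst, divided by how often it was shown. This is a generic illustration, not the repository's implementation; the function name `bws_scores`, the trial format and the example words are all invented here.

```python
from collections import Counter


def bws_scores(trials):
    """Score items from Best-Worst Scaling trials by simple counting.

    Each trial is a tuple (items, best, worst): the items shown together,
    and the ones judged most and least positive. Scores fall in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, picked_best, picked_worst in trials:
        seen.update(items)          # how often each item was shown
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {item: (best[item] - worst[item]) / n for item, n in seen.items()}


# Hypothetical judgments over tuples of four words (invented data).
trials = [
    (("joy", "table", "grief", "sun"), "joy", "grief"),
    (("sun", "mud", "grief", "table"), "sun", "grief"),
    (("joy", "mud", "table", "sun"), "joy", "mud"),
]

scores = bws_scores(trials)
for word, score in sorted(scores.items(), key=lambda pair: -pair[1]):
    print(f"{word}: {score:+.2f}")
```

In an LLM evaluation, the best and worst choices would come from the model rather than from human raters, and the resulting scores could then be compared against published norms.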