Enhancing Knowledge Extraction from Violent Ancient Historical Texts through Fine-Tuned Large Language Models and Historical Databases

URL
Document type: Master's thesis
Institute: Department of Computer Science
Language: English
Year of creation: 2024
Publication date:
Free keywords (German): LLMs, Deep Learning, Maschinelles Lernen, Wissensextraktion, Digitale Geisteswissenschaften
Free keywords (English): LLMs, Deep Learning, Machine Learning, Knowledge Extraction, Digital Humanities
DDC subject group: Computer science
BK classification: 54.72

Abstract in English:

This thesis explores the application of fine-tuned large language models (LLMs) for automating the detection and classification of violent events in ancient historical texts. The primary research objectives were to determine whether LLMs could accurately distinguish violent from non-violent texts and classify them across multiple dimensions of violence, such as level of violence, contextual background, underlying motives, and long-term consequences. Given the significant manual effort involved in annotating historical texts for violence, this research aims to alleviate that burden by automating the process using machine learning models. The thesis is structured around two core experiments. In the first experiment, we fine-tuned BERT and RoBERTa models on manually annotated examples of violent and non-violent texts. The datasets were derived from historical texts curated in the ERIS and Perseus digital humanities databases. The fine-tuned models demonstrated high accuracy in identifying violent passages, significantly outperforming general-purpose models like GPT-4o mini. Fine-tuning these models on domain-specific data led to superior precision and recall scores. In the second experiment, we expanded the scope to multi-class classification, categorizing violent events along four dimensions: level of violence (interpersonal, intrapersonal, intersocial, intrasocial), context (political, military, social, etc.), motive (strategic, emotional, religious, etc.), and long-term consequences (death, conquest, plunder, etc.). The results highlighted the models’ ability to handle complex categories, with high performance in frequently occurring classes, though challenges persisted in distinguishing more nuanced or low-frequency categories. Despite the overall success, limitations arose due to the inherent complexity of historical texts. Historians often rely on extra-textual knowledge, that is, insights beyond the text itself, to interpret nuanced contexts.
For example, understanding the Roman Senate allows historians to infer that a debate in this setting involves senators, even if this is not explicitly stated. Similarly, knowledge of cultural norms, like the symbolic meanings of Greek rituals, or awareness of an author’s biases, helps historians add depth to their interpretations. Social hierarchies also inform inferences, such as identifying a high-ranking official when a text mentions someone giving orders. These contextual layers are challenging for LLMs, which primarily rely on the explicit content of the text without access to such background knowledge. Additionally, the classification of violent acts is difficult even for experts, given the strict criteria used in historical analysis. Computational constraints also restricted the use of models larger than RoBERTa Large, and limited access to expansive annotated datasets posed further challenges. The thesis concludes that while automation through LLMs can significantly accelerate the annotation process, these models are not replacements for expert human analysis but rather serve as complementary tools. Future work will aim to expand the classification framework to include additional categories, explore larger and more advanced models, and address the cultural and contextual nuances inherent in ancient texts.
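
As an illustration of the four-dimension scheme and of why low-frequency classes are harder to judge, the following minimal Python sketch lists the category values named in the abstract and computes per-class precision and recall for one dimension. It is not the thesis implementation: the names DIMENSIONS and per_class_scores are hypothetical, the example labels are made up, and the category lists are incomplete wherever the abstract ends them with "etc.".

```python
from collections import Counter

# Category values named in the abstract. Only "level" is listed in full;
# the other three lists end with "etc." in the source and are incomplete.
DIMENSIONS = {
    "level": ("interpersonal", "intrapersonal", "intersocial", "intrasocial"),
    "context": ("political", "military", "social"),
    "motive": ("strategic", "emotional", "religious"),
    "consequence": ("death", "conquest", "plunder"),
}


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, support)} for one dimension.

    gold and predicted are equal-length sequences of label strings.
    Support is the number of gold examples carrying the label.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        scores[label] = (precision, recall, gold_count)
    return scores


# Made-up labels for the "context" dimension, for illustration only.
gold = ["military", "military", "military", "political", "social"]
pred = ["military", "military", "political", "political", "military"]
scores = per_class_scores(gold, pred)
for label, (precision, recall, support) in scores.items():
    print(f"{label:10s} precision={precision:.2f} recall={recall:.2f} support={support}")
```

In this toy data the single "social" example is misclassified, so its recall drops to zero, whereas one error among three "military" examples costs only a third. Reporting support next to each score makes that sensitivity of rare classes visible.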

Copyright notice

The German Copyright Act (Urheberrechtsgesetz, UrhG) applies without restriction to documents offered in electronic form via data networks. In particular:

Individual reproductions, e.g. copies and printouts, may be made only for private and other personal use (Section 53 of the Copyright Act). The production and distribution of further reproductions is permitted only with the express permission of the author.

Users are themselves responsible for complying with the legal provisions and may be held liable in the event of misuse.