Learnability of probabilistic context-free languages using small transformer models

Document type: Bachelor Thesis
Institute: Department of Computer Science
Language: English
Year of creation: 2024
Publication date:
Keywords: Large language models, Context-free grammar, GPT-2, AI
DDC subject group: Computer Science
BK classification: 54.72

Abstract (English):

This thesis investigates the relationship between memorization, generalization, and creativity in GPT-2 language models trained on artificial languages generated by probabilistic context-free grammars (PCFGs). Two PCFGs of differing complexity were used to generate synthetic corpora for training GPT-2 models of varying architectural complexity and with varying amounts of data. By evaluating the models' ability to infill missing suffixes in sequences from the training data, we observed that both overfitting and underfitting can produce outputs that superficially resemble creative generation. Models often generated novel sequences not present in the training data; however, we argue that these sequences lacked genuine creativity, as they resulted from simple recombinations of memorized patterns. Merely verifying that generated sequences are absent from the training data is therefore insufficient for assessing creativity, and we highlight the need for additional metrics. Analyzing the diversity and distribution of generated sequences, for example through suffix entropy and n-gram distributions, offers a starting point. Neither excessively complex models trained on vast amounts of data nor overly simple models trained on limited data optimally foster creativity; a balance of model complexity and training data is essential for models to learn the underlying grammatical structures without defaulting to memorization or oversimplification. This study advances our understanding of language model behavior with regard to memorization and generalization and provides a foundation for developing more nuanced evaluation methods for memorization, generalization, and creativity in AI.
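
To illustrate the two techniques the abstract names, sampling training corpora from a PCFG and measuring suffix diversity via entropy and n-gram distributions, here is a minimal Python sketch. The toy grammar is hypothetical (not one of the thesis's two PCFGs), and fresh grammar samples stand in for model-generated suffixes; in the thesis's actual setup, the completions would come from the trained GPT-2 models.

```python
import math
import random
from collections import Counter

# Hypothetical toy PCFG for illustration only (not one of the thesis's two
# grammars). Each nonterminal maps to a list of (expansion, probability) pairs.
PCFG = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["det", "noun"], 0.7), (["noun"], 0.3)],
    "VP": [(["verb", "NP"], 0.6), (["verb"], 0.4)],
}
TERMINALS = {"det", "noun", "verb"}

def sample(symbol, rng):
    """Expand a symbol top-down, choosing each rule by its probability."""
    if symbol in TERMINALS:
        return [symbol]
    expansions, probs = zip(*PCFG[symbol])
    chosen = rng.choices(expansions, weights=probs, k=1)[0]
    return [tok for sym in chosen for tok in sample(sym, rng)]

def suffix_entropy(suffixes):
    """Shannon entropy (bits) of the empirical distribution over suffixes;
    higher values indicate more diverse completions."""
    counts = Counter(tuple(s) for s in suffixes)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

rng = random.Random(42)
corpus = [sample("S", rng) for _ in range(1000)]   # stand-in training data
completions = [sample("S", rng) for _ in range(200)]  # stand-in model outputs
print("suffix entropy:", round(suffix_entropy(completions), 3))

# "Novel" sequences absent from the training data can still be mere
# recombinations of memorized n-grams, which is the abstract's point:
train_bigrams = {g for seq in corpus for g in ngrams(seq, 2)}
novel = [s for s in completions if s not in corpus]
recombined = sum(all(g in train_bigrams for g in ngrams(s, 2)) for s in novel)
print(f"{recombined}/{len(novel)} novel sequences use only training bigrams")
```

The final check makes the argument concrete: a completion can pass the naive "not in the training data" test while every one of its bigrams was memorized, which is why the abstract calls for distributional metrics beyond novelty alone.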

Copyright notice

Documents made available in electronic form over data networks are fully subject to the German Copyright Act (UrhG). In particular:

Individual reproductions, e.g. copies and printouts, may be made only for private and other personal use (Section 53 UrhG). The production and distribution of further reproductions is permitted only with the express consent of the author.

Users are themselves responsible for compliance with these legal provisions and may be held liable in case of misuse.