We are pleased to announce a virtual talk in the JAII Lecture Series. Our presenter, Prof. Hinrich Schütze (LMU Munich; Homepage of Hinrich Schütze’s lab), is a renowned expert in computational linguistics who will speak about the scaling of large language models.
On July 6, Hinrich Schütze will talk about “Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages”. The talk will be virtual; this is the link to the lecture.
Large language models (LLMs) are currently the most active area of research in NLP. Most work has focused on what we call “vertical” scaling: making LLMs even better for a relatively small number of high-resource languages. We address “horizontal” scaling instead: extending LLMs to a large subset of the world’s languages, focusing on low-resource languages. Our Glot500-m model is trained on 500 languages, many of which are not covered by any other language model. I will talk about the major challenges we faced in creating Glot500: (i) finding, validating, and cleaning training data for that many languages; (ii) evaluating the performance of Glot500-m on languages for which native speakers and labeled datasets were not available to us; and (iii) determining the factors that ultimately make training on a language successful. We find that trying to reduce such factors to the so-called curse of multilinguality is naive, and that there is in fact also a “boon of multilinguality”. We are in the process of making Glot500-c, our training corpus covering 500 languages, publicly available.
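To give a flavor of challenge (i), the sketch below shows two common corpus-cleaning steps for web-crawled text in low-resource languages: dropping noisy or fragmentary lines and removing exact duplicates. This is a minimal illustration only, not the actual Glot500 pipeline; the function name, thresholds, and heuristics are assumptions chosen for the example.

```python
# Minimal corpus-cleaning sketch (illustrative; NOT the Glot500 pipeline).
# Two typical steps when assembling web-crawled text:
#   1. drop lines that are too short or dominated by non-letter noise,
#   2. remove exact duplicate lines.

def clean_corpus(lines, min_alpha_ratio=0.6, min_length=10):
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        if len(line) < min_length:
            continue  # fragment too short to be a useful sentence
        alpha = sum(ch.isalpha() for ch in line)
        if alpha / len(line) < min_alpha_ratio:
            continue  # mostly markup, numbers, or encoding noise
        if line in seen:
            continue  # exact-duplicate removal
        seen.add(line)
        cleaned.append(line)
    return cleaned

raw = [
    "Bonjour tout le monde, ceci est une phrase.",
    "12345 67890 ###",
    "Bonjour tout le monde, ceci est une phrase.",
    "ok",
]
print(clean_corpus(raw))
# → ['Bonjour tout le monde, ceci est une phrase.']
```

Real pipelines at this scale add language identification, script validation, and near-duplicate detection on top of such basic filters, which is far harder for languages with little reference data.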