Teuken-7B AI Model: Open, multilingual power for you

29.01.2025

Since late November 2024, the first open-source AI language model developed within the EU has been available. A consortium led by two Fraunhofer Institutes and including other German research institutions was instrumental in this achievement. The unique aspect: the AI was trained on data in all 24 official EU languages.

While a $500 billion joint venture for investing in AI data centers is taking shape in the USA, the EU is working with renewed urgency to become more digitally sovereign and less dependent on foreign countries.

One beacon of hope is the OpenGPT-X joint project, funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems (IAIS) and for Integrated Circuits (IIS).

The project published the first European AI language model on the Hugging Face AI platform at the end of November 2024. “Teuken-7B” was trained on all 24 official EU languages and has seven billion parameters.

Tokens for the 24 Official EU Languages

For the developers of Teuken-7B, it was crucial that the distribution of training data reflect the official languages used in the EU. English accounts for just 41.7 percent, French for 9.1 percent, German for 8.7 percent, and Spanish for 8.0 percent.

The name “Teuken” evokes tokens, the small units of text that a language model processes, and tokenization, the splitting of text into those units. This is likely intentional. A specially developed multilingual “tokenizer” breaks words down into individual components. “The fewer tokens, the more (energy-) efficient and faster a language model generates responses,” according to the IAIS. Compared with the tokenizers of other multilingual models such as Llama 3 and Mistral, this one is said to incur significantly lower training costs.

This would particularly benefit European languages with very long words, such as German, Finnish, and Hungarian. Finnish and Hungarian, like Turkish, Japanese, and Korean, are agglutinative languages, in which a standard example like “in my houses” can form a single word; German produces similarly long words through compounding.
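To make the link between vocabulary and token count concrete, here is a minimal sketch. It is not Teuken's actual tokenizer: the greedy longest-match rule and both tiny vocabularies are assumptions chosen for illustration, and the Hungarian word “házaimban” (“in my houses”) is simply one instance of the single-word example mentioned above. The sketch compares “fertility”, meaning the average number of tokens per word, for a vocabulary with and without suitable subword pieces.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation (a simplification, assumed here).

    Any character that is not covered by the vocabulary becomes its own token.
    """
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.lower().split()
    if not words:
        raise ValueError("text must contain at least one word")
    return sum(len(tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical toy vocabularies, not taken from any real model.
ENGLISH_CENTRIC = {"in", "my", "houses"}
MULTILINGUAL = ENGLISH_CENTRIC | {"ház", "aim", "ban"}

if __name__ == "__main__":
    print(fertility("in my houses", ENGLISH_CENTRIC))  # 1.0 token per word
    print(fertility("házaimban", ENGLISH_CENTRIC))     # 9.0, one token per character
    print(fertility("házaimban", MULTILINGUAL))        # 3.0, three subword pieces
```

With the English-only vocabulary the Hungarian word falls apart into nine single-character tokens, while the vocabulary that contains matching pieces needs only three. Fewer tokens for the same text mean fewer processing steps, which is the efficiency argument quoted from the IAIS.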

Training on the Supercomputer in Jülich

The consortium includes, among others, the Fraunhofer Institutes, the Technical University of Dresden, the German Research Center for Artificial Intelligence (DFKI), and the Jülich Research Center. The latter provided the JUWELS supercomputer for the training. The model has been available under an open-source license since November 26, 2024.

For German, training costs are only slightly more than 20 percent higher than for English. That is significantly cheaper than with Llama 3 or OpenAI's GPT-4, which incur a surcharge of over 55 percent for German. Averaged over the languages other than English, the premium is about 37 percent with Teuken, around 87 percent with Llama 3, and far more than 100 percent with GPT-4 and Mistral.
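The percentages can be turned into a rough comparison. The sketch below assumes that English is the baseline with a relative cost of 1.0 and that a premium of p percent means a cost of 1 + p/100; both are simplifying assumptions, and the function names are hypothetical. Only the two average figures quoted in the text (about 37 and 87 percent) are used.

```python
# Approximate average premiums for non-English languages, as quoted in the text.
PREMIUM_PERCENT = {"Teuken": 37, "Llama 3": 87}


def relative_cost(premium_percent):
    """Cost relative to English (= 1.0), assuming the premium adds linearly."""
    return 1 + premium_percent / 100


def cost_ratio(premium_a, premium_b):
    """How many times more expensive a model with premium A is than one with premium B."""
    return relative_cost(premium_a) / relative_cost(premium_b)


if __name__ == "__main__":
    ratio = cost_ratio(PREMIUM_PERCENT["Llama 3"], PREMIUM_PERCENT["Teuken"])
    print(f"Llama 3 vs. Teuken on non-English text: {ratio:.2f}x")
```

Under these assumptions, processing the same non-English text costs roughly 1.87 / 1.37, or about 1.36 times as much with Llama 3 as with Teuken.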

Strengthening Digital Sovereignty

“Innovations like this bolster digital sovereignty, competitiveness, and resilience in Germany and Europe. That’s why the Federal Ministry for Economic Affairs and Climate Action (BMWK) is supporting the project with approximately 14 million euros,” explained Dr. Franziska Brantner, Parliamentary State Secretary at the BMWK.

On December 12, 2024, shortly after the model's publication, Deutsche Telekom announced that it is the first company to offer Teuken-7B commercially. The telecommunications giant sees the “Made in Germany” AI model as a crucial step in enhancing the digital sovereignty of businesses and authorities within the EU.

“The provision of Teuken as an open-source model offers several advantages: companies can tailor the model to their specific needs, developing specialized applications, for instance. Additionally, they can decide whether to run the model locally on their own infrastructure or with a trusted cloud provider of their choice. If desired, sensitive data can thus remain within the company,” Deutsche Telekom quotes IAIS project leader Dr. Nicolas Flores-Herr.

Developers from the scientific community and businesses can download Teuken-7B for free from Hugging Face and integrate the model into chatbots or RAG (Retrieval Augmented Generation) applications, for example.
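For readers unfamiliar with RAG, the following minimal sketch shows the general idea: retrieve the most relevant text for a question, then hand it to a language model together with the question. Everything here is an assumption for illustration. The word-overlap scoring stands in for a real retriever, and `generate` is a hypothetical placeholder for whatever function calls the model (for example a locally hosted Teuken-7B); no real model API is shown.

```python
import re
from collections import Counter


def words(text):
    """Lowercased bag of words for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: sum((q & words(doc)).values()),  # size of the word overlap
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Combine the retrieved context and the question into one prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, documents, generate, k=1):
    """`generate` is a hypothetical callable that maps a prompt to model output."""
    return generate(build_prompt(question, documents, k))


if __name__ == "__main__":
    docs = [
        "Teuken-7B was trained on all 24 official EU languages.",
        "JUWELS is a supercomputer at the Jülich Research Center.",
    ]
    # A stub stands in for the model so the sketch runs without any download.
    print(answer("Which supercomputer is in Jülich?", docs, generate=lambda p: p))
```

Because the documents stay in the application and only the assembled prompt reaches the model, this pattern fits the point made above about keeping sensitive data within the company when the model runs on local infrastructure.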

Source of title image: Adobe Stock / ShinneProject