29.01.2025

Since late November 2024, the first open-source AI language model developed within the EU has been available. A consortium led by two Fraunhofer Institutes and other German research institutions was instrumental in this achievement. What makes it unique: the AI was trained on data in all 24 official EU languages.

While a $500 billion joint venture for investing in AI data centers is taking shape in the USA, the EU is working with renewed urgency to become more digitally sovereign and less dependent on foreign countries.

One beacon of hope is the OpenGPT-X joint project, funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems (IAIS) and for Integrated Circuits (IIS).

The project published the first European AI language model on the Hugging Face AI platform at the end of November 2024. “Teuken-7B” was trained on all 24 official EU languages and has seven billion parameters.

Tokens for 24 EU Official Languages

For the developers of Teuken-7B, it was crucial that the distribution of training data reflected the official languages used in the EU. English accounts for just 41.7 percent, French 9.1 percent, German 8.7 percent, and Spanish 8.0 percent.

Teuken-7B uses optimized tokenization to process languages with long words more efficiently, which improves performance and reduces costs. (Image source: Adobe Stock / aimanasrn)

The name “Teuken” evokes tokens, the units of text that a language model processes, and tokenization, the segmentation of text into those units. This is likely intentional. A specially developed multilingual tokenizer breaks words down into individual components. “The fewer tokens, the more (energy-)efficient and faster a language model generates responses,” according to the IAIS. Compared with the tokenizers of other multilingual models such as Llama 3 and Mistral, this one is said to incur significantly lower training costs.

This would particularly benefit European languages with very long words, such as German, Finnish, and Hungarian. Finnish and Hungarian, like Turkish, Japanese, and Korean, are agglutinative languages, in which a phrase like “in my houses” can form a single word; German forms its long words mainly by compounding.
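To illustrate why vocabulary coverage matters for long words, here is a minimal sketch of a greedy longest-match subword tokenizer. It is a toy and not the OpenGPT-X tokenizer: the function `tokenize` and both vocabularies are hypothetical and chosen only to show the effect on a German compound.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation.

    Characters that match no vocabulary entry fall back to single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first and shrink until one is known.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabularies, for illustration only.
english_centric_vocab = {"data", "center", "s", "re", "ch", "en", "um"}
multilingual_vocab = english_centric_vocab | {"rechen", "zentrum", "daten"}

word = "rechenzentrum"  # German for "data center"
few = tokenize(word, multilingual_vocab)
many = tokenize(word, english_centric_vocab)

print(len(few), few)
print(len(many), many)
```

With the broader vocabulary the word splits into far fewer pieces, and since a model processes one token at a time, fewer pieces mean less computation for the same text.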

Training on the Supercomputer in Jülich

The consortium includes, among others, the Fraunhofer Institutes, the Technical University of Dresden, the German Research Center for Artificial Intelligence (DFKI), and the Jülich Research Center. The latter provided its JUWELS supercomputer for the training. Teuken-7B has been available under an open-source license since November 26, 2024. For German, training costs are only slightly more than 20 percent higher than for English. That is significantly cheaper than Llama 3 or OpenAI's GPT-4, which incur a surcharge of over 55 percent for German. Averaged over the languages other than English, the premium is about 37 percent with Teuken, around 87 percent with Llama 3, and well over 100 percent with GPT-4 and Mistral.
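The surcharge figures can be read as simple ratios. The sketch below assumes that cost scales linearly with the number of tokens and that the percentages describe extra tokens relative to English; the article does not state how they were measured, and the corpus size and function names are hypothetical.

```python
def token_surcharge_percent(tokens_other, tokens_english):
    """Extra tokens needed relative to English, in percent."""
    return (tokens_other / tokens_english - 1.0) * 100.0


def tokens_for_surcharge(tokens_english, surcharge_percent):
    """Token count implied by a given surcharge over the English baseline."""
    return tokens_english * (1.0 + surcharge_percent / 100.0)


# Hypothetical English baseline; surcharges for German rounded from the
# article ("slightly more than 20 percent" and "over 55 percent").
english_tokens = 1_000_000
teuken_german = tokens_for_surcharge(english_tokens, 20)
llama_german = tokens_for_surcharge(english_tokens, 55)

# Relative saving of the lower surcharge, under the linear-cost assumption.
saving_percent = (1.0 - teuken_german / llama_german) * 100.0
print(f"{saving_percent:.1f} percent fewer tokens for the same German text")
```

Under these assumptions, the gap between a 20 and a 55 percent surcharge corresponds to roughly a fifth fewer tokens for the same German text.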

Strengthening Digital Sovereignty

“Innovations like this bolster digital sovereignty, competitiveness, and resilience in Germany and Europe. That’s why the Federal Ministry for Economic Affairs and Climate Action (BMWK) is supporting the project with approximately 14 million euros,” explained Dr. Franziska Brantner, Parliamentary State Secretary at the BMWK.

On December 12, 2024, shortly after the model's publication, Deutsche Telekom announced that it is the first company to offer Teuken-7B commercially. The telecommunications giant sees the “Made in Germany” AI model as a crucial step in enhancing the digital sovereignty of businesses and authorities within the EU.

“The provision of Teuken as an open-source model offers several advantages: companies can tailor the model to their specific needs – developing specialized applications, for instance. Additionally, they can decide whether to run the model locally on their own infrastructure or with a trusted cloud provider of their choice.

If desired, sensitive data can thus remain within the company,” Deutsche Telekom quotes IAIS project leader Dr. Nicolas Flores-Herr.

Developers from the scientific community and businesses can download Teuken-7B for free from Hugging Face to integrate the model into chatbots or RAG (Retrieval Augmented Generation) applications, for example.
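As a rough idea of what a RAG integration involves, the following sketch retrieves the passages most relevant to a question by simple word overlap and assembles them into a prompt. It is a simplified illustration, not a recommended setup: the helper names are hypothetical, production systems typically use embedding-based retrieval, and `generate` stands in for whatever function calls the language model.

```python
import re


def words(text):
    """Lowercased set of word tokens in a text."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Rank documents by the number of words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, documents, generate, top_k=2):
    """`generate` is any callable that maps a prompt string to a completion."""
    return generate(build_prompt(question, retrieve(question, documents, top_k)))


documents = [
    "Teuken-7B was trained on all 24 official EU languages.",
    "JUWELS is a supercomputer at the Jülich Research Center.",
    "Deutsche Telekom offers Teuken-7B commercially.",
]

question = "Which supercomputer is in Jülich?"
print(build_prompt(question, retrieve(question, documents, top_k=1)))
```

Because the model call is passed in as a function, the same retrieval code works whether the model runs locally or with a cloud provider.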

Source of title image: Adobe Stock / ShinneProject

A magazine by Evernine Media GmbH