Description

The GPT-NL initiative, a collaboration between TNO, SURF, and NFI, aims to support the Large Language Model community through the development of a state-of-the-art research facility and the pursuit of a competitive yet responsibly designed language model. We believe that technology should be reciprocal, trustworthy, and transparent, and should ensure the sovereignty of our citizens and institutions; GPT-NL is designed to reflect these values.

Problem Context

The current LLM market does not offer sufficient transparency, nor can existing models be guaranteed to comply with our laws and values. In addition, the quality of current LLMs, which are trained on translated Dutch texts, is insufficient when it comes to understanding Dutch in complex cases.

Solution

We offer an alternative to existing LLMs: a useful, law-compliant, and sovereign large language model. In doing so, we will demonstrate that it is possible to create language models that comply with our laws and values.

  • Trustworthy: GPT-NL will be built from scratch to ensure a clean data chain. The model will be trained exclusively on a combination of opt-in data, data that may legally be used for training LLMs, and non-IP-infringing synthetic data.
  • Reciprocal: We work closely with data providers, use only opt-in data, and are working towards a new system in which copyright holders can be compensated for the use of their data.
  • Transparent: GPT-NL will be supported by extensive documentation to enable understanding and transparency, including a datasheet describing the qualities of the dataset and a model card describing the qualities of the model (a sketch of what such documentation might record follows this list). By sharing the knowledge gained during model development, we aim to foster a collaborative and open environment for LLM research.
  • Sovereign: By building an LLM from scratch in accordance with Dutch norms and values, we ensure and enforce the sovereignty of our citizens and institutions.
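As an illustration of the kind of documentation meant in the Transparent point above, a minimal sketch in Python follows. The field names and values are hypothetical assumptions for illustration, not the actual GPT-NL datasheet or model card.

    # Minimal sketch of the metadata a model card and datasheet might record.
    # All field names and values below are illustrative assumptions, not the
    # real GPT-NL artifacts.
    model_card = {
        "model_name": "GPT-NL",
        "intended_tasks": ["text generation", "summarization", "simplification"],
        "training_languages": ["Dutch", "English", "German"],
        "training_data": "opt-in, legally usable, and synthetic sources only",
    }
    datasheet = {
        "data_sources": "opt-in agreements with data providers",
        "licensing": "documented per source",                   # assumed field
        "known_limitations": "documented during development",   # assumed field
    }
    print(model_card["model_name"], "tasks:", ", ".join(model_card["intended_tasks"]))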

Technical aspects: GPT-NL will be trained on at least 300 billion text tokens, drawn mostly from Dutch, English, German, and code sources. The model will be able to perform text generation, summarization, and simplification tasks at a level of performance comparable to the Llama2 7B and GPT-3 175B models. The dataset will be completed with 150 billion tokens of code.
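As a rough illustration of how such a token budget could be allocated, here is a small Python sketch. The 300 billion total and the 150 billion code tokens come from the figures above; the per-language split is purely an assumed example, not the actual GPT-NL data mix.

    # Hypothetical sketch of a training-data token budget.
    # Only the 300B total and the 150B of code are stated targets;
    # the per-language split below is an illustrative assumption.
    TOTAL_TOKENS = 300e9   # at least 300 billion tokens (stated target)
    CODE_TOKENS = 150e9    # 150 billion tokens of code (stated target)

    language_mix = {"Dutch": 0.5, "English": 0.3, "German": 0.2}  # assumed shares

    remaining = TOTAL_TOKENS - CODE_TOKENS
    for lang, share in language_mix.items():
        print(f"{lang}: {share * remaining / 1e9:.0f}B tokens")
    print(f"Code: {CODE_TOKENS / 1e9:.0f}B tokens")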

Next steps

Data acquisition is the first step in building GPT-NL. It is also a time-consuming phase, since we want to do everything in line with legislation and in cooperation with data providers. That is why we are currently still focused on collecting data. We will start training GPT-NL in Q2 2025. This does not mean we have not reached other milestones: a reflection on the first year of GPT-NL can be read in our progress report, Progress report #1: Eén jaar GPT-NL (One Year of GPT-NL).

Contact

  • Contact point GPT-NL, e-mail: info@gpt-nl.nl