NU hosted a briefing with the participation of Professor Atakan Varol, Founder and Head of the Institute of Smart Systems and Artificial Intelligence (ISSAI), Yerbol Absalyamov, Deputy Director for Operations, and Madina Abdrakhmanova, Deputy Director for External Relations. The speakers described the development of KazLLM, a large language model for the Kazakh language.
As noted by the developers, ISSAI began collecting data in March of this year and is now training the model using a cloud computing platform with a small number of NVIDIA H100 nodes.
“So far, we have made significant progress, and all the employees working on this project except me are Kazakhstani: students of NU or other universities, such as Astana IT University, Bolashak graduates, and local people. At the end of this project, we will create KazLLM, but the most important outcome of this project will be the creation of a workforce that can create cutting-edge generative AI tools and products. And in this specific technology, we are not far behind other countries. After creating KazLLM and its models, we will be 18 months behind them. Once we integrate voice, it will be 12 months. Once we create language vision models, we will be at the cutting edge and do what those other countries do. The important thing is that we are doing this for the people of Kazakhstan in the Kazakh language,” said Professor Varol.
The diverse data sources for the project include articles from Wikipedia, news outlets, government websites, and open data sets (e.g., Common Crawl) in the public domain. Over the past five years, ISSAI has developed numerous natural language processing datasets specifically for the Kazakh language. The project also addresses national and information security concerns, since relying on foreign products carries the risk of data leakage and of distorted information being presented to users.
“The model training corpus will consist of at least 100 billion tokens comprising Kazakh, Russian, English and Turkish, with each language represented by 25 billion tokens. We now have more than 30 billion tokens. A token is a unit for measuring data: a word or part of a word. 26 billion tokens were created using the Tilmash translator to translate data from English into Kazakh. Our model can now output literate Kazakh. In addition, we will create an interactive interface for users, as OpenAI has done,” added Madina Abdrakhmanova, Lead Data Scientist.
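The corpus figures quoted above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the variable names are hypothetical, and "more than 30 billion" is treated as a lower bound of exactly 30 billion.

```python
# Figures quoted at the briefing, in billions of tokens.
# All names here are hypothetical and chosen for illustration.
TARGET_PER_LANGUAGE = 25
LANGUAGES = ["Kazakh", "Russian", "English", "Turkish"]
COLLECTED_SO_FAR = 30  # "more than 30 billion", so treated as a lower bound

# Four languages at 25 billion tokens each give the stated corpus target.
target_total = TARGET_PER_LANGUAGE * len(LANGUAGES)

# Share of the target already collected, and the most that is still missing.
progress = COLLECTED_SO_FAR / target_total
remaining_at_most = target_total - COLLECTED_SO_FAR

print(f"Target corpus: {target_total} billion tokens")
print(f"Collected so far: at least {progress:.0%} of the target")
print(f"Still to collect: at most {remaining_at_most} billion tokens")
```

On these figures, the per-language targets add up to the stated 100 billion tokens, and at least 30 percent of that target had been collected at the time of the briefing.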
To ensure widespread use, ISSAI will offer a subscription to the platform for general users and a specialized application programming interface (API) for advanced users, so that the latter can integrate the models into their own products. The platform will support interaction with the model, reinforcement learning from human feedback, and tuning for optimal performance in different scenarios. The API will allow seamless integration of the model into websites, smartphone apps, software code and desktop applications.
The first Kazakh LLM is scheduled to launch on December 16, 2024, the thirty-third anniversary of the Republic of Kazakhstan’s independence. The Ministry of Digital Development, Innovation and Aerospace Industry of the Republic of Kazakhstan, NU and NIS Endowment Fund, and the NU Social Development Fund are supporting the project financially.