How do you teach somebody to read a language if there’s nothing for them to read? This is the problem facing developers across the African continent who are trying to train AI to understand and respond to prompts in local languages.
To train a language model, you need data. For a language like English, the easily accessible articles, books and manuals on the internet give developers a ready supply. But for most of Africa's estimated 1,500 to 3,000 languages, few written resources are available. Vukosi Marivate, a professor of computer science at the University of Pretoria in South Africa, uses Wikipedia article counts to illustrate the scale of the gap. English has over 7 million articles. Tigrinya, spoken by around 9 million people in Ethiopia and Eritrea, has 335. Akan, the most widely spoken native language in Ghana, has none.
Of those thousands of languages, only 42 are currently supported by a language model. And of Africa's 23 scripts and alphabets, only three are available: Latin, Arabic and Ge'ez (used in the Horn of Africa). This underdevelopment "comes from a financial standpoint," says Chinasa T. Okolo, the founder of Technēculturǎ, a research institute working to advance global equity in AI. "Even though there are more Swahili speakers than Finnish speakers, Finland is a better market for companies like Apple and Google."
If more language models are not developed, the impact across the continent could be dire, Okolo warns. “We’re going to continue to see people locked out of opportunity,” she told CNN. As the continent looks to develop its own AI infrastructure and capabilities, those who do not speak one of these 42 languages risk being left behind.

To avoid this, Okolo says AI developers across the continent “have to reenvision the way that we undertake model development in the first place.”
This is what Marivate has done. He led the South African arm of the African Next Voices project, which has recorded 18 languages across South Africa, Kenya and Nigeria. Over two years, the three national teams collected 9,000 hours of recordings from people of different ages and locations, creating a data set that AI developers across the continent can use to train models.
The researchers would sometimes give native speakers scripts to read, but mostly gave them a prompt and recorded their responses, which were then transcribed. For isiNdebele, spoken in South Africa and Zimbabwe, written resources were so hard to find that the team resorted to a government manual for goat herders to help write its prompts.
African Next Voices has not collected enough data to train a large language model (LLM) like those behind ChatGPT or Gemini, which can cover thousands of topics in detail. Instead, Marivate says, the teams focused their recordings on specific topics, such as health and agriculture, that were deemed the most important.
Specialized models
Using a small data set to build a generalized model would lead to a high error rate, but small, focused data sets can be highly accurate within the limited scope of a specialized model, explained Nyalleng Moorosi, a research fellow at the Distributed AI Research Institute (DAIR) who is not affiliated with the African Next Voices project.
For her, it’s a question of “prioritizing error.” “If somebody just wants to find out what’s happening in downtown Nairobi, I can tolerate errors there,” Moorosi said, but mistakes in models that deal with topics like banking or health care could have serious consequences.
“We need to make sure that the people who build these models understand the consequences, they understand the cultures enough to understand the weight of these errors,” Moorosi told CNN.

Words and symbols, she says, carry multiple meanings. The St George's cross, for example, has associations with right-wing politics in the UK that are not obvious to someone from Ghana or Lesotho. This problem is particularly acute for low-resource languages. "There's a lot of contextual knowledge, there is little documentation," she says.
A study by DAIR found that social media websites had failed to recognize and remove hate speech related to ethnic violence in Ethiopia in part because automated systems and human moderators were not familiar with the slang terms being used.
Moorosi says that without this cultural understanding, it is impossible to make “AI systems perform and make judgments that are aligned to our beliefs and values.”
Although many Africans speak multiple languages, including the African and European languages already supported by language models, Moorosi believes the aim should be to make AI accessible in all languages, "even for languages that have one speaker. All languages deserve representation or preservation."
However, a lack of data is not the only challenge facing African AI developers. Most African languages are not codified through dictionaries or grammatical studies. In Kinyarwanda, the language of Rwanda, there are three common ways of spelling the name of the country: uRwanda, Urwanda, and u Rwanda. Without standardized spelling, even the most basic text processing becomes difficult.
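A simple word count illustrates the problem. In the hypothetical Python sketch below, the spelling variants come from the example above, while the folding rule is an illustrative assumption rather than any official orthographic standard. A naive frequency count treats each variant as a different word until a normalization step merges them:

    from collections import Counter

    # The three spellings of "Rwanda" cited above, in a toy corpus.
    corpus = "uRwanda Urwanda u Rwanda uRwanda"

    # Naive count: the same country name shows up as four different tokens,
    # and "u Rwanda" even splits in two.
    print(Counter(corpus.split()))
    # Counter({'uRwanda': 2, 'Urwanda': 1, 'u': 1, 'Rwanda': 1})

    def normalize(text):
        # Fold the known variants into one canonical form (an assumed
        # convention, chosen only for illustration).
        return text.replace("u Rwanda", "uRwanda").replace("Urwanda", "uRwanda")

    # After normalization, every mention counts as the same word.
    print(Counter(normalize(corpus).split()))
    # Counter({'uRwanda': 4})

A real system would need rules like this for every variant of every word, which is exactly the kind of knowledge a dictionary or grammar would normally supply.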
Another issue is the lack of data centers. The African Union warned in 2024 that only 10% of the continent’s data center demand was being met, presenting a bottleneck for Africa’s AI hopes.
The worry, for Marivate, is that if models are not made for these smaller languages, they will "disappear." And when it comes to creating data sets for languages that might not even have writing systems, "the model is going to have to change," he adds.
The African Next Voices project has only just finished collecting and transcribing its data. Marivate says it is currently not working on new languages, but he is already thinking about which could be next.