Homegrown AI models pivot to smaller models, native Arabic data amid growing competition

Try talking to your favorite AI tool in Arabic and ask it to speak in your dialect, and you’ll be surprised how authentic it sounds. For your day-to-day requests and interactions, big players like ChatGPT and Gemini are equipped to carry out a conversation in your dialect and language of choice, including in Arabic — and do it well.

If the generic, Western AI models are so great, how, and why, are local AI players competing? The short answer: Enterprise and sovereign adoption are a different ball game. The tasks that businesses and governments need from AI are much more complex, and a lack of native Arabic data leads to poorer results in Arabic. Data security, ethics, and cultural considerations also drive the need for local AI models.

It’s no longer a matter of reaching the scale of OpenAI, Google, Meta, or Anthropic. It’s a matter of ownership, control, and filling a gap. UAE models like Technology Innovation Institute’s Falcon and Mohamed bin Zayed University of AI’s Jais were once touted as potential rivals to the likes of ChatGPT and Gemini, but their developers now focus on building smaller, niche enterprise tools atop existing models. Startups like AI language and translation firm Tarjama and CNTXT AI have taken the same approach.

One objective is to establish control at the sovereign level. A key driver for Abu Dhabi’s government-funded Technology Innovation Institute (TII) is to ensure full oversight and control over the AI used by public institutions, including an understanding of what data is processed, how, and where, Hakim Hacid, chief researcher at TII, told EnterpriseAM UAE.

Global models also fall short of the region’s linguistic and cultural needs, our sources agreed: Data translated from English will never accurately reflect cultural nuances and can introduce biases. “We do not face the same problems, and we do not need the same solutions [...] it shouldn’t be one-size-fits-all,” Mohammad Abu Sheikh, CEO of CNTXT AI, told EnterpriseAM UAE. Tarjama’s CEO Noor Al Hassan echoed that sentiment. “The models are not built to think in Arabic,” she said. “You have to change the way you train in order to get better results,” she added.

But is there enough native Arabic data out there? Most LLMs are trained primarily on English and other common (mostly Western) languages. This leaves languages with different alphabets, such as Arabic, neglected, and most of the Arabic data used to train LLMs comes from English-to-Arabic translations. That hurts the quality of the output, and is especially challenging for industries like healthcare and law, where the stakes for accuracy and cultural nuance are higher.

There are initiatives setting out to change that: Tunisian natural language processing startup Clusterlab developed the curated 101 Bn Arabic Words Dataset, the result of rigorous mining of Arabic data online, to help support Arabic language models.

Data mining proved to be the main sticking point for TII as well, but the result — its open-source Falcon Arabic model — significantly outperforms other models of its size, and supports a wide range of Arabic dialects. The team avoided using translated corpora and focused on 100% native Arabic data, lead researcher at TII Basma El Amel Boussaha said.

Another of Falcon’s main attractions is its compact size. “The trend shifted towards building smaller models and this is something that we also followed,” Boussaha said. Part of the appeal is they’re easier to adapt for specific use cases — especially by smaller organizations like startups or universities, she added.

Other Arabic AI developers agree: The solution is to build smaller, but smarter. “Small models are usually better at solving one specific task than generic ones — [they’re] niche, small, and cost-efficient, and you can deploy them on-prem or on private cloud,” Al Hassan added. Tarjama’s strategy is to focus on business-specific use cases — particularly content-heavy document workflows — rather than broad, conversational AI. It does so through its Arabic-first language model, Pronoia — a 14 bn parameter model designed to solve complex tasks in Arabic.

Local AI developers are also seeing increased demand for agentic AI: A platform that supports the small language model and lets enterprises deploy task-specific agents built on Arabic models offers the kind of end-to-end approach that enterprise buyers are looking for, Al Hassan said. “Clients want to have agents that automate tasks, increase efficiency, and drop costs,” she added.

Voice-first models are also in demand: CNTXT AI is taking a different approach, developing a voice-first model that handles Arabic dialects and works well in spoken government or enterprise workflows. Munsit, its Arabic speech-to-text model, is one of its core products and is being deployed in sectors like healthcare, law, and public services. CNTXT AI has also partnered with companies like Actualize to embed its automatic speech recognition (ASR) into their own voice agents. The end goal? Building speech-to-speech models, Abu Sheikh said.

Still, whether you would be better off opting for a generic LLM or an Arabic-first one depends largely on the nature of your industry and application. Enterprise software firm Sensei Labs has seen growing demand for Arabic-native interfaces among users of its flagship product Conductor and its AI tool Harmony, particularly from public sector clients in the UAE. But it has not found it necessary to shift to Arabic-specific LLMs, CEO Jay Goldman said.

“We have gotten a better result so far out of more general purpose LLMs with a much larger training set,” Goldman added. Sensei Labs uses Modern Standard Arabic and configurable terminology to adapt to different dialects, he explained.

What can help make local offerings more attractive? The team at TII plans to focus on introducing multimodality to its product, so it can interpret audio and video in addition to text, Boussaha said. Meanwhile, other developers agree that the more AI-ready local firms are, the better the offerings of local AI developers will be. Market interest isn’t the problem; the rollout is where things get tricky, Abu Sheikh said. “[Local firms] need [to have the] infrastructure and high quality data, and then you’re able to build cool applications on top,” he added.

The consensus is: The window is wide open, but only if infrastructure, talent, and adoption catch up. The best way to keep those three from falling behind is a commitment to open-sourcing and benchmarking, Hacid said. “We really plan to keep [the path open] for players to play in this field,” he added.
