The NOOR model is based on the popular Transformer architecture
Abu Dhabi-based Technology Innovation Institute (TII) has launched NOOR, an Arabic language model intended to power automated summarisation, chatbots, and personalised marketing in Arabic. The NOOR model is based on the popular Transformer architecture. As a decoder-only model, similar in structure to GPT-3, it is designed for generative tasks, and its architecture has been updated to reflect recent developments in machine learning, including improved positional embeddings.
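For readers unfamiliar with the term, a decoder-only model generates text one token at a time, and each position may only look at itself and the positions before it. The short Python sketch below illustrates that causal attention pattern in general terms. It is not taken from NOOR's code, and the function name `causal_mask` is purely illustrative.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len table of booleans.

    mask[i][j] is True when position i is allowed to attend to position j,
    which in a decoder-only model means j is not later than i.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Print the pattern for a four-token sequence: "x" = visible, "." = hidden.
    for row in causal_mask(4):
        print("".join("x" if allowed else "." for allowed in row))
```

The lower-triangular pattern is what lets such a model be trained to predict the next token without seeing it in advance.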
To help ensure quality at scale in the NOOR dataset, the TII team designed an automated filtering pipeline based on machine learning techniques. These tools identify text that resembles high-quality reference material and shield the model from exposure to spam content.
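TII has not published the details of this pipeline here, so the following is only a minimal sketch of the general idea: score each document by how closely its vocabulary matches known good references rather than known spam, and keep the documents that score above a threshold. The word-count model, the function names (`train`, `quality_score`, `filter_corpus`), and the threshold are all assumptions made for illustration, not TII's method.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenisation; a real pipeline would use a proper tokenizer.
    return text.lower().split()


def train(reference_docs, spam_docs):
    """Return a per-token log-odds weight: positive means 'reference-like'."""
    ref_counts = Counter(t for d in reference_docs for t in tokenize(d))
    spam_counts = Counter(t for d in spam_docs for t in tokenize(d))
    vocab = set(ref_counts) | set(spam_counts)
    ref_total = sum(ref_counts.values())
    spam_total = sum(spam_counts.values())
    v = len(vocab)
    # Add-one smoothing keeps unseen tokens from producing log(0).
    return {
        tok: math.log((ref_counts[tok] + 1) / (ref_total + v))
        - math.log((spam_counts[tok] + 1) / (spam_total + v))
        for tok in vocab
    }


def quality_score(weights, text):
    """Average log-odds over the tokens of a document (0.0 if it is empty)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)


def filter_corpus(weights, docs, threshold=0.0):
    """Keep only documents whose score is above the (hypothetical) threshold."""
    return [d for d in docs if quality_score(weights, d) > threshold]
```

In use, `train` would be fitted on a small hand-picked set of reference and spam samples, and `filter_corpus` would then be applied to the crawled text. A production system would rely on a far stronger classifier, but the keep-or-drop decision has the same shape.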
To build NOOR, researchers at TII designed an end-to-end pipeline for the collection of data, including crawling, filtering, and curation at scale. TII’s specialists also built services for extreme-scale distributed training and serving to deliver applications with efficient inference and model specialisation. NOOR’s training dataset is the world’s largest cross-domain Arabic dataset, combining web data with books, poetry, news articles, and technical information to significantly widen the applicability of the model.
“The uniquely large Arabic dataset collected to train the model is the result of months of work that included curating, scraping, and filtering of varied sources,” said Dr Ebtesam Almazrouei, Director, AI Cross-Center Unit, TII.
Leveraging 3D parallelism, NOOR was trained on 128 A100 GPUs, which allowed computations to be distributed and the available hardware to be used efficiently. The model is named for the Arabic word for “light”, a name chosen to reflect the idea that an Arabic language model can enlighten the mind. It represents the United Arab Emirates’ global contribution to advanced technology and artificial intelligence.
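3D parallelism generally means splitting the training job along three axes at once: data parallelism (replicas of the model see different batches), pipeline parallelism (layers are divided into stages), and tensor parallelism (individual layers are divided across GPUs). The sketch below shows how 128 GPUs could be arranged on such a grid. The 4 x 4 x 8 split and the rank ordering are assumptions chosen for illustration, since the actual configuration used for NOOR is not stated.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int      # which model replica this GPU belongs to
    pipeline: int  # which pipeline stage (group of layers) it holds
    tensor: int    # which shard of each layer it holds


def rank_to_coord(rank, dp, pp, tp):
    """Map a flat GPU rank to its position on a dp x pp x tp grid.

    The tensor index varies fastest, so GPUs sharing a tensor-parallel
    group have adjacent ranks (an assumed, though common, layout).
    """
    if not 0 <= rank < dp * pp * tp:
        raise ValueError(f"rank {rank} is outside a grid of {dp * pp * tp} GPUs")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data, pipeline, tensor)


WORLD_SIZE = 128         # GPU count reported for NOOR's training
DP, PP, TP = 4, 4, 8     # hypothetical split; any factorisation of 128 works
assert DP * PP * TP == WORLD_SIZE

if __name__ == "__main__":
    for rank in (0, 7, 8, 32, 127):
        print(rank, rank_to_coord(rank, DP, PP, TP))
```

Whatever split is chosen, the three degrees must multiply to the total GPU count, which is why the choice is usually tuned to the model size and the number of GPUs per machine.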