Can Project BigScience Leverage Large Language Models?


LLMs have recently captured the attention of entrepreneurs and tech enthusiasts alike, but the cost of the hardware required to develop them has put them largely out of reach of many research labs.

Today’s best-performing AI and NLP models are concentrated in the hands of technology giants. The hold these conglomerates have on this transformative technology – from deciding which research is shared to its environmental and ethical impacts – poses several problems. Private and restricted access makes it impossible to answer essential questions about these models, such as their capabilities, limitations, potential improvements, bias, and fairness.

Recently, a group of more than 500 researchers from 45 countries came together to tackle some of these problems. The project, known as BigScience, aims to improve the scientific understanding of the capabilities and limitations of large-scale neural network models in NLP, and to create a diverse, multilingual dataset and a large-scale language model as research artefacts open to the scientific community.

What Exactly Is BigScience?

BigScience is an international project started a year ago by New York-based natural language processing startup Hugging Face, with more than 500 researchers working together to better understand and improve the quality of large natural language models. Large language models, or LLMs, are algorithms that can recognise, predict, and generate language based on text datasets. LLMs have recently captured the attention of entrepreneurs and tech enthusiasts alike, but the cost of the hardware required to develop them has put them largely out of reach of research labs that lack the resources of major players like OpenAI and DeepMind.

The goal of BigScience is to create LLMs and large text datasets that will eventually be open-sourced to the broader AI community. The models will be trained on France’s Jean Zay supercomputer, one of the most powerful machines in the world.

Current Hurdles

Like all language models, LLMs learn how likely words are to occur based on examples of text. Simpler models look only at the context of a short sequence of words, whereas larger models operate at the level of whole sentences or paragraphs. Examples come in the form of text in the training datasets, which contain terabytes to petabytes of data scraped from social media, books, software hosting platforms like GitHub, and other sources on the public web.
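The idea of learning word likelihoods from examples can be illustrated with a minimal bigram model – a toy sketch of the statistical principle, not how production LLMs are actually implemented:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows another in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probability(counts, prev, word):
    """Estimate P(word | prev) from the observed counts."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

corpus = ["the cat sat on the mat", "the cat ate the fish"]
counts = train_bigram(corpus)
# "cat" follows "the" in 2 of the 4 observed bigrams starting with "the"
print(next_word_probability(counts, "the", "cat"))  # 0.5
```

Modern LLMs replace these raw counts with billions of learned neural network parameters, which is what lets them condition on whole paragraphs rather than a single preceding word.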

Although a simple model can be trained on commodity hardware, the hurdles to deploying state-of-the-art LLMs are significant. Models like Nvidia and Microsoft’s Megatron-Turing NLG 530B can cost millions of dollars just to train from scratch, before accounting for the expense of storing the model. Inference – actually running the trained model – is another barrier. According to recent reports, running GPT-3 on a single Amazon Web Services instance costs an estimated minimum of $87,000 per year. BigScience plans to work in a broader scope, aiming not only to train and release LLMs but to address some of their significant technical shortcomings.
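A back-of-envelope calculation shows why even holding such a model in memory is a barrier. The figures below are illustrative assumptions (16-bit weights, an 80 GB accelerator, and weights only – no optimizer state or activations), not published hardware requirements:

```python
# Rough memory footprint of the weights of a 530-billion-parameter model.
params = 530e9           # parameter count at Megatron-Turing NLG 530B scale
bytes_per_param = 2      # assuming 16-bit (fp16) weights
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB of weights")  # ~1060 GB, far beyond any single GPU

gpu_memory_gb = 80       # a single high-end accelerator with 80 GB of memory
gpus_needed = weights_gb / gpu_memory_gb
print(f"at least {gpus_needed:.0f} such GPUs just to hold the weights")
```

Serving a model split across a dozen or more accelerators around the clock is what drives yearly inference bills into the tens of thousands of dollars.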

Leveraging The LLM Space Through Equality

The Hugging Face collaborative aims to create a dataset and LLMs as tools for research, while fostering numerous discussions on the social impact of LLMs. A steering committee provides BigScience members with scientific and general advice, while the organisation committee designs tasks and organises workshops, hackathons, and public events.

Different working groups within BigScience’s organisation committee are charged with tackling challenges like data governance, archival strategies, evaluation fairness, bias, and social impact. Their goal is to make the training datasets mentioned above diverse and representative.

BigScience claims to have already produced a catalogue of nearly 200 language resources distributed worldwide. Contributors have also created one of the largest public natural language catalogues for Arabic, called Masader, with over 200 datasets.

Current Work

BigScience has only just started developing its LLMs, but its current work shows promise. Using TPU Research Cloud credits and several hours of computing time on the Jean Zay supercomputer, BigScience researchers trained and evaluated a model called T0 (“T5 for zero-shot”) that outperforms GPT-3 on several English-language benchmarks despite being 16 times smaller. The most capable version, dubbed T0++, can perform tasks it has not been explicitly trained on, such as generating cooking instructions for recipes and answering questions about religion, human ageing, machine learning, and ethics.
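T0’s zero-shot ability comes from training on many existing tasks rephrased as natural-language prompts, so an unseen task can be posed to the model the same way. A minimal sketch of that idea follows; the templates and function name here are illustrative, not BigScience’s actual prompt collection:

```python
def to_prompt(task: str, text: str) -> str:
    """Rephrase a task as a natural-language instruction, in the style of T0."""
    templates = {
        "sentiment": "Is the following review positive or negative? {text}",
        "summarise": "Summarise the following article in one sentence: {text}",
        "recipe": "Write cooking instructions for this dish: {text}",
    }
    return templates[task].format(text=text)

prompt = to_prompt("sentiment", "The film was a delight from start to finish.")
print(prompt)
# The model then answers by generating free text from the prompt alone,
# with no task-specific output layer or fine-tuning for the new task.
```

Because every task shares this instruction-plus-generation format, a model trained on enough prompted tasks can generalise to instructions it has never seen, which is what “zero-shot” refers to here.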

While T0 was trained on a range of publicly available, English-only datasets, future models will build on the learnings of BigScience’s data-focused working groups. Still, much work remains. BigScience researchers found that T0++ can generate conspiracy theories and exhibits gender bias. The next development phase will involve experiments with a model containing 104 billion parameters – more than half the parameter count of GPT-3 – as a stepping stone toward the final step in BigScience’s NLP roadmap: training a multilingual model with up to 200 billion parameters.

Max Ryabinin, a research scientist at Yandex contributing to BigScience’s model design work, says one of the biggest engineering challenges is keeping BigScience’s large-scale language model training experiments stable. While smaller models can be trained without “significant issues,” beyond the 10-billion-parameter scale the process becomes much less predictable.

BigScience’s work could give smaller enterprises the means to leverage LLMs and help spur a new wave of AI-powered products. Language models have become a vital tool in industries such as health care and financial services, where they are used to process patents, derive insights from scientific papers, recommend news articles, and much more. Yet smaller organisations have increasingly been left out of these cutting-edge advances. And for all their potential to do harm, LLMs still struggle with the basics, often breaking semantic rules and endlessly repeating themselves.

However, with this work, BigScience promises to help solve some of the biggest and most troubling issues with LLMs today. It shows that AI is at a turning point: the field can either continue in a proprietary direction, or large-scale state-of-the-art models can be developed in an open, collaborative, community-oriented direction that combines the best aspects of open source and open science.
