IBM’s AI research division has released a 14-million-sample dataset for developing machine learning models that can help with programming tasks. Called Project CodeNet, the dataset takes its name from ImageNet, the famous repository of labelled photos that triggered a revolution in computer vision and deep learning.
While there’s scant chance that machine learning models built on the CodeNet dataset will make human programmers redundant, there’s reason to be hopeful that they will make developers more productive.
Automating programming with deep learning
In the early 2010s, impressive advances in machine learning triggered excitement (and fear) about artificial intelligence soon automating many tasks, including programming. But AI’s penetration in software development has been extremely limited.
Human programmers discover new problems and explore different solutions using a plethora of conscious and subconscious thinking mechanisms. In contrast, most machine learning algorithms require well-defined problems and a lot of annotated data to develop models that can solve the same problems.
There have been many efforts to create datasets and benchmarks to develop and evaluate “AI for code” systems. But given the creative and open nature of software development, it’s very hard to create the perfect dataset for programming.
With Project CodeNet, the researchers at IBM have tried to create a multi-purpose dataset that can be used to train machine learning models for various tasks. CodeNet’s creators describe it as a “very large scale, diverse, and high-quality dataset to accelerate the algorithmic advances in AI for Code.”
The dataset contains 14 million code samples with 500 million lines of code written in 55 different programming languages. The code samples have been obtained from submissions to nearly 4,000 challenges posted on online coding platforms AIZU and AtCoder. The code samples include both correct and incorrect answers to the challenges.
One of the key features of CodeNet is the amount of annotation that has been added to the examples. Every one of the coding challenges included in the dataset has a textual description along with CPU time and memory limits. Every code submission has a dozen pieces of information, including the language, the date of submission, size, execution time, acceptance, and error types.
The researchers at IBM have also gone through great effort to make sure the dataset is balanced along different dimensions, including programming language, acceptance, and error types.
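To make the annotation concrete, here is a minimal Python sketch of what a submission record and a simple balance check might look like. The field names and sample values are hypothetical and chosen for illustration; they are not CodeNet’s actual metadata schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # Hypothetical field names; CodeNet's real metadata columns may differ.
    problem_id: str
    language: str
    status: str           # "Accepted" or an error type such as "Wrong Answer"
    cpu_time_ms: int
    code_size_bytes: int


def status_counts_by_language(submissions):
    """Count how many submissions fall under each (language, status) pair."""
    return Counter((s.language, s.status) for s in submissions)


# Made-up sample records, only to show the shape of the data.
sample = [
    Submission("p001", "Python", "Accepted", 20, 310),
    Submission("p001", "Python", "Wrong Answer", 18, 295),
    Submission("p001", "C++", "Accepted", 3, 540),
    Submission("p002", "C++", "Runtime Error", 1, 610),
]

counts = status_counts_by_language(sample)
for (language, status), n in sorted(counts.items()):
    print(f"{language:8} {status:14} {n}")
```

A tally like this is one simple way to see how evenly a dataset is spread across languages and outcomes before training on it.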
Programming tasks for machine learning
CodeNet is not the only dataset to train machine learning models for programming tasks. But a few characteristics make it stand out. First is the sheer size of the dataset, including the number of samples and the diversity of the languages.
But perhaps more important is the metadata that goes with the coding samples. The rich annotations added to CodeNet make it suitable for a diverse set of tasks as opposed to other coding datasets that are specialized for specific programming tasks.
There are several ways CodeNet can be used to develop machine learning models for programming tasks. One is language translation. Since each coding challenge in the dataset contains submissions in various programming languages, data scientists can use it to create machine learning models that translate code from one language to another. This can be handy for organisations that want to port old code to new languages, making it accessible to newer generations of programmers and maintainable with new development tools.
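As a rough illustration of the translation idea, the Python sketch below pairs accepted solutions to the same challenge across two languages. The tuple layout, status strings and sample code are assumptions made for this example, not CodeNet’s real format, and the pairing step is only one possible way to prepare such data.

```python
from collections import defaultdict
from itertools import product


def translation_pairs(submissions, source_lang, target_lang):
    """Pair accepted solutions to the same problem across two languages.

    `submissions` is an iterable of (problem_id, language, status, code)
    tuples; this layout is a hypothetical simplification.
    """
    by_problem = defaultdict(lambda: defaultdict(list))
    for problem_id, language, status, code in submissions:
        if status == "Accepted":  # skip incorrect answers
            by_problem[problem_id][language].append(code)

    pairs = []
    for langs in by_problem.values():
        # Every source-language solution is paired with every
        # target-language solution to the same problem.
        pairs.extend(product(langs.get(source_lang, []),
                             langs.get(target_lang, [])))
    return pairs


# Made-up sample submissions.
submissions = [
    ("p001", "Python", "Accepted", "print(sum(map(int, input().split())))"),
    ("p001", "C++", "Accepted", "/* C++ solution to p001 */"),
    ("p001", "C++", "Wrong Answer", "/* buggy C++ attempt */"),
    ("p002", "Python", "Accepted", "print(input()[::-1])"),
]

pairs = translation_pairs(submissions, "Python", "C++")
```

Pairs built this way solve the same problem but are not aligned line by line, so a translation model trained on them would have to learn the correspondence at the level of whole programs.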
CodeNet can also help to develop machine learning models for code recommendation. Recommendation tools could range from simple autocomplete-style models that finish the current line of code to more complex systems that write full functions or blocks of code.
Since CodeNet has a wealth of metadata about memory and execution-time metrics, data scientists can also use it to develop code optimisation systems. Or they can use the error-type metadata to train machine learning systems that flag potential flaws in source code.
A more advanced use case that would be interesting to see is code generation. CodeNet is a rich library of textual descriptions of problems and their corresponding source code. There have already been several examples of developers using advanced language models such as GPT-3 to generate code from natural language descriptions. It will be interesting to see whether CodeNet can help fine-tune these language models to become more consistent in code generation.
The researchers at IBM have already conducted several experiments with CodeNet, including code classification, code similarity evaluation, and code completion. The deep learning architectures they used include simple multi-layer perceptrons, convolutional neural networks, graph neural networks, and Transformers. The results, reported in a paper that details Project CodeNet, show that they have been able to obtain above 90 per cent accuracy in most tasks.
(With inputs from agencies)