IBM Project CodeNet is unprecedented. From code search and clone detection to automatic code correction and solving the legacy code problem, it has abundant potential use cases.
We have long been fascinated by the possibility of computers programming computers. Can Artificial Intelligence (AI) make it easier to understand, develop, and deploy code, the language of machines? AI-powered models probably won’t make human programmers redundant, at least for now, but there’s reason to believe that could change in the near future.
The advances in AI since the rise of deep learning have ushered in a new age of automation, but AI’s penetration in software development has been limited. Given the creative and open nature of software development, it’s very hard to create the perfect dataset for programming.
But software is everywhere. The software-defined-everything model now spans everything from financial services to smartphones. Google services combine for 2 billion lines of code, and a modern vehicle contains around 100 million lines of code. However, such large volumes of code are a monumental challenge to debug, maintain, and update. According to a study from the University of Cambridge’s Judge Business School, programmers spend over 50 per cent of their work time on debugging, and the total estimated cost of debugging is $312 billion per year. AI-powered coding tools could cut development costs substantially while enabling coders to focus on more creative, less repetitive tasks.
AI for Code
There are many efforts to use AI to create new solutions that can modernise processes across the IT pipeline. For the past two and a half years, a team from IBM Research and the MIT-IBM Watson AI Lab has worked on a massive, code-intensive project called AI for Code. AI for Code leverages technologies like NLP and augments them with code analysis and compilation techniques to perform a myriad of practical tasks, such as code search and code-to-code translation. It is helping software developers improve their productivity by automating the software engineering process.
Ruchir Puri, IBM Research’s chief research scientist, discussed in a recent podcast how technologies from AI for Code are being used to modernise legacy software by helping migrate monolithic applications to microservices for IBM’s enterprise clients.
Enter Project CodeNet.
Announced at IBM’s recent Think conference, Project CodeNet is a large dataset that aims to help teach AI how to understand and even write code. The tech giant claims it is the largest open-source dataset for code.
It is similar to ImageNet, a huge image dataset that had a dramatic impact on computer vision research. Project CodeNet consists of some 14 million code samples, each of which is an intended solution to one of 4000 coding problems. Together they amount to about 500 million lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.
Project CodeNet aims to do for AI for Code what ImageNet did for computer vision.
“Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision,” says IBM.
Take OpenAI’s latest language-generating model, GPT-3, as another example. It shows how adept AI is becoming at penning the languages of humans. But when it comes to machines writing their own native code, a human expert is still required. CodeNet wants to change that.
While GPT-3 can increase human productivity by providing a basic starting point, it still requires help from humans to iron out errors and make up for creativity and emotion. CodeNet, by improving an AI’s own understanding of how to write and check code, should lead to better tools that help humans do both faster.
Potential use cases of Project CodeNet
CodeNet can drive algorithmic innovation in extracting the context of code with sequence-to-sequence models, making a more significant dent in machine understanding of code as opposed to machine processing of code. Its code samples are curated from open programming competitions held over the years, and they come with high-quality metadata and annotations carrying a rich set of information: code size, memory footprint, CPU run time, and status, which indicates acceptance or the type of error. In this respect, Project CodeNet is unprecedented.
Solving the legacy code problem
The dataset is constructed in a manner that enables bidirectional translation. This means it can be used to take legacy COBOL code, which is still used in many business applications and banking transactions in many countries, and translate it into Java as easily as a snippet of Java can be translated back into COBOL. Manually upgrading COBOL programs is a significant undertaking in human effort, money, and scope. For example, the Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java. In such cases, Project CodeNet could help.
Given its wealth of programs written in a multitude of languages, Project CodeNet may serve as a valuable benchmark dataset for source-to-source translation. It can be used to develop machine learning models for programming tasks. Since each coding challenge in the dataset contains submissions in various programming languages, data scientists can use it to create machine learning models that translate code from one language to another.
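To make the idea concrete, here is a minimal sketch of how training pairs for a translation model might be assembled by matching accepted submissions to the same problem across two languages. The records, field names, and status labels below are hypothetical stand-ins for illustration, not the dataset’s actual schema.

```python
from collections import defaultdict
from itertools import product

# Hypothetical records: field names and values are illustrative only,
# not the dataset's actual schema.
SUBMISSIONS = [
    {"id": "s1", "problem": "p1", "language": "COBOL", "status": "Accepted"},
    {"id": "s2", "problem": "p1", "language": "Java", "status": "Accepted"},
    {"id": "s3", "problem": "p1", "language": "Java", "status": "Wrong Answer"},
    {"id": "s4", "problem": "p2", "language": "COBOL", "status": "Accepted"},
    {"id": "s5", "problem": "p2", "language": "Python", "status": "Accepted"},
]


def translation_pairs(submissions, source_lang, target_lang):
    """Pair accepted submissions that solve the same problem in two languages."""
    # problem -> language -> list of accepted submission ids
    by_problem = defaultdict(lambda: defaultdict(list))
    for sub in submissions:
        if sub["status"] == "Accepted":
            by_problem[sub["problem"]][sub["language"]].append(sub["id"])

    pairs = []
    for problem in sorted(by_problem):
        langs = by_problem[problem]
        # Every accepted source-language solution is paired with every
        # accepted target-language solution to the same problem.
        pairs.extend(product(langs.get(source_lang, []), langs.get(target_lang, [])))
    return pairs


cobol_to_java = translation_pairs(SUBMISSIONS, "COBOL", "Java")
```

Swapping the two language arguments yields pairs in the opposite direction, which is what makes this kind of problem-centred layout useful for bidirectional translation.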
According to IBM, this can be useful for organisations that want to port old code to new languages and make it accessible to newer generations of programmers and maintainable with new development tools.
According to GitHub, the problem-submission relationship in CodeNet can be used for code search and clone detection. The code samples are labelled with their acceptance status, and AI techniques can be explored to distinguish correct code from problematic code. Its metadata also enables the tracking of how a submission evolves from problematic to accepted, which could be used for exploring automatic code correction.
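As a rough sketch of how that evolution from problematic to accepted could be mined, the snippet below pairs each rejected attempt with the next accepted attempt by the same user on the same problem. The records, the `order` field, and the status strings are assumptions made for illustration; the real metadata may be organised differently.

```python
from collections import defaultdict

# Hypothetical submission history: field names, the "order" field, and the
# status strings are illustrative assumptions, not the dataset's schema.
HISTORY = [
    {"id": "a1", "user": "u1", "problem": "p1", "order": 1, "status": "Wrong Answer"},
    {"id": "a2", "user": "u1", "problem": "p1", "order": 2, "status": "Runtime Error"},
    {"id": "a3", "user": "u1", "problem": "p1", "order": 3, "status": "Accepted"},
    {"id": "b1", "user": "u2", "problem": "p1", "order": 1, "status": "Accepted"},
    {"id": "c1", "user": "u3", "problem": "p2", "order": 1, "status": "Time Limit Exceeded"},
]


def correction_pairs(history):
    """Return (rejected_id, accepted_id, error_type) triples per user and problem."""
    attempts = defaultdict(list)
    for sub in history:
        attempts[(sub["user"], sub["problem"])].append(sub)

    pairs = []
    for key in sorted(attempts):
        pending = []  # rejected attempts not yet followed by an accepted one
        for sub in sorted(attempts[key], key=lambda s: s["order"]):
            if sub["status"] == "Accepted":
                pairs.extend((bad["id"], sub["id"], bad["status"]) for bad in pending)
                pending = []
            else:
                pending.append(sub)
    return pairs


fixes = correction_pairs(HISTORY)
```

Each triple gives a "before" and "after" version of the same solution plus the error it had, which is the kind of signal a code-correction model would need. Attempts that never reach an accepted state are left out.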
Regression studies and prediction
The rich annotations added to CodeNet make it suitable for a diverse set of tasks as opposed to other coding datasets that are specialised for specific programming tasks. Since each code sample is labelled with CPU run time and memory footprint, it can be used for regression studies and prediction.
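A toy illustration of such a regression study follows. It fits a straight line by ordinary least squares to predict CPU run time from code size. The numbers are invented, and using code size as the only feature is a simplifying assumption; a real study would more likely derive features from the code itself.

```python
# Hypothetical samples: (code size in bytes, CPU run time in milliseconds).
# The values are made up for illustration.
SAMPLES = [(100, 30), (200, 50), (300, 70), (400, 90)]


def fit_line(samples):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, code_size):
    """Estimate CPU run time for a given code size using a fitted line."""
    slope, intercept = model
    return slope * code_size + intercept


model = fit_line(SAMPLES)
estimate = predict(model, 250)
```

The same pattern applies to memory footprint: replace the second value in each sample with the memory figure and refit.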
Code recommendation and optimisation
CodeNet can also help to develop machine learning models for code recommendation. Recommendation tools could range from simple autocomplete-style models that finish the current line of code to more complex systems that write full functions or blocks of code.
Since CodeNet has a wealth of metadata about memory and execution-time metrics, data scientists can also use it to develop code optimisation systems. Or they can use the error-type metadata to train machine learning systems that flag potential flaws in source code.
A more advanced use case that would be interesting to see is code generation. CodeNet is a rich library of textual descriptions of problems and their corresponding source code. There have already been several examples of developers using advanced language models such as GPT-3 to generate code from natural language descriptions. It will be interesting to see whether CodeNet can help fine-tune these language models to become more consistent in code generation.
Early success of CodeNet
The IBM website highlights an early example of how CodeNet was used to modernise legacy code: “For example, a large automotive client approached IBM to help update a $200 million asset consisting of 3,500, multi-generation Java files. These files consisted of more than one million lines of code, developed over a decade with multiple generations of Java technology.
It was a complex monolithic application code, not conducive to cloud-based environments. By applying our AI for Code stack, we reduced the business’s year-long ongoing code migration process down to just four weeks, modernising and generating over 25 new cloud-native micro services by refactoring the legacy monolithic application code.”
That example is likely to be the first of many in the years to come in which code migration is greatly sped up and improved.