Researchers from Brown University and MIT have developed a new data science framework called Tuplex that allows users to process data queries written in Python up to 90 times faster than data systems like Apache Spark or Dask. They have made the software freely available to all.
Platforms like Spark perform data analytics by distributing tasks across multiple processor cores or machines in a data centre. That parallel processing allows users to deal with giant data sets that would choke a single computer to death. Users interact with these platforms by inputting their own queries, which contain custom logic written as user-defined functions (UDFs).
Because it’s user-friendly, Python is the language of choice for creating UDFs. In fact, a recent KDNuggets poll found that 66 per cent of data platform users use Python as their primary language.
Despite its popularity, Python is not particularly fast, says Leonhard Spiegelberg, a PhD student at Brown University and the lead developer of Tuplex. “Using Python UDFs is incredibly inefficient,” Spiegelberg says in a video about Tuplex posted to YouTube.
There are several reasons why Python UDFs are slow, Spiegelberg says in his presentation. The number one reason is that Python UDF code must be executed by a stack-based bytecode interpreter rather than running as native machine code.
The challenge is that analytics platforms have trouble dealing with bits of Python code efficiently. Data platforms are written in high-level computer languages that are compiled before running. Compilers are programs that take source code and turn it into machine code that a computer processor can quickly execute. Python, however, is not compiled to machine code beforehand. Instead, an interpreter steps through the Python code while the program runs, which can mean far slower performance.
These frameworks have to break out of their efficient execution of compiled code and jump into a Python interpreter to execute Python UDFs.
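To make the "stack-based bytecode" point concrete, Python's standard `dis` module can list the instructions the interpreter steps through for a small UDF. This is only an illustration: the function `fahrenheit_to_celsius` is a hypothetical example, not one of the Tuplex benchmark UDFs, and the exact instruction names vary between Python versions.

```python
import dis

def fahrenheit_to_celsius(f):
    # A tiny UDF of the kind a user might pass to a map() operation
    return (f - 32.0) * 5.0 / 9.0

# Collect the bytecode instructions the CPython interpreter works through
# each time the function is called
instructions = list(dis.get_instructions(fahrenheit_to_celsius))
opnames = [ins.opname for ins in instructions]

# Print one line per instruction: operation name and its argument, if any
for ins in instructions:
    print(ins.opname, ins.argrepr)
```

Each printed line is one step that pushes values onto, or pops them off, the interpreter's stack. A compiled language would instead reduce the same arithmetic to a handful of machine instructions.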
Researchers have tried for years to create a general-purpose Python compiler, but they have not been successful.
So now, instead of trying to make a general Python compiler, the Brown University and MIT researchers designed Tuplex to compile a highly specialised program for the specific query and common-case input data.
Tuplex represents a different approach to the performance problem. Instead of building a general-purpose system, the researchers created a domain-specific approach to speed up specific Python UDFs.
Tuplex works on Python UDFs that share a certain commonality. For everything else that doesn’t fit into Tuplex’s narrow use case or that returns an error, Tuplex falls back to the standard Python interpreter. The two paths run in parallel, and their outputs are merged into the final result.
Uncommon input data, which account for only a small percentage of instances, are separated and referred to an interpreter. This simplifies the compilation problem, because the compiled code only needs to handle a single set of data types and common-case assumptions. This way, users get the best of both worlds: high productivity and fast execution speed.
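The split between a common-case path and a fallback path can be sketched in plain Python. This is a minimal sketch under stated assumptions, not Tuplex's implementation: the names `infer_common_type`, `fast_path`, `general_path` and `run_dual_mode` are hypothetical, the "fast path" is ordinary Python standing in for compiled code, and the two paths run one after the other rather than in parallel.

```python
from collections import Counter

def infer_common_type(sample):
    """Return the most frequent Python type among the sampled rows."""
    return Counter(type(row) for row in sample).most_common(1)[0][0]

def fast_path(row):
    # Stand-in for specialised compiled code: assumes the row is an int
    return row * 2

def general_path(row):
    # Stand-in for the interpreter fallback: also copes with strings and None
    if row is None:
        return None
    return int(row) * 2

def run_dual_mode(rows, sample_size=100):
    if not rows:
        return [], []
    # Sample the input to decide what the "common case" looks like
    common_type = infer_common_type(rows[:sample_size])
    results = [None] * len(rows)
    slow_indices = []
    # First pass: common-case rows take the fast path, the rest are set aside
    for i, row in enumerate(rows):
        if type(row) is common_type:
            results[i] = fast_path(row)
        else:
            slow_indices.append(i)
    # Second pass: uncommon rows go through the general path,
    # and their results are merged back into their original positions
    for i in slow_indices:
        results[i] = general_path(rows[i])
    return results, slow_indices

rows = [1, 2, "3", 4, None, 6]
results, slow_indices = run_dual_mode(rows)
print(results)
print(slow_indices)
```

In this toy input, four of the six rows are integers, so only the string and the missing value take the slower route. The sketch assumes the common case is `int`; a real system would generate the fast path from whatever types the sample reveals.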
Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimised LLVM bytecode for the given pipeline and input data set. Tuplex is based on data-driven compilation and dual-mode processing, two fundamental techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimised C++.
Benchmark tests performed by Spiegelberg and his colleagues show that Tuplex can deliver a big performance increase. The researchers built five standard data science pipelines that include map, join, and filter operations, and ran them against standard Python UDFs using the Spark and Dask frameworks. The researchers used the most up-to-date techniques wherever they could, including Spark SQL functions.
Tuplex returned queries anywhere from three times faster to 38 times faster compared to the hand-tuned Spark and Dask programs, which is a substantial improvement in performance.
In addition to speeding things up, Tuplex also has an innovative way of dealing with anomalous data, the researchers say. Large datasets are often messy, full of corrupted records or data fields that don’t follow convention.
Inconsistencies can crash some data platforms: for example, some records may be entered as numerals while in others the numbers are spelt out. Tuplex instead extracts those anomalies and sets them aside to avoid a crash. Once the program has run, the user can repair those anomalies. This could have a significant productivity impact for data scientists.
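The idea of setting anomalies aside and repairing them afterwards can be illustrated with a short plain-Python sketch. It does not use the Tuplex API; the helpers `parse_count`, `run_with_quarantine` and `repair`, and the `WORDS` lookup table, are hypothetical names introduced for illustration.

```python
# Hypothetical lookup used to repair numbers that were spelt out
WORDS = {"one": 1, "two": 2, "three": 3}

def parse_count(text):
    # Normal-case UDF: expects digits such as "3"
    return int(text)

def run_with_quarantine(records, udf):
    """Apply udf to every record; collect failures instead of crashing."""
    good, quarantined = [], []
    for index, record in enumerate(records):
        try:
            good.append((index, udf(record)))
        except Exception as exc:
            # Keep the row, its position and the error type for later repair
            quarantined.append((index, record, type(exc).__name__))
    return good, quarantined

def repair(record):
    # Repair step written by the user after inspecting the quarantined rows
    return WORDS[record.strip().lower()]

records = ["1", "2", "three", "4", "n/a"]
good, quarantined = run_with_quarantine(records, parse_count)

# After the main run has finished, try to repair what was set aside
repaired, still_bad = [], []
for index, record, _error in quarantined:
    try:
        repaired.append((index, repair(record)))
    except KeyError:
        still_bad.append((index, record))

# Merge repaired rows back in their original order
merged = sorted(good + repaired)
print(merged)
print(still_bad)
```

The run completes even though two of the five records are malformed. The spelt-out number is recovered in the repair pass, while the unrecoverable "n/a" record remains listed for the user to inspect.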
The code is open source and can be downloaded at tuplex.cs.brown.edu/gettingstarted.html.