Breaking Down Big Data for Scalable, Ethical, and Accurate Data Platforms



Explore the future of data platforms with insights from Professor Shai Dekel, the Active Chairman of a leading NanoTasking data platform. He delves into the evolving landscape of data processing and the platform's pivotal role in shaping this transformation. The platform addresses the challenges posed by the rapid advancement of Large Language Models (LLMs), such as ChatGPT, and ensures robust commercial deployment in domains like customer service and marketing intelligence.

Learn about the platform's innovative methodologies for handling large-scale data processing, optimising performance, and managing diverse data types, from structured to unstructured and semi-structured. Understand its approach to maintaining data accuracy, completeness, and consistency, especially in high-volume and real-time data streams.

Professor Dekel sheds light on how the platform adapts to evolving hardware and software environments, ensuring scalability and seamless integration with IT infrastructures and other data-related tools. Explore industry-specific use cases where the NanoTasking approach demonstrates significant advantages over traditional data processing techniques, offering a glimpse into the platform's diverse applications.

Excerpts from the interview:

How do you foresee the future of data platforms and your platform's role in shaping it?

The exciting capabilities of ChatGPT, released by OpenAI in November 2022, unleashed a wave of rapid advancement in Large Language Models (LLMs). Throughout 2023, OpenAI released GPT-4, Google released a new version of Bard, Meta released open-source versions of LLaMA, TII released the open-source Falcon model, and so on. These models range in size from tens to hundreds of billions of parameters and have been trained on huge corpora of data, mostly composed of carefully pre-filtered and pre-processed web pages.

While the conversational performance of these models is remarkable, it is difficult to deploy them as robust commercial solutions in domains such as customer service, e-commerce, marketing intelligence, etc. As enterprises are beginning to grasp, two key elements are required for the successful commercial deployment of an LLM: carefully integrating the enterprise's data and guiding the model to achieve high 'brand accuracy'. The company provides an advanced model guidance platform that can be easily configured to teach models to operate within the enterprise's safety rails and regulatory guidelines, ensure privacy, avoid bias, and provide fairness.

How does the platform ensure its powerful solutions align with ethical and regulatory standards?

The guidance platform supports sophisticated machine/human training data creation, feedback, and verification workflows. Complex problems are broken down into smaller nano tasks, and crowds of targeted labellers, whose knowledge and skillset are tested to align with the brand, are onboarded to provide statistically significant guidance. In difficult cases, or where the crowd statistically disagrees, the guidance platform automatically routes the tasks to domain experts.
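The routing logic described above can be sketched as a simple aggregation rule: accept the crowd's majority label when agreement is strong, and escalate to an expert otherwise. This is a minimal illustration, not the platform's actual API; the function name and the 80% threshold are assumptions.

```python
from collections import Counter

def resolve_nano_task(labels, agreement_threshold=0.8):
    """Aggregate crowd labels for one nano task.

    Returns the majority label when crowd agreement clears the
    threshold; otherwise flags the task for routing to a domain
    expert. (Names and threshold are illustrative, not the
    platform's real API.)
    """
    counts = Counter(labels)
    top_label, top_votes = counts.most_common(1)[0]
    if top_votes / len(labels) >= agreement_threshold:
        return {"label": top_label, "route_to_expert": False}
    return {"label": None, "route_to_expert": True}
```

A resolved expert label can then feed back into both the human onboarding tests and the machine training set, as the answer notes.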

Another important capability in selecting contributors is data collection and creation. The platform allows the onboarding of targeted contributors to create certain types of prompts, and then onboards other labellers to verify the validity of those prompts. In data creation, 'red-teaming' is a critical component. 'Red-teaming' is a term borrowed from the cyber-security world: it means attacking the model with prompts corresponding to edge cases or clear policy violations. The platform onboards users who are asked to challenge the model; other users then rate whether the model responds to these deviations clearly and ethically, reinforcing the established safety rails within the LLM. Red-teaming at scale using a global crowd, with adequate rewards for successful attacks, can be very effective and ensures the model does not violate any geolocation-specific or cultural safety rails. Finally, the guidance platform provides real-time verification of models deployed in production using hybrid machine/human resources, leveraging an unbiased crowd at scale.
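The red-teaming reward loop described above can be sketched as two steps: independent raters flag whether a model response violated policy, and contributors whose attack prompts drew majority-flagged violations earn a reward. This is a hypothetical sketch; the field names and the majority rule are assumptions, not the platform's schema.

```python
def attack_succeeded(ratings, violation_threshold=0.5):
    """Decide whether a red-team prompt succeeded.

    `ratings` holds one entry per independent rater: 1 if the
    rater judged the model's response a policy violation, 0 if
    the response stayed within the safety rails. An attack counts
    as successful when a majority of raters flag a violation.
    (Illustrative rule, not the platform's real logic.)
    """
    return sum(ratings) / len(ratings) > violation_threshold

def reward_recipients(attacks):
    """Return contributors owed a reward for successful attacks.

    Each attack is a dict with hypothetical keys 'contributor'
    and 'ratings'.
    """
    return [a["contributor"] for a in attacks
            if attack_succeeded(a["ratings"])]
```

In practice the rater pool itself would be sampled globally, so a "violation" verdict reflects more than one cultural or regulatory viewpoint.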

How does the platform handle large-scale data processing and optimise performance?

The platform can be configured to create guidance workflows that combine AI and human model guidance at scale. It allows the quick onboarding of huge crowds of labellers, each targeted and tested to answer a specific nano question. If needed, the crowd can be targeted to a specific geolocation or language, or sampled globally in cases where bias needs to be minimised. This unique approach to large-scale crowd onboarding means that model guidance can be scaled up and down quickly, without the need to hire and manage large teams.

How does the platform manage diverse data types?

The ability to break up a complex labelling task or model guidance problem into nano tasks allows the management and processing of diverse data types. When the data is composed of mixed media (video, images, and text), the nano task approach means each crowd member reviews only a short clip of the video, a patch of an image, or a paragraph of the text. These low-level outputs are then combined for further high-level verification. In certain cases, machine learning models process some of the nano tasks, either given full autonomy over the nano task or configured to cast their vote alongside human labellers.
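Two mechanics from this answer can be sketched briefly: splitting a long text into paragraph-level fragments so no labeller sees the whole document, and letting an ML model cast a weighted vote alongside human labellers. Both functions are illustrative assumptions, not the platform's implementation.

```python
from collections import Counter

def split_into_nano_tasks(document, chunk_size=2):
    """Split a text into small paragraph groups so each labeller
    reviews only a fragment, never the full document.
    (Hypothetical helper; chunking strategy is an assumption.)
    """
    paragraphs = document.split("\n\n")
    return [paragraphs[i:i + chunk_size]
            for i in range(0, len(paragraphs), chunk_size)]

def combined_vote(human_labels, model_label, model_weight=1):
    """Let an ML model cast a (weighted) vote alongside human
    labellers and return the winning label."""
    votes = Counter(human_labels)
    votes[model_label] += model_weight
    return votes.most_common(1)[0][0]
```

Raising `model_weight` shifts trust toward the model; setting it very high approximates the "full autonomy" case the answer mentions.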

How does the platform ensure data accuracy and consistency in high-volume real-time streams?

The platform was built with one goal: accuracy at scale. This is achieved by applying tools from sequential sampling, where each nano task is given to as many labellers as needed until statistical significance is achieved. This is a unique capability, since most labelling platforms struggle to allocate sufficient resources: typically, only one or two contributors label a complete, complex sample that is not broken into nano tasks. Furthermore, in the small percentage of cases where many labellers disagree, the platform automatically routes the challenging nano tasks to a team of experts, who label them correctly and then potentially add them to the human and machine training process.
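The sequential-sampling idea can be made concrete: collect labels one at a time and stop as soon as the leading label's Wilson score lower bound clears 50%, i.e. once we are roughly 95% confident a true majority agrees. The stopping rule, the Wilson interval choice, and the budget are all assumptions for illustration; the platform's actual statistical test is not specified in the interview.

```python
import math
from collections import Counter

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def sequential_label(label_stream, min_bound=0.5, min_labels=3, max_labels=25):
    """Request labels until the leading label is statistically
    significant, or the budget runs out (then escalate to experts).

    Returns (label, labels_used); label is None on escalation.
    """
    labels = []
    for label in label_stream:
        labels.append(label)
        top, votes = Counter(labels).most_common(1)[0]
        if (len(labels) >= min_labels
                and wilson_lower_bound(votes, len(labels)) > min_bound):
            return top, len(labels)
        if len(labels) >= max_labels:
            return None, len(labels)  # route to domain experts
    return None, len(labels)
```

Note how the cost adapts: a unanimous crowd resolves a task after only four labels, while a genuinely ambiguous task exhausts the budget and is escalated, which matches the answer's description of expert routing.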

Share industry-specific use cases where NanoTasking outperforms traditional methods

LLM guidance processes break down naturally into nano tasks: prompt creation, verification of generated prompts, model reinforcement learning, red-teaming via attempts to create malicious prompts, etc.

Another typical example is labelling in the space of computer vision for autonomous driving. Complex, long video streams are subdivided into short clips or even individual frames. Each user is given a specific nano task: identifying cars, pedestrians, crossroads, or road signs, or classifying the road signs already identified. The outputs of all the nano tasks are then assembled into a fully labelled video stream that can be used to train the autonomous driving model.
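The fan-out/fan-in pattern in this example can be sketched as follows: cut a video into short clips, issue one detection nano task per clip per object class, then merge the per-clip results back into a single labelled timeline. The task schema (dicts with `clip` and `detect` keys) and the clip length are hypothetical.

```python
def make_video_nano_tasks(num_frames, clip_len=30,
                          classes=("car", "pedestrian", "road_sign")):
    """Break a video into short clips and fan out one detection
    nano task per clip per object class (hypothetical schema)."""
    tasks = []
    for start in range(0, num_frames, clip_len):
        clip = (start, min(start + clip_len, num_frames))
        for cls in classes:
            tasks.append({"clip": clip, "detect": cls})
    return tasks

def assemble_labels(task_results):
    """Merge per-clip nano-task results back into one labelled
    timeline, keyed and ordered by clip boundaries."""
    timeline = {}
    for result in task_results:
        timeline.setdefault(result["clip"], []).append(result)
    return dict(sorted(timeline.items()))
```

Classification of identified signs would be a second fan-out stage, consuming the detections emitted by the first; each stage stays small enough for a single crowd member to handle.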

How does the platform ensure smooth integration with IT infrastructures and other data tools?

The platform provides flexible interfaces to various incoming and outbound data streams. It supports many forms of data, such as video, website links, and social network posts composed of text and images.