Why data remains the greatest challenge for machine learning projects

Quality data is at the heart of the success of enterprise artificial intelligence (AI). And accordingly, it remains the main source of challenges for companies that want to apply machine learning (ML) in their applications and operations.

The industry has made impressive advances in helping enterprises overcome the barriers to sourcing and preparing their data, according to Appen's latest State of AI Report. But there is still a lot more to be done at different levels, including organization structure and company policies.

The enterprise AI life cycle can be divided into four stages: data sourcing, data preparation, model testing and deployment, and model evaluation.

Advances in computing and ML tools have helped automate and accelerate tasks such as training and testing different ML models. Cloud computing platforms make it possible to train and test dozens of different models of different sizes and structures simultaneously. But as machine learning models grow in number and size, they will require more training data.

Unfortunately, obtaining and annotating training data still requires considerable manual effort and is largely application-specific. Appen's report lists the common obstacles: "lack of sufficient data for a specific use case, new machine learning techniques that require greater volumes of data, or teams don't have the right processes in place to easily and efficiently get the data they need."

"High-quality training data is required for accurate model performance; and large, inclusive datasets are expensive," Appen's chief product officer Sujatha Sagiraju told VentureBeat. "However, it's important to note that valuable AI data can increase the chances of your project going from pilot to production; so, the expense is needed."

ML teams can start with prelabeled datasets, but they will eventually need to collect and label their own custom data to scale their efforts. Depending on the application, labeling can become extremely expensive and labor-intensive.

In many cases, companies have enough data, but they can't deal with quality issues. Biased, mislabeled, inconsistent or incomplete data reduces the quality of ML models, which in turn harms the ROI of AI initiatives.
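As a rough illustration of what checking for such issues can look like, the sketch below audits a small labeled text dataset for incomplete records, labels outside the expected set, conflicting labels on identical inputs, and class imbalance. The record layout (`text` and `label` keys), the function name `audit_labels` and the sample data are assumptions made for this example; they do not come from Appen's report.

```python
from collections import Counter


def audit_labels(records, allowed_labels):
    """Return simple quality indicators for a list of {"text", "label"} records."""
    # Incomplete: missing or empty text, or no label at all.
    incomplete = [
        i for i, r in enumerate(records)
        if not r.get("text") or r.get("label") is None
    ]
    # Mislabeled (in the narrow sense of a label outside the expected set).
    invalid = [
        i for i, r in enumerate(records)
        if r.get("label") is not None and r["label"] not in allowed_labels
    ]
    # Inconsistent: the same input text annotated with different labels.
    labels_by_text = {}
    for r in records:
        if r.get("text") and r.get("label") is not None:
            labels_by_text.setdefault(r["text"], set()).add(r["label"])
    conflicting = sorted(t for t, ls in labels_by_text.items() if len(ls) > 1)
    # Class distribution: a heavy skew can hint at a biased sample.
    counts = Counter(
        r["label"] for r in records if r.get("label") in allowed_labels
    )
    return {
        "incomplete": incomplete,
        "invalid": invalid,
        "conflicting": conflicting,
        "class_counts": dict(counts),
    }


# Hypothetical sample data for illustration only.
records = [
    {"text": "great product", "label": "pos"},
    {"text": "great product", "label": "neg"},  # conflicts with the row above
    {"text": "", "label": "pos"},               # empty input
    {"text": "meh", "label": None},             # missing label
    {"text": "bad", "label": "ngative"},        # typo in the label
    {"text": "awful", "label": "neg"},
]
report = audit_labels(records, allowed_labels={"pos", "neg"})
print(report)
```

Checks like these only surface candidates for review. Deciding whether a skewed class count reflects real bias, or which of two conflicting labels is right, still takes human judgment.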

"If you train ML models with bad data, model predictions will be inaccurate," Sagiraju said. "To ensure their AI works well in real-world scenarios, teams must have a mix of high-quality datasets, synthetic data and human-in-the-loop evaluation in their training kit."

According to Appen, business leaders are much less likely than technical staff to consider data sourcing and preparation as the main challenges of their AI initiatives. "There are still gaps between technologists and business leaders when understanding the greatest bottlenecks in implementing data for the AI lifecycle. This results in misalignment in priorities and budget within the organization," according to the Appen report.

"What we know is that some of the biggest bottlenecks for AI initiatives lie in lack of technical resources and executive buy-in," Sagiraju said. "If you take a look at these categories, you see that the data scientists, machine learning engineers, software developers and executives are dispersed across different areas, so it's not hard to imagine a lack of aligned strategy due to conflicting priorities between the various teams within the organization."

The variety of people and roles involved in AI initiatives makes it hard to achieve this alignment. The developers managing the data, the data scientists dealing with on-the-ground issues and the executives making strategic business decisions all have different goals in mind, and therefore different priorities and budgets.

However, Sagiraju sees the gap in understanding the challenges of AI slowly narrowing year over year, because organizations increasingly recognize the importance of high-quality data to the success of AI initiatives.

"The emphasis on how important data — especially high-quality data that match with application scenarios — is to the success of an AI model has brought teams together to solve these challenges," Sagiraju said.

Data challenges are not new to the field of applied ML. But as ML models grow bigger and data becomes more abundantly available, there is a need to find scalable solutions to assemble quality training data.

Fortunately, a few trends are helping companies overcome some of these challenges, and Appen's AI Report shows that the average time spent managing and preparing data is trending down.

One example is automated labeling. Object detection models, for instance, require the bounding box of each object in the training examples to be specified, which takes considerable manual effort. Automated and semi-automated labeling tools use a deep learning model to process the training examples and predict the bounding boxes. The automated labels are not perfect, and a human labeler must review and adjust them, but they speed up the process significantly. In addition, the automated labeling system can be further trained and improved as it receives feedback from human labelers.
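The review loop described above can be sketched in a few lines of Python. This is a minimal illustration and not how any particular labeling product works: `fake_predict` stands in for a trained detector, `fake_reviewer` stands in for a human working in a labeling interface, and the box format, scores and function names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed format


@dataclass
class Proposal:
    image_id: str
    box: Box
    score: float  # model confidence in [0, 1]


def prelabel(image_ids, predict):
    """Run the detector over unlabeled images and collect its box proposals."""
    return [
        Proposal(image_id, box, score)
        for image_id in image_ids
        for box, score in predict(image_id)
    ]


def review(proposals, correct):
    """Send every proposal to a reviewer, least confident first.

    `correct` returns the (possibly adjusted) box, or None to reject it.
    Returns the accepted labels plus the corrections, which can later be
    used to retrain the labeling model.
    """
    final, feedback = [], []
    for p in sorted(proposals, key=lambda p: p.score):
        fixed = correct(p)
        if fixed is None:
            feedback.append((p, None))   # reviewer rejected the proposal
            continue
        final.append((p.image_id, fixed))
        if fixed != p.box:
            feedback.append((p, fixed))  # reviewer adjusted the box
    return final, feedback


# Hypothetical stand-ins for a detector and a human reviewer.
def fake_predict(image_id):
    predictions = {
        "img1": [((10, 10, 50, 50), 0.95), ((0, 0, 5, 5), 0.30)],
        "img2": [((20, 20, 80, 90), 0.70)],
    }
    return predictions[image_id]


def fake_reviewer(p):
    if p.score < 0.5:
        return None                      # spurious detection
    if p.image_id == "img2":
        return (20, 20, 80, 88)          # tighten the box slightly
    return p.box                         # accept as proposed


proposals = prelabel(["img1", "img2"], fake_predict)
final, feedback = review(proposals, fake_reviewer)
print(final)
print(len(feedback), "corrections recorded for retraining")
```

Sorting by confidence is a design choice of this sketch: it puts the proposals most likely to need fixing in front of the reviewer first, while every proposal still gets reviewed.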

"While many teams start off with manually labeling their datasets, more are turning to time-saving methods to partially automate the process," Sagiraju said.

At the same time, there is a growing market for synthetic data. Companies use artificially generated data to complement the data they collect from the real world. Synthetic data is especially useful in applications where obtaining real-world data is costly or dangerous. An example is self-driving car companies, which face regulatory, safety and legal challenges in obtaining data from real roads.

"Self-driving cars require incredible amounts of data to be safe and prepared for anything once they hit the road, but some of the more complex data is not readily available," Sagiraju said. "Synthetic data allows practitioners to account for edge cases or dangerous scenarios like accidents, crossing pedestrians and emergency vehicles to effectively train their AI models. Synthetic data can create instances to train data when there isn't enough human-sourced data. It's critical in filling in the gaps."

Meanwhile, the evolution of the MLops market is helping companies tackle many challenges of the machine learning pipeline, including labeling and versioning datasets; training, testing, and comparing different ML models; deploying models at scale and keeping track of their performance; and gathering fresh data and updating the models over time.
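To make one of these tasks concrete, dataset versioning can be as simple as recording a content fingerprint alongside each trained model, so that any change to the data or its labels is detectable later. The sketch below uses only Python's standard library; the function name `dataset_fingerprint` and the record layout are assumptions for illustration, and dedicated MLops tools offer far more than this.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest identifying the dataset's contents.

    Records are serialized with sorted keys and then sorted as strings, so
    the fingerprint does not depend on record order or key order.
    """
    digest = hashlib.sha256()
    for line in sorted(json.dumps(r, sort_keys=True) for r in records):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical labeled records for illustration only.
v1 = [
    {"text": "great product", "label": "pos"},
    {"text": "awful", "label": "neg"},
]
v1_reordered = list(reversed(v1))
v2 = [
    {"text": "great product", "label": "pos"},
    {"text": "awful", "label": "pos"},  # one label changed
]

print(dataset_fingerprint(v1) == dataset_fingerprint(v1_reordered))
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))
```

Storing the fingerprint with each model's metadata makes it possible to tell which version of the data a given model was trained on.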

But as ML plays a greater role in enterprises, one thing that will become more important is human control.

"Human-in-the-loop (HITL) evaluations are imperative to delivering accurate, relevant information and avoiding bias," Sagiraju said. "Despite what many believe about humans actually taking a backseat in AI training, I think we’ll see a trend towards more HITL evaluations in an effort to empower responsible AI, and have more transparency about what organizations are putting into their models to ensure models perform well in the real world."
