Strategies for Acquiring High-Quality Training Data

By Angela Guess

Moritz Mueller-Freitag recently wrote in Dataconomy, “Access to high-quality training data is critical for startups that use machine learning as the core technology of their business. While many algorithms and software tools are open sourced and shared across the research community, good datasets are usually proprietary and hard to build. Owning a large, domain-specific dataset can therefore become a significant source of competitive advantage, especially if startups can jumpstart data network effects (a situation where more users → more data → smarter algorithms → better product → more users).”

Mueller-Freitag goes on to share five strategies for acquiring useful training data. His first strategy is manual work: “Building a good proprietary dataset from scratch almost always means putting a lot of up-front, human effort into data acquisition and performing manual tasks that don’t scale. Examples of startups that have used brute force in the beginning are plentiful. For instance, many chatbot startups employ human ‘AI trainers’ who manually create or verify the predictions their virtual agents make (with varying degrees of success and a high employee turnover rate). Even the tech giants resort to this strategy: all responses by Facebook M are reviewed and edited by a team of contractors.”

His next suggestion is narrowing the domain: “Most startups will try to collect data directly from users. The challenge is to convince early adopters to use the product before the benefits of machine learning fully kick in (because data is needed in the first place to train and fine-tune the algorithms). One way around this catch-22 is to drastically narrow the problem domain (and expand the scope later if needed). As Chris Dixon says: ‘The amount of data you need is relative to the breadth of the problem you are trying to solve’.”

Read more here.

Photo credit: Flickr/ Balaji Dutt
