by Angela Guess
Svetoslav Marinov recently wrote in Information Management, “A friend of mine recently reminded me of the notorious quote from Frederick Jelinek (the father of modern speech recognition), ‘Anytime a linguist leaves the group the recognition rate goes up.’ I remember being quite upset about it, back during my linguistics studies. Is it really so that if we exchange the domain experts (i.e. phonologists in his case) with pure engineers, the performance of the system will improve? This led me to think about my domain this time. Given a system that heavily utilizes machine learning (ML), what are the things that make its performance go up: the domain experts or the lack of these? For me it is clearly the former and here I will defend why.”
Marinov goes on, “Just to set up the scene, I work in the legal arena, a highly specialized domain with well-defined tasks, where we provide technology to support, augment and increase the productivity of legal teams and departments. We utilize both supervised ML techniques (i.e. we have access to labeled data, we know what is true and false) and unsupervised ML (i.e. all we have is just raw data). Let us focus on the first case only – you have a task to solve and you have labeled data to train an ML system. That applies to many highly specialized domains where data is unique. So here is my list of personal favorites, influenced by my experience, and I will highlight the two that are really crucial for the success of such a product.”
Marinov’s list begins, “(1) A system to scale up testing and training. It is not only Google, Microsoft or Facebook that can afford such a system. It pays off to have an internal platform at hand where the engineer(s) can quickly test new hypotheses, try out or implement new algorithms, run anything from simple Bayesian classifiers to the more time-consuming deep learning. And as we focus on narrow domains here, Andrew Ng, the chief scientist at Baidu, recently said: ‘Most of the value of deep learning today is in narrow domains where you can get a lot of data’.”
Photo credit: Flickr