
Dealing with Outliers in Big Data

by Angela Guess

Lisa Morgan recently wrote in InformationWeek, “Data analytics has its own vocabulary that business decision-makers are under pressure to learn. Beware, though, because technical terms are often used loosely, sometimes to the detriment of individuals and their companies. An outlier is a good example. A lot of people are talking about outliers, but not a lot of people understand why they exist, what causes them, and what should be done with them, if anything. ‘An outlier is a member of a defined dataset which has a dramatically different value than the other members of the set. It can be the result of measurement or recording errors, or the unintended and truthful outcome resulting from the set’s definition,’ said Tom Bodenberg, chief economist and data consultant at market research firm Unity Marketing in an interview.”

Morgan goes on, “Outliers make their way into reported statistics every day. Sometimes their inclusion or exclusion is obvious, and sometimes it isn’t. For example, in 1984 the University of Virginia reported that the average starting salary of Rhetoric and Communications graduates was $55,000. However, an outlier was skewing the analysis. The dataset included one hundred graduates with $25,000 salaries and NBA first draft pick Ralph Sampson, another graduate. His starting salary exceeded $1 million. Outliers can pop up for different reasons. Some are caused by mistakes made by humans or machines. Others represent actual data. Most business professionals haven’t considered the difference, and they have no idea what to do with them. One tactic is to include outliers in a dataset or exclude outliers from a dataset as a matter of course, without considering the potential consequences. While it’s true that the inclusion or removal of outliers may have little or no effect on an analysis, the opposite may be true.”
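The salary example can be checked with a few lines of arithmetic. The Python sketch below is illustrative only. It assumes the dataset holds exactly the 101 people described (one hundred graduates at $25,000 plus one outlier) and that the $55,000 average is exact. The article says only that Sampson's salary exceeded $1 million, so the outlier value here is the figure implied by those assumptions, not a reported number. The variable names are made up for the illustration.

```python
from statistics import mean, median

# From the article: one hundred graduates earning $25,000 each.
typical_salaries = [25_000] * 100

# From the article: the reported average starting salary.
reported_mean = 55_000

# Assumption: the dataset is exactly these 100 graduates plus one outlier,
# so the outlier salary is the implied total minus the typical salaries.
implied_outlier = reported_mean * 101 - sum(typical_salaries)

salaries = typical_salaries + [implied_outlier]

mean_with = mean(salaries)             # pulled upward by the single outlier
median_with = median(salaries)         # barely affected by the outlier
mean_without = mean(typical_salaries)  # average once the outlier is excluded

print("implied outlier salary:", implied_outlier)
print("mean with outlier:", mean_with)
print("median with outlier:", median_with)
print("mean without outlier:", mean_without)
```

Under these assumptions a single value more than doubles the mean, while the median stays at the salary nearly every graduate actually earned. That is one reason to compare both measures before deciding whether to keep or drop an outlier.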

Read more here.

Photo credit: Flickr/ Marc_Smith
