Data is often not created for purposes that please data scientists. It is often collected for operations or billing, and as such, a significant amount of preparation time is needed to make it ready for data science.
This is clearly the case with location data sourced from wireless carriers. The original purpose of this data was network planning. In order to understand how best to build out a cellular network, wireless carriers needed to know where people were likely to use their devices, what areas were likely to be difficult to reach, how to hand off from tower to tower, and so on. Over time, the use of this data has evolved to serve everything from emergency location services to advertising to weather detection.
There are typically several vendors that operate the cellular network for any given wireless carrier across different generations of technologies, devices, spectrum, and hardware. Add the fact that there are tens of billions of geo-locatable events being generated across the country per day, and the scale of the opportunity and challenge becomes evident.
How do you process this data most efficiently? Through data science processing pipelines (DSPP). The purpose of these pipelines is to render a refined data product that retains most of the information and usefulness of the data set while enhancing its usability. To this end, data scientists can follow the POFMU principle: "Process Once For Many Uses." It is the best practice when building data sets for general consumption, guiding you toward data features that are useful across applications without being too specific to any single one. At the same time, keep in mind that you need to retain enough underlying detail for other data scientists to build those more specific features themselves.
The following process defines POFMU and shapes the DSPP boundaries.
- Define a possible set of candidate use cases for a data set.
- Identify long poles of processing for those candidate use cases.
- Identify processing patterns that are common among the use cases.
- Determine data features that are most likely to be used.
- Define a data structure.
Saving the details of these steps for another day, this process, when done as a mind meld of product, data science, and business, can be magical. A word of caution: Data scientists should have good familiarity with, and a deep understanding of, the data-generating process prior to this exercise. A data scientist should effectively have a mental SWOT (strengths, weaknesses, opportunities, threats) of the candidate data set, which is critical for creating maximum value.
While this is all fine and dandy from a logical perspective, it isn't the meat of the work for establishing a DSPP. A typical DSPP is an amalgamation of exploratory data analysis (EDA), outlier analysis, dimensionality reduction, and feature enrichment. Remember that all these steps are performed to render a refined data product that retains most of the information and usefulness of the data set while enhancing its usability; the latter is the result of POFMU. In the rest of this article, I'll discuss applying a DSPP to network-sourced location data, drawing on my own experience.
Case Study: Processing Network-Sourced Location Data
Problem Statement
Network-sourced location data is generated when a mobile device connects to base stations (cell towers). Algorithms are applied to generate an estimate of the device's location. The intricacies of how these algorithms work can be covered another time, but suffice it to say that the strengths of this data set are its volume and consistency. Millions of devices are seen day-in and day-out, and a large, consistent panel of devices is seen every day. In contrast, a typical GPS panel will have 50% churn or more in a given month, while 10% or less of the panel will be seen consistently.
To make the DSPP process more tangible, I’ll walk you through our steps to create an intelligence entity that identifies where a device dwells.
Signals Validation and Testing Platform
First, we start with a platform whose sole purpose is to measure the quality of location data sets. The platform matches two data sets on common user keys with time thresholding and then outputs a host of metrics to benchmark the entire DSPP.
We are fanatical about measuring how a data set is impacted when we change underlying algorithms. So, knowing that a tweak to our filtering algorithm reduces ping-to-ping error by 25 meters but increases routing error by 300 meters is critical. The Signals Validation and Testing Platform came out of an exploratory data analysis exercise aimed at understanding the biases and errors present in wireless carrier location data. The need for such a platform to monitor changes post-production release has never been more apparent.
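As a rough illustration, the core matching step of such a platform might look something like the pandas sketch below. The column names, the 30-second match threshold, and the specific metrics are illustrative assumptions rather than the platform's actual implementation.

```python
import numpy as np
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # Earth radius in meters
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

def benchmark(network_df, truth_df, max_gap="30s"):
    """Match two location data sets on a shared device key with a time
    threshold, then report simple error metrics.

    Both frames are assumed to carry: device_id, timestamp, lat, lon.
    """
    matched = pd.merge_asof(
        network_df.sort_values("timestamp"),
        truth_df.sort_values("timestamp"),
        on="timestamp",
        by="device_id",
        tolerance=pd.Timedelta(max_gap),
        direction="nearest",
        suffixes=("_net", "_gps"),
    ).dropna(subset=["lat_gps"])
    err = haversine_m(matched["lat_net"], matched["lon_net"],
                      matched["lat_gps"], matched["lon_gps"])
    return {"matched_pings": len(matched),
            "median_error_m": float(err.median()),
            "p90_error_m": float(err.quantile(0.9))}
```

Metrics like these, computed before and after an algorithm change, are what let us say that a tweak helped or hurt.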
Filtering Algorithm
Think about typical user behavior: within any given day, there are periods of motion and rest. Across millions of devices, this generates hundreds of terabytes of raw information daily.
Having 200 pings evenly spaced throughout a day could potentially tell you as much as 600 would. At a certain point, information is superfluous, so filtering out data that contains little information is extremely useful. To do this, we've developed an algorithm that takes ideas from both the Kalman Filter and the Particle Filter. Both approaches aim to take noisy measurement data and either smooth it out or estimate the uncertainty around a particular measurement.
A benefit of the Particle Filter over the Kalman Filter is its applicability to nonlinear systems. A nice aspect of the Kalman Filter is the derivation of the Kalman gain; this measure helps you understand whether the filter is improving or degrading the estimate. Our filter aims both to understand the information gain from any particular point and to determine how much belief we should put in that measurement. You can consider this filter cognizant of the past and omniscient of the future, since it draws from both t-n and t+n observations.
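Our production filter is considerably more involved, but the core idea of weighing a prior estimate against a new, noisy measurement can be sketched with a textbook one-dimensional Kalman-style update. Everything here (the 1-D state, the constant process variance, the use of reported accuracy as measurement variance) is a simplifying assumption for illustration only.

```python
import numpy as np

def smooth_track(positions, accuracies, process_var=25.0):
    """Illustrative Kalman-style forward pass over a 1-D position track.

    positions  : reported coordinates (e.g., projected meters), one per ping
    accuracies : reported measurement variance for each ping
    The gain computed at each step reflects how much belief the filter
    places in a new measurement versus the prior estimate.
    """
    est, var = positions[0], accuracies[0]
    smoothed, gains = [est], [1.0]
    for z, r in zip(positions[1:], accuracies[1:]):
        var += process_var            # predict: uncertainty grows between pings
        gain = var / (var + r)        # Kalman gain: trust in this measurement
        est = est + gain * (z - est)  # pull the estimate toward the measurement
        var = (1 - gain) * var        # update the uncertainty
        smoothed.append(est)
        gains.append(gain)
    return np.array(smoothed), np.array(gains)
```

Pings whose gain is very low contribute little new information and are natural candidates for filtering out.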
One can consider this row-wise dimensionality reduction rather than columnar reduction. Some data scientists have difficulty throwing away any information, yours truly included. However, after much testing, we’ve found that we can reduce the data size by roughly 60% and improve our clustering or modality algorithms via intelligent filtering. In addition, we save on compute in these downstream processes.
Clustering Algorithm
Many people spend a great deal of their day relatively stationary. To an alien observer only cognizant of a ping on a nondescript map, it seems as though arbitrary boundaries constrain one's movement. Then suddenly, the person will break loose of the boundary, move along some deterministic path, and once again become bounded. What is important here is that these arbitrary boundaries don't always have known physical boundaries. Their size varies, as does the density of information within them.
Because of these issues, many clustering algorithms have difficulty identifying periods of dwell. Even the best algorithms need to be customized to deal with the edge cases that exist. Our go-to algorithm is spatiotemporal DBSCAN. However, in a signal-rich environment with a slow-moving device, clusters can grow arbitrarily large. This can be particularly vexing in dense urban environments such as Manhattan. While we have considered further customizing DBSCAN to account for these nuances, we decided instead to turn this into a classification problem, stationary ping or not, by applying machine learning and deep learning algorithms to the hundreds of millions of training examples this data provides.
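For intuition, a bare-bones stand-in for that kind of spatiotemporal clustering, spatial DBSCAN per device followed by a minimum dwell-time check, might look like the sketch below. The 100-meter eps, 5-ping minimum, and 5-minute dwell threshold are illustrative values, not our production parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

def dwell_clusters(lats, lons, timestamps, eps_m=100, min_pings=5, min_dwell_s=300):
    """Label pings that belong to a stationary dwell; -1 marks pings in motion.

    lats, lons : arrays of coordinates for a single device
    timestamps : array of epoch seconds for the same pings
    Clusters spatially, then demotes clusters that are too brief in time.
    """
    coords = np.radians(np.column_stack([lats, lons]))
    labels = DBSCAN(eps=eps_m / EARTH_RADIUS_M,
                    min_samples=min_pings,
                    metric="haversine").fit_predict(coords)
    for label in set(labels) - {-1}:
        idx = labels == label
        span = timestamps[idx].max() - timestamps[idx].min()
        if span < min_dwell_s:      # too brief to be a dwell
            labels[idx] = -1        # treat as in-motion / singleton pings
    return labels
```

A sketch like this also makes the failure mode clear: in a signal-rich area, a slow-moving device keeps satisfying the density condition and the spatial cluster keeps growing, which is exactly what pushed us toward a per-ping classification approach.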
As we all know, wireless carrier networks are large and dynamic. The density and accuracy of pings in a region can be highly influenced by the number of devices and towers, as well as by maintenance or hardware changes. Again, this is where the Signals Validation and Testing Platform comes into play. Having a system that can measure changes in underlying signal accuracy allows us to create a continuous feedback loop. This is a virtuous loop: even knowledge of a degradation in frequency or accuracy is beneficial, because it can be incorporated into the model, allowing for the self-learning of parameters.
These processes have enabled us to transform 20 billion raw pings into 200 million clusters per day. The median number of stationary clusters for a device is around 3 to 4, which intuitively correlates to observations of how people move about throughout their day. The important point is that singleton clusters can exist. Allowing for pings in motion to be singletons proves useful when trying to understand a device’s journey. At this point, the pipeline has done more row-wise dimensionality reduction and feature enrichment.
In fact, the features are rich enough to define an intelligence entity called Visits, which is composed of a cluster and its attributes. These attributes can be the start and stop time of a cluster (from which a dwell can be computed), the cluster boundary, and the weighted centroid of the cluster. The intersection of a cluster boundary with a point of interest boundary generates a visit.
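As an illustration only, a Visit-like entity could be sketched as the following dataclass; the field names and types are my own assumptions, not our production schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class Visit:
    """Illustrative sketch of a Visits intelligence entity."""
    device_id: str
    start_time: datetime                 # first ping in the cluster
    end_time: datetime                   # last ping in the cluster
    centroid: Tuple[float, float]        # weighted centroid of the cluster (lat, lon)
    boundary: List[Tuple[float, float]]  # cluster boundary polygon vertices
    poi_id: Optional[str] = None         # point of interest whose boundary the
                                         # cluster boundary intersects, if any

    @property
    def dwell_seconds(self) -> float:
        """Dwell time computed from the cluster's start and stop times."""
        return (self.end_time - self.start_time).total_seconds()
```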
Modality Algorithm
To understand how the device traversed from cluster to cluster (also known as "the path"), it is useful to know whether the path was completed by walking, driving, biking, or taking public transit. Remember those singletons from clustering? They come in handy for estimating the mode of travel between clusters. Again, we've added information to our data set, which allows for an array of rich downstream applications.
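A real modality model draws on much more than speed, but as a toy illustration of how those singleton pings can inform mode, a naive speed-threshold heuristic might look like this; the cutoffs are rough assumptions for illustration only.

```python
def classify_mode(avg_speed_mps: float) -> str:
    """Naive speed-threshold heuristic for the mode of travel between two clusters.

    avg_speed_mps is the distance along the in-motion (singleton) pings divided
    by the elapsed time. A production modality model would also use road and
    transit maps, ping cadence, and learned patterns rather than fixed cutoffs.
    """
    if avg_speed_mps < 2.5:      # roughly 9 km/h: walking pace
        return "walking"
    if avg_speed_mps < 7.0:      # roughly 25 km/h: biking pace
        return "biking"
    if avg_speed_mps < 35.0:     # roughly 125 km/h: driving or public transit
        return "driving_or_transit"
    return "high_speed"          # rail, highway, or noisy data
```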
Conclusion
By focusing on the POFMU principle, the location DSPP yields a standardized intelligence entity that can be applied across multiple industry verticals to answer hundreds of distinct problems.
To summarize, the list below walks through each logical step in the processing pipeline.
- Define a possible set of candidate use cases for a data set.
  - Any vertical application that could benefit from an understanding of how, when, and where populations move: urban planning, fintech applications, out-of-home advertising, transportation planning, and so on.
- Identify the long poles of processing for those candidate use cases.
  - Billions of points per day
  - Routing is computationally intensive
  - Identifying modality and dwell is non-trivial
  - Network-sourced location data can be messy/noisy
- Identify the processing patterns that are common among the use cases.
  - Clean up the data
  - Map data to roads
  - Map data to points of interest
  - Determine how long someone stays
  - Determine the mode of travel
- Determine data features that are most likely to be used.
  - Modality
  - Dwell time
  - Distance traveled
  - Bearing
  - Speed
  - Boundary of dwell
- Craft a data structure that retains sufficient information and yields sufficient storage compression.
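As one illustration of that last step, a columnar layout for the Visits entity could be expressed as a Parquet-friendly schema like the sketch below; the field names and types are assumptions rather than our actual structure.

```python
import pyarrow as pa

# Hypothetical columnar schema for the Visits entity, chosen so that downstream
# users can rebuild dwell, distance, bearing, and speed features as needed.
visit_schema = pa.schema([
    ("device_id", pa.string()),
    ("start_time", pa.timestamp("ms", tz="UTC")),
    ("end_time", pa.timestamp("ms", tz="UTC")),
    ("centroid_lat", pa.float64()),
    ("centroid_lon", pa.float64()),
    ("boundary_wkt", pa.string()),                         # cluster boundary as a WKT polygon
    ("modality", pa.dictionary(pa.int8(), pa.string())),   # mode of travel into the cluster
    ("ping_count", pa.int32()),
])
# Writing this as day-partitioned Parquet with dictionary encoding and snappy
# compression keeps storage compact while leaving every column queryable.
```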