By Steve Miller.
With more than a little serendipity, I came across a report detailing the results of the third annual survey by Burtch Works Executive Recruiting, entitled “SAS, R, or Python Survey 2016: Which Tool Do Analytics Pros Prefer?” The survey asked each respondent to name the single favorite of the three programming tools.
I’m generally wary of such surveys, suspicious of potential marketing bias on the part of the survey conductors and skeptical of the sample’s fidelity to the population it purports to represent. At the same time, though, the combination of SAS, R, and Python resonates in my consulting world.
The Burtch survey results squared with two conclusions I’ve drawn in the past few years, thus earning important “face validity.” The first finding is the marked difference in tool utilization by age, with younger practitioners in the R/Python camp, while older colleagues identify more with SAS. A case in point: I learned SAS in statistics graduate school many years ago, while students in the same program today compute with R.
The second is that R/Python is the choice among Data Science practitioners, even as SAS remains popular among other analytics types. Burtch distinguishes Data Scientists from analytics practitioners by the former’s obsession with large, unstructured, messy data and programming. My experiences at Data Science conferences, where R and Python proliferate while SAS sits on the sidelines, confirm that finding.
I worked intimately with SAS software for the first 20 years of my analytics career, switching to open source R and Python 15 years ago. I’m now the giddy recipient of the exploding largesse from both the Python and R ecosystems.
And yet with all the open source analytics developments, SAS remains a big statistical dog with corporate America. So a platform that encourages interoperability between SAS, R, and Python would seem to be most welcome.
Enter World Programming System (WPS), “a powerful and versatile platform for working with data. WPS software can run programs written in the SAS language.” Essentially, WPS provides a cheaper SAS programming language along with a superior development environment. It can easily replace SAS software for most SAS programming tasks.
WPS was the choice of my current health care informatics client because SAS is the legacy data repository, but SAS software was prohibitively expensive compared with WPS. Once the customer was convinced WPS was a fully capable and seamless SAS programming environment, the choice was a no-brainer. Now, a month into development using the data step and proc SQL, they’re quite happy. A nervous consultant, I’m guardedly optimistic.
With a slick programming workbench, WPS is a pleasure to use for SAS developers. I especially like access to relational databases and proc R, which can exchange data between SAS data sets and R and also execute R code. Both of these features are available in SAS software – at additional cost.
The current “project” is primarily SAS language programming against monthly health care utilization data sets that house millions of records. Aggregating several years can yield table volumes in the hundreds of millions of rows, accumulating to terabytes of disk storage.
So far WPS has been up to the task – with one exception. The provided data sets often have well more than 100 attributes, of which the vast majority house only missing data. “Unioning” this data across months with the baggage columns creates a performance drag, even though compression helps keep the storage size down. What’s needed is a simple way to eliminate the dead attributes.
For SAS programmers, identifying the baggage columns is easy using proc FREQ, but getting rid of them is another matter. The prospect of building a column-delete procedure with proc SQL, the data step, and the macro language was less than inviting.
The solution that was ultimately adopted used WPS’s proc R to first “pull” each offending SAS data set into an R data frame, then deployed R’s functional programming features to identify the unnecessary columns and create a data frame excluding those attributes. Finally, the new R data frame is pushed back into a SAS data set. Voilà.
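The column-pruning idea itself is language-neutral, so here is a minimal illustrative sketch in plain Python. It is not the proc R code used on the project. The column names are hypothetical, and the definition of "missing" (None, NaN, or a blank string) is an assumption standing in for SAS missing values.

```python
import math
from typing import Any, Dict, List

Row = Dict[str, Any]


def is_missing(value: Any) -> bool:
    """Assumed definition of 'missing': None, NaN, or a blank string."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, float):
        return math.isnan(value)
    return False


def dead_columns(rows: List[Row]) -> List[str]:
    """Return the columns that are missing in every row, in first-seen order."""
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [
        name
        for name in columns
        if all(is_missing(row.get(name)) for row in rows)
    ]


def drop_dead_columns(rows: List[Row]) -> List[Row]:
    """Return copies of the rows without the all-missing columns."""
    dead = set(dead_columns(rows))
    return [{k: v for k, v in row.items() if k not in dead} for row in rows]


# Hypothetical monthly extract: two "baggage" columns hold nothing but missing data.
rows = [
    {"member_id": 1, "claim_amt": 120.5, "legacy_code": None, "notes": ""},
    {"member_id": 2, "claim_amt": None, "legacy_code": None, "notes": "  "},
    {"member_id": 3, "claim_amt": 87.0, "legacy_code": float("nan"), "notes": ""},
]

cleaned = drop_dead_columns(rows)
print(dead_columns(rows))
print(cleaned[0])
```

A column with only some missing values, such as claim_amt here, is kept; only columns that are missing in every row are dropped. For data already in a pandas data frame, `df.dropna(axis=1, how="all")` removes columns that are entirely NaN/None, though it does not treat blank strings as missing.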
The point is that it’s not SAS, R, or Python, but rather SAS/WPS, R, and Python. All are critical tools in the 2016 analytics professional tool chest. And having them collaborate painlessly is critical. I routinely move data between R and Python based on the availability of the newest algorithms and graphics functions in each. The feather packages available in both R and Python from the Apache Arrow project let me easily copy data between the platforms.
I’d like to see significant 3-way interoperability between SAS, R, and Python and believe WPS might be uniquely qualified to lead the charge. Hopefully, WPS agrees and assumes the challenge.
*Note: New proc IMPORT capabilities in a soon-to-be-released version of WPS will allow programmers to circumvent loading dead (all missing) columns into SAS data sets.