In last month’s blog, I compared several functions that compute frequencies and crosstabs in R. The ones I’ve worked with primarily, and the foci of Part 1, were table from the base package, xtabs from the stats package, and count from Hadley Wickham’s plyr package.
Tests were conducted on a data set consisting of three years of Medicare provider utilization summaries available from cms.gov. The final R data.frame/data.table, medicarephysician, produced from the code in the cell below, is meaty, with over 27M records and 30 attributes. I performed the analyses for Part 1 and Part 2 on my Wintel notebook with 64GB RAM.
The 27M+ medicarephysician records consume slightly more than 6GB of memory. Note the use of the add-on packages data.table, readr, and dplyr.
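The build step might look something like the following. This is a sketch rather than the original cell: the cms.gov file names, the year tagging, and the column-name normalization are my assumptions.

```r
library(data.table)
library(readr)
library(dplyr)

# Hypothetical file names for the three annual cms.gov extracts.
files <- c("medicare2013.txt", "medicare2014.txt", "medicare2015.txt")

medicarephysician <- rbindlist(
  lapply(seq_along(files), function(i) {
    d <- read_tsv(files[i], progress = FALSE)   # readr's fast delimited reader
    setDT(d)                                    # convert to data.table by reference
    d[, year := 2012L + i]                      # tag each record with its source year
  }),
  use.names = TRUE, fill = TRUE
)

# Normalize to the lowercase, underscore-free names used below (an assumption).
setnames(medicarephysician, tolower(gsub("_", "", names(medicarephysician))))

dim(medicarephysician)   # expect 27M+ rows, ~30 columns
```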
In Part 2 this month, I perform the same tests as in Part 1, this time using the capabilities and functions of the data.table and dplyr packages.
First, load the medicarephysician data.table.
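A minimal load step, assuming the table was serialized to disk in an earlier session (the file name is hypothetical):

```r
library(data.table)

medicarephysician <- readRDS("medicarephysician.rds")
setDT(medicarephysician)   # restore data.table internals after deserialization
tables()                   # sanity check: 27M+ rows, 30 columns, 6GB+
```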
Next, use data.table package functionality to compute frequencies for each of the four examples considered in Part 1. The first cell tallies counts for the nppescredentials attribute of medicarephysician, which has a relatively high cardinality of 16,000+. Notice the sub-second performance, easily beating the best from last month.
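A sketch of that first cell, using the core data.table idiom: group with by=, count with .N, then chain an ordering and a percent-of-total column (the result variable name is mine):

```r
system.time(
  credfreqs <- medicarephysician[, .(count = .N), by = nppescredentials][
    order(-count)][, percent := 100 * count / sum(count)]
)
head(credfreqs)
```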
The same data.table approach applied to the high-cardinality npi attribute (over 1M distinct values) outperforms Part 1's competitors by an order of magnitude.
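The identical idiom, which scales well even with a million-plus groups:

```r
system.time(
  npifreqs <- medicarephysician[, .(count = .N), by = npi][order(-count)]
)
nrow(npifreqs)   # one row per distinct npi
```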
Similarly, the data.table solution for the third test of bivariate frequencies or crosstabs with attributes nppescredentials and providertype substantially bests last month’s solutions.
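In data.table terms, a crosstab is simply a grouping on two attributes at once, returned in long format. A sketch:

```r
system.time(
  credbytype <- medicarephysician[, .(count = .N),
                                  by = .(nppescredentials, providertype)][order(-count)]
)
head(credbytype)
```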
And finally, the fourth cell, which counts null and not-null frequencies, completes in well under a second, again much faster than last month's tests. Indeed, the timings of all four examples with data.table sum to around four seconds, a no-contest winner over last month.
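The null/not-null tally is just a grouping on a computed is.na flag; the column tested here is illustrative:

```r
system.time(
  nullfreqs <- medicarephysician[, .(count = .N),
                                 by = .(isnull = is.na(nppescredentials))][
    , percent := 100 * count / sum(count)]
)
nullfreqs
```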
It’s pretty easy to package these data.table frequency capabilities in a reusable R function. frequenciesdyn, below, takes two strings as arguments: the first, the name of the data.table; the second, the names of one or more attributes for which tallies are desired. It then builds and dynamically executes strings that tally counts and calculate percent of totals using R’s eval-parse capability. Each of the examples above is re-run using this function, with comparable performance.
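Here is one way frequenciesdyn might be written; this is my reconstruction of the approach the text describes, not the original listing:

```r
# Build the data.table grouping expression as a string, then execute it
# with eval(parse(...)). Argument names are mine.
frequenciesdyn <- function(dtname, vars) {
  expr <- paste0(
    dtname,
    "[, .(count = .N), by = .(", vars, ")]",
    "[order(-count)][, percent := 100 * count / sum(count)]"
  )
  eval(parse(text = expr))
}

# Re-running the earlier examples through the function:
head(frequenciesdyn("medicarephysician", "nppescredentials"))
head(frequenciesdyn("medicarephysician", "nppescredentials, providertype"))
head(frequenciesdyn("medicarephysician", "isnull = is.na(nppescredentials)"))
```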
With the frequenciesdyn procedure building block in place, it’s pretty straightforward to craft a driver function that loops through a data.table’s columns and tallies frequencies of each in turn. The well-worn genallnullfreq function below returns the null/not null frequencies for each column in medicarephysician by year.
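A sketch of such a driver, built on frequenciesdyn; the function name matches the blog, but the body is my reconstruction:

```r
genallnullfreq <- function(dtname) {
  cols <- setdiff(names(get(dtname)), "year")
  rbindlist(
    lapply(cols, function(col) {
      # percent here is of all records, across years
      f <- frequenciesdyn(dtname, paste0("year, isnull = is.na(", col, ")"))
      f[, attribute := col]
      setcolorder(f, c("attribute", "year", "isnull", "count", "percent"))
      f
    })
  )
}

nullsummary <- genallnullfreq("medicarephysician")
```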
Finally, the generic genallfreq function returns the frequencies of all columns of a data.table. The returned data can be quite large if the input table is big and its columns have high cardinalities. In the example below, I return frequencies on a subset of eleven table columns.
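Again a reconstruction rather than the original listing, and with an illustrative three-column subset standing in for the eleven columns used in the blog:

```r
genallfreq <- function(dtname, cols = names(get(dtname))) {
  rbindlist(
    lapply(cols, function(col) {
      f <- frequenciesdyn(dtname, col)
      setnames(f, col, "value")
      f[, value := as.character(value)]   # coerce so mixed column types stack
      f[, attribute := col]
      setcolorder(f, c("attribute", "value", "count", "percent"))
      f
    }),
    use.names = TRUE
  )
}

somefreqs <- genallfreq("medicarephysician",
                        c("year", "providertype", "nppescredentials"))
```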
The community-developed data.table package and Hadley Wickham's package family (dplyr, readr, tidyr) have significantly enhanced both the capacity and performance of basic data analysis tasks in R. Indeed, so fundamental are these packages to exploratory analysis that they must now be central to any serious R work.
I’ll continue presenting ideas on data exploration/analysis in R in subsequent blogs.