Posted by Hanjo Odendaal on February 16, 2017 R
dplyr
An Introduction to Parameterized dplyr Expressions
The usefulness of any small function you write will eventually be judged on how generically it can be applied to arbitrary data. After exploring a blog post from December 2016, I became a lot more interested in writing dynamic code with the dplyr functions that form part of the data wrangling step in my analytical flow. This ability came with the new replyr package. No longer do I need to break up my data processing when columns change, simply because my code depends on specific column names in the dataset currently in use.
‘replyr allows you to encapsulate complex dplyr expressions without the use of the lazyeval package, which is the currently recommended way to manage dplyr‘s use of non-standard evaluation’
The example replyr provides works out summary statistics for an arbitrary column, with the ability to group by another column. Imagine you had a quick and easy function that could pump out summary statistics without too much fuss. Let's take a look at the replyr construct of such a function, provided by the package maintainers, Win-Vector:
For my analysis I will be using the snail dataset.
This dataset contains data on the probability of a snail surviving given certain stimuli such as:
exposure in weeks
relative humidity (4 levels)
temperature, in degrees Celsius (3 levels)
deaths
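For readers who want to follow along, here is a minimal sketch of loading the data. It assumes the dataset is `snails` from the MASS package, which matches the variables listed above; the post itself does not name the package.

```r
# Assumption: the "snail dataset" is `snails` from the MASS package.
data(snails, package = "MASS")

# Inspect the structure and the first few rows
str(snails)
head(snails)
```

In that dataset the columns are named `Species`, `Exposure`, `Rel.Hum`, `Temp`, `Deaths` and `N`.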
Let's now see how the function outputs with a simple example using the snail dataset.
As you can see, the summary statistics for the deaths column were worked out. For those who noticed the negative sdlower: no, snails did not wake up, it's purely for example purposes.
Expanding on this idea by including a grouping variable, we can get the summary statistics per species:
So, we can see the function works well in the sense that you can now write dynamic loops which could apply dplyr functions to an arbitrary column and grouping variable of your choice based on string inputs…
But that was not the end
In one part of the Win-Vector blog they have the following challenge:
To write such a function in dplyr can get quite hairy, quite quickly. Try it yourself, and see
So, with this in mind, and with some fighting I might add, I have developed a similar function using only the dplyr package, with some added benefits beyond those of the replyr function.
By combining one_of with the standard evaluation versions of the dplyr functions, we have recreated the results of the replyr function:
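A minimal sketch of such a string-driven function is shown here. It is not the original code: it uses the current dplyr idiom (`across()` with `all_of()`, the successor to `one_of()`, plus the `.data` pronoun) rather than the 2017 standard evaluation verbs. The function name `summarise_column`, its arguments, and the `sdlower`/`sdupper` definitions are assumptions for illustration, as is the use of `MASS::snails`.

```r
library(dplyr)

# Assumption: the snail dataset is `snails` from the MASS package.
data(snails, package = "MASS")

# Hypothetical helper (name and arguments are illustrative):
# summarise one column, given as a string, optionally per group.
summarise_column <- function(d, data_col, group_col = character(0)) {
  # Group only when a grouping column name was supplied
  if (length(group_col) > 0) {
    d <- group_by(d, across(all_of(group_col)))
  }
  d %>%
    summarise(
      mean = mean(.data[[data_col]]),
      sd = sd(.data[[data_col]]),
      .groups = "drop"
    ) %>%
    # Assumed definitions: one standard deviation either side of the mean
    mutate(sdlower = mean - sd, sdupper = mean + sd)
}

overall <- summarise_column(snails, "Deaths")
by_species <- summarise_column(snails, "Deaths", "Species")
```

Because both column names arrive as plain strings, the function can be called inside a loop over any set of columns without touching its body.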
The added benefit of using the dplyr notation is the ability to include multiple columns in the function that you want to summarise:
The replyr approach will not be able to handle this and will give you an error such as:
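As a rough illustration of the multi-column case, the sketch below passes a character vector of column names to `across()`. The variable names `group_col` and `data_cols`, the choice of columns, and the use of `MASS::snails` are assumptions, and the code uses current dplyr syntax rather than whatever the 2017 version of the function contained.

```r
library(dplyr)

# Assumption: the snail dataset is `snails` from the MASS package.
data(snails, package = "MASS")

# Column names supplied as strings (illustrative choices)
group_col <- "Species"
data_cols <- c("Deaths", "Exposure")

multi_summary <- snails %>%
  group_by(across(all_of(group_col))) %>%
  summarise(
    # One mean and one sd column per entry in data_cols,
    # named "<column>_<statistic>" by default
    across(all_of(data_cols), list(mean = mean, sd = sd)),
    .groups = "drop"
  )

multi_summary
```

The result has one row per species and one column per column/statistic pair, for example `Deaths_mean` and `Exposure_sd`.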
And with a little bit of tidying of the data, you can have a much richer dataset
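One way this tidying step could look is sketched here with `tidyr::pivot_longer()`. The original tidying code is not shown in the post, so this is only an assumption about the intent: reshape a wide summary (one column per column/statistic pair) into one row per species and variable. The dataset (`MASS::snails`) and the object names `wide` and `long` are likewise illustrative.

```r
library(dplyr)
library(tidyr)

# Assumption: the snail dataset is `snails` from the MASS package.
data(snails, package = "MASS")

# A wide summary with columns such as Deaths_mean and Exposure_sd
wide <- snails %>%
  group_by(Species) %>%
  summarise(
    across(all_of(c("Deaths", "Exposure")), list(mean = mean, sd = sd)),
    .groups = "drop"
  )

# Split "<variable>_<statistic>" names: the first part becomes a
# `variable` column, the second part becomes the statistic columns
long <- wide %>%
  pivot_longer(
    cols = -Species,
    names_to = c("variable", ".value"),
    names_sep = "_"
  )

long
```

This relies on the summarised column names containing no underscore of their own, which holds for the two columns used here.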
I find applying parameterized dplyr functions an amazing advantage when working with more advanced ETL workflows. This is by no means to say that the replyr package is redundant; rather, it shows that dplyr can conduct analysis dynamically, albeit with some fighting to get it to work.