R data.table symbols and operators you should know

[ad_1]

R data.table code becomes more efficient — and elegant — when you take advantage of its special symbols and functions. With that in mind, we’ll look at some special ways to subset, count, and create new columns.

For this demo, I’m going to use data from the 2019 Stack Overflow developers survey, with about 90,000 responses. If you want to follow along, you can download the data from Stack Overflow.

[youtube https://www.youtube.com/watch?v=x6ufCO7H9pY]

If the data.table package is not installed on your system, install it from CRAN and then load it as usual with library(data.table). To start, you may want to read in just the first few rows of the data set to make it easier to examine the data structure. You can do that with with data.table’s fread() function and the nrows argument. I’ll read in 10 rows:

data_sample <- fread("data/survey_results_public.csv", nrows = 10)

As you’ll see, there are 85 columns to examine. (If you want to know what all the columns mean, there are files in the download with the data schema and a PDF of the original survey.) 

To read in all the data,  I’ll use:

mydt <- fread("data/survey_results_public.csv")

Next, I’ll create a new data.table with just a few columns to make it easier to work with and see results. A reminder that data.table uses this basic syntax: 

mydt[i, j, by]

The data.table package introduction says to read this as “take dt, subset or reorder rows using i, calculate j, grouped by by.” Keep in mind that i and j are similar to base R’s bracket ordering: rows first, columns second. So i is for operations you’d do on rows (choosing rows based on row numbers or conditions); j is what you’d do with columns (select columns or create new columns from calculations). However, note also that you can do a lot more inside data.table brackets than a base R data frame. And the “by” section is new to data.table.

Since I’m selecting columns, that code goes in the “j” spot, which means the brackets need a comma first to leave the “i” spot empty:

mydt[, j]

Select data.table columns

One of the things I like about data.table is that it’s easy to select columns either quoted or unquoted. Unquoted is often more convenient (that’s usually the tidyverse way). But quoted is useful if you’re using data.table inside your own functions, or if you want to pass in a vector you created somewhere else in your code.

You can select data.table columns the typical base R way, with a conventional vector of quoted column names. For example: 

dt1 <- mydt[, c("LanguageWorkedWith", "LanguageDesireNextYear", 
"OpenSourcer", "CurrencySymbol", "ConvertedComp”,
“Hobbyist”)]

If you want to use them unquoted, create a list instead of a vector and you can pass in the unquoted names. 

dt1 <- mydt[, list(LanguageWorkedWith, LanguageDesireNextYear, 
OpenSourcer, CurrencySymbol, ConvertedComp,
Hobbyist)]

And now we come to our first special symbol. Instead of typing out list(), you can just use a dot:

dt1 <- mydt[, .(LanguageWorkedWith, LanguageDesireNextYear, 
OpenSourcer, CurrencySymbol, ConvertedComp,
Hobbyist)]

That .() is a shortcut for list() inside data.table brackets.

What if you want to use an already-existing vector of column names? Putting the vector object name inside data.table brackets won’t work. If I create a vector with quoted column names, like so: 

mycols <- c("LanguageWorkedWith", "LanguageDesireNextYear", 
"OpenSourcer", "CurrencySymbol", "ConvertedComp", "Hobbyist")

Then this code will not work: 

dt1 <- mydt[, mycols]

Instead, you need to put .. (that’s two dots) in front of the vector object name:

dt1 <- mydt[, ..mycols]

Why two dots? That seemed kind of random to me until I read the explanation. Think of it like the two dots in a Unix command-line terminal that move you up one directory. Here, you’re moving up one namespace, from the environment inside data.table brackets up to the global environment. (That really does help me remember it!)

Count data.table rows

On to the next symbol. To count by group, you can use data.table’s .N symbol, where .N stands for “number of rows.” It can be the total number of rows, or number of rows per group if you’re aggregating in the “by” section. 

This expression returns the total number of rows in the data.table: 

mydt[, .N]

The following example calculates the number of rows grouped by one variable: whether people in the survey also code as a hobby (the Hobbyist variable).

mydt[, .N, Hobbyist]

# returns:
Hobbyist N 1: Yes 71257 2: No 17626

You can use the plain column name within data.table brackets if there is just one variable. If you want to group by two or more variables, use the . symbol. For example:

mydt[, .N, .(Hobbyist, OpenSourcer)]

To order results from highest to lowest, you can add a second set of brackets after the first. The .N symbol automatically generates a column named N (of course you can rename it if you want), so ordering by the number of rows can look something like this:

mydt[, .N, .(Hobbyist, OpenSourcer)][order(Hobbyist, -N)]

As I learn data.table code, I find it helpful to read it step by step. So I’d read this as “For all rows in mydt (since there’s nothing in the “I” spot), count number of rows, grouping by Hobbyist and OpenSourcer. Then order first by Hobbyist and then number of rows descending.” 

That’s equivalent to this dplyr code:

mydf %>%
count(Hobbyist, OpenSourcer) %>%
order(Hobbyist, -n)

If you find the tidyverse conventional multi-line approach more readable, this data.table code also works:

mydt[, .N, 
.(Hobbyist, OpenSourcer)][
order(Hobbyist, -N)
]

Add columns to a data.table

Next, I’d like add columns to see if each respondent uses R, if they use Python, if they use both, or if they use neither. The LanguageWorkedWith column has information about languages used, and a few rows of that data look like this:

Languages Worked With Sharon Machlis

Several rows of the LanguagesWorkedWith column of Stack Overflow developer survey data.

Each answer is a single character string. Most have multiple languages separated by a semicolon.

As is often the case, it’s easier to search for Python than R, since you can’t just search for “R” in the string (Ruby and Rust also contain a capital R) the way you can search for “Python”. This is the simpler code to create a TRUE/FALSE vector that checks if each string in LanguageWorkedWith contains Python:

ifelse(LanguageWorkedWith %like% "Python", TRUE, FALSE)

If you know SQL, you’ll recognize that “like” syntax. I, well, like %like%. It’s a nice streamlined way to check for pattern matching. The function documentation says it’s meant to be used inside data.table brackets, but actually you can use it in any of your code, not just with data.tables. I checked with data.table creator Matt Dowle, who said the advice to use it inside the brackets is because some extra performance optimization happens there.

Next, here’s code to add a column called PythonUser to the data.table:

dt1[, PythonUser := ifelse(LanguageWorkedWith %like% "Python", TRUE, FALSE)]

Notice the := operator. Python has an operator like that, too, and ever since I heard it called the “walrus operator,” that’s what I call it. I think it’s officially “assignment by reference.” That’s because the code above changed the existing object dt1 data.table by adding the new column — without needing to save it to a new variable.

To search for R, I’ll use the regular expression "\bR\b" which says: “Find a pattern that starts with a word boundary — the \b, then an R, and then end with another word boundary. ( I can’t just look for “R;” because the last item in each string doesn’t have a semicolon.) 

This adds an RUser column to dt1:

dt1[, RUser := ifelse(LanguageWorkedWith %like% "\bR\b", TRUE, FALSE)]

If you wanted to add both columns at once with := you would need to turn that walrus operator into a function by backquoting it, like this:

dt1[, `:=`(
PythonUser = ifelse(LanguageWorkedWith %like% "Python", TRUE, FALSE),
RUser = ifelse(LanguageWorkedWith %like% "\bR\b", TRUE, FALSE)
)]

More useful data.table operators

There are several other data.table operators worth knowing. The %between% operator has this syntax:

myvector %between% c(lower_value, upper_value)

So if I want to filter for all responses where compensation was between 50,000 and 100,000 paid in US dollars, this code works:

comp_50_100k <- dt1[CurrencySymbol == "USD" & 
ConvertedComp %between% c(50000, 100000)]

The second line above is the between condition. Note that the %between% operator includes both the lower and upper values when it checks.

Another useful operator is %chin%. It works like base R’s %in% but is optimized for speed and is for character vectors only. So, if I want to filter for all rows where the OpenSourcer column was either “Never” or “Less than once per year” this code works:

rareos <- dt1[OpenSourcer %chin% c("Never", "Less than once per year")]

This is pretty similar to base R, except that base R must specify the data frame name inside the bracket and also requires a comma after the filter expression:

rareos_df <- df1[df1$OpenSourcer %in% c("Never", "Less than once per year"),]

The new fcase() function

For this final demo, I’ll start by creating a new data.table with just people who reported compensation in US dollars:

usd <- dt1[CurrencySymbol == "USD" & !is.na(ConvertedComp)]

Next, I’ll create a new column called Language for whether someone uses just R, just Python, both, or neither. And I’ll use the new fcase() function. At the time this article was published, fcase() was available only in data.table’s development version. If you already have data.table installed, you can update to the latest dev version with this command: 

data.table::update.dev.pkg()

The fcase() function is similar to SQL’s CASE WHEN statement and dplyr’s case_when() function. The basic syntax is fcase(condition1, "value1", condition2, "value2") and so on. A default value for “everything else” can be added with default = value.

Here is code to create the new Language column:

usd[, Language := fcase(
RUser & !PythonUser, "R",
PythonUser & !RUser, "Python",
PythonUser & RUser, "Both",
!PythonUser & !RUser, "Neither"
)]

I put each condition on a separate line because I find it easier to read, but you don’t have to.

A caution: If you’re using RStudio, the data.table structure doesn’t automatically update in the top right RStudio pane after you create a new column with the walrus operator. You need to manually click the refresh icon to see changes in the number of columns.

There are a few other symbols I won’t cover in this article. You can find a list of them in the “special symbols” data.table help file by running help("special-symbols"). One of the most useful, .SD, already has its own Do More With R article and video, “How to use .SD in the R data.table package.”

For more R tips, head to the “Do More With R” page on InfoWorld or check out the “Do More With R” YouTube playlist.

This story, “R data.table symbols and operators you should know” was originally published by

InfoWorld.

Share this post if you enjoyed! 🙂

[ad_2]

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *