Quick and easy classification column (data.table)

Quickly add higher-level classification column to data.

A very common use-case is to add a column to a table of data that indicates a category for each row where the category is dependent on the value of one of the table column. This task is so common it’s worth documenting here with an example. I’m a fan of data.table so I’ll set up a simple example using that.

Get the package

library(data.table)

I’m going to make a table with 1000,000 randomly chosen US states then add a region column specifying which of the 5 US region each state belongs to. This approach requires a key table. Here is such a table:

## Assign states to region
W <- data.table(region = "W", state = c("AK", "WA", "OR", "ID", "MT", "WY", "CA", "NV", "UT", "CO"))
MW <- data.table(region = "MW", state = c("ND", "MN", "WI", "MI",  "SD", "NE", "KS", "IA", "MO", "IL", "IN", "OH"))
NE <- data.table(region = "NE", state = c("NY", "PA", "MD", "VT", "ME", "NH", "MA", "RI", "CT", "NJ"))
SE <- data.table(region = "SE", state = c("WV", "DE", "DC", "VA", "KY", "NC", "TN", "AR", "LA", "MS", "AL", "GA", "SC", "FL"))
SW <- data.table(region = "SW", state = c("AZ", "NM", "OK", "TX", "HI"))

data.table has a useful row-binding command for combining a list of tables of the same column format into a single table:

region.key <- rbindlist(list(W, MW, NE, SE, SW))
## Let's look at a the first 15 rows:
region.key[1:15, ]
##     region state
##  1:      W    AK
##  2:      W    WA
##  3:      W    OR
##  4:      W    ID
##  5:      W    MT
##  6:      W    WY
##  7:      W    CA
##  8:      W    NV
##  9:      W    UT
## 10:      W    CO
## 11:     MW    ND
## 12:     MW    MN
## 13:     MW    WI
## 14:     MW    MI
## 15:     MW    SD

Make some test data

When developing, making manageable test data is really important. Here I’m making 1000,000 example rows that represent the type of data you might encounter in the wild.

## Make a large set of randomly chosen states and some other random content.
random.dt <- data.table(state = sample(region.key$state, 1000000, replace = TRUE), other.info = sample(1:100, 1000000, replace = TRUE))
## Look at the first 15 rows:
random.dt[1:15, ]
##     state other.info
##  1:    MS         36
##  2:    VA         70
##  3:    TN         16
##  4:    WY         51
##  5:    ID         70
##  6:    CT         59
##  7:    WV         71
##  8:    VA         90
##  9:    LA         69
## 10:    WI         21
## 11:    KS         28
## 12:    ME         91
## 13:    WV         16
## 14:    OH         54
## 15:    CO         23

Add category

That gives me my test data. When doing data analysis, there are many cases where it is necessary to assign observations to one or more of a set of categorical values. For example here I might want to know which region each state belongs to. Maybe I want to colour code a plot, or perform pivot-like grouping operations by region. So I’m going to add the region category by merging the data with the region key table. The most error-proof way to do this is to have a column in both tables that exactly share a name as here with the state columns. If we have a common name we can merge using that name:

## By specifying sort=FALSE common.name.dt will have the same row order as random.dt
common.name.dt <- merge(random.dt, region.key, by = "state", sort=FALSE)
## Look at first 15 rows:
common.name.dt[1:15, ]
##     state other.info region
##  1:    MS         36     SE
##  2:    VA         70     SE
##  3:    TN         16     SE
##  4:    WY         51      W
##  5:    ID         70      W
##  6:    CT         59     NE
##  7:    WV         71     SE
##  8:    VA         90     SE
##  9:    LA         69     SE
## 10:    WI         21     MW
## 11:    KS         28     MW
## 12:    ME         91     NE
## 13:    WV         16     SE
## 14:    OH         54     MW
## 15:    CO         23      W

## success!

data.table has a very sophisticated key hashing system whereby one or more columns can be specified as the key column(s), these get hashed so that subsequent sorting, searching and merging operations are extremely fast. For an example like this, the difference is not really noticeable, but it’s a good skill to have so here goes:

setkey(random.dt, "state")
setkey(region.key, "state")
## Calling merge without specifying a column will default to merging using the columns defined as the key.
common.name.dt.2 <- merge(random.dt, region.key)
## top 15 rows:
common.name.dt.2[1:15, ]
##     state other.info region
##  1:    AK         65      W
##  2:    AK         72      W
##  3:    AK         95      W
##  4:    AK         90      W
##  5:    AK         68      W
##  6:    AK         50      W
##  7:    AK         44      W
##  8:    AK         61      W
##  9:    AK         67      W
## 10:    AK         73      W
## 11:    AK         13      W
## 12:    AK         76      W
## 13:    AK         51      W
## 14:    AK         73      W
## 15:    AK         44      W
## Again success but note the rows were sorted by setting the key.

It’s important to note that setting a key column automatically sorts the table in alphabetic or numeric order depending on the data type. This could cause errors if I’d wanted to column-bind tables and had forgotten that the setkey command changed the order or rows. I’d suggest always using a reference column and merging tables rather than using column-binding unless you’re confident about the row order.

Using the category

It’s worth showing a quick example of using the new category column: a group operation. I’ll show the mean of the other.info column grouped by region.

region.means <- common.name.dt.2[, mean(other.info), by = "region"]
## Change the default returned column name from V1 to mean.value
setnames(region.means, old = "V1", new = "mean.value")
region.means[]
##    region mean.value
## 1:      W   50.54963
## 2:     SE   50.52125
## 3:     SW   50.51910
## 4:     NE   50.48674
## 5:     MW   50.44882

Summary

There are so many times when I need to assign rows to a higher category. Best way is to keep it simple, make a key table, then merge the two tables. Once you have one or more columns of categories it’s very easy to perform sophisticated pivot-type operations on groups of rows.

comments powered by Disqus
David Leedal
David Leedal
Data scientist

My research interests include data analysis, statistics, geostatistics and data visualisation.

Related