14 Handling categorical data
Machine learning on image-like data can be many things: fun (dogs vs. cats), societally useful (medical imaging), or societally harmful (surveillance). In comparison, tabular data – the bread and butter of data science – may seem more mundane.
What’s more, if you’re particularly interested in deep learning (DL), and looking for the extra benefits to be gained from big data, big architectures, and big compute, you’re much more likely to build an impressive showcase on the former instead of the latter.
So for tabular data, why not just go with random forests, or gradient boosting, or other classical methods? I can think of at least a few reasons to learn about DL for tabular data:
Even if all your features are interval-scale or ordinal, thus requiring “just” some form of (not necessarily linear) regression, applying DL may result in performance benefits due to sophisticated optimization algorithms, activation functions, layer depth, and more (plus interactions of all of these).
If, in addition, there are categorical features, DL models may profit from embedding those in continuous space, discovering similarities and relationships that go unnoticed in one-hot encoded representations.
What if most features are numeric or categorical, but there’s also text in column F and an image in column G? With DL, different modalities can be worked on by different modules that feed their outputs into a common module, to take over from there.
14.1 Agenda
In this introductory chapter, we keep the architecture straightforward. We don’t experiment with fancy optimizers or nonlinearities. Nor do we add in text or image processing. However, we do make use of embeddings, and pretty prominently at that. Thus from the above bullet list, we’ll shed a light on the second, while leaving the other two for future posts.
In a nutshell, what we’ll see is
How to create a custom dataset, tailored to the specific data you have.
How to handle a mix of numeric and categorical data.
How to extract continuous-space representations from the embedding modules.
14.2 Dataset
The dataset, Mushrooms, was chosen for its abundance of categorical columns. It is an unusual dataset to use in DL: It was designed for machine learning models to infer logical rules, as in: IF a AND NOT b OR c […], then it’s an x.
Mushrooms are classified into two groups: edible and non-edible. The dataset description lists five possible rules with their resulting accuracies. While the least we want to go into here is the hotly debated topic of whether DL is suited to, or how it could be made more suited to rule learning, we’ll allow ourselves some curiosity and check out what happens if we successively remove all columns used to construct those five rules.
library(torch)
library(purrr)
library(readr)
library(dplyr)
library(ggplot2)
library(ggrepel)
download.file(
"https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data",
destfile = "agaricus-lepiota.data"
)
<- read_csv(
mushroom_data "agaricus-lepiota.data",
col_names = c(
"poisonous",
"cap-shape",
"cap-surface",
"cap-color",
"bruises",
"odor",
"gill-attachment",
"gill-spacing",
"gill-size",
"gill-color",
"stalk-shape",
"stalk-root",
"stalk-surface-above-ring",
"stalk-surface-below-ring",
"stalk-color-above-ring",
"stalk-color-below-ring",
"veil-type",
"veil-color",
"ring-type",
"ring-number",
"spore-print-color",
"population",
"habitat"
),col_types = rep("c", 23) %>% paste(collapse = "")
%>%
) # can as well remove because there's just 1 unique value
select(-`veil-type`)
In torch
, dataset()
creates an R6 class. As with most R6 classes, there will usually be a need for an initialize()
method. Below, we use initialize()
to preprocess the data and store it in convenient pieces. More on that in a minute. Prior to that, please note the two other methods a dataset
has to implement:
.getitem(i)
. This is the whole purpose of adataset
: Retrieve and return the observation located at some index it is asked for. Which index? That’s to be decided by the caller, adataloader
. During training, usually we want to permute the order in which observations are used, while not caring about order in case of validation or test data..length()
. This method, again for use of adataloader
, indicates how many observations there are.
In our example, both methods are straightforward to implement. .getitem(i)
directly uses its argument to index into the data, and .length()
returns the number of observations:
<- dataset(
mushroom_dataset name = "mushroom_dataset",
initialize = function(indices) {
<- self$prepare_mushroom_data(mushroom_data[indices, ])
data $xcat <- data[[1]][[1]]
self$xnum <- data[[1]][[2]]
self$y <- data[[2]]
self
},
.getitem = function(i) {
<- self$xcat[i, ]
xcat <- self$xnum[i, ]
xnum <- self$y[i, ]
y
list(x = list(xcat, xnum), y = y)
},
.length = function() {
dim(self$y)[1]
},
prepare_mushroom_data = function(input) {
<- input %>%
input mutate(across(.fns = as.factor))
<- input$poisonous %>%
target_col as.integer() %>%
`-`(1) %>%
as.matrix()
<- input %>%
categorical_cols select(-poisonous) %>%
select(where(function(x) nlevels(x) != 2)) %>%
mutate(across(.fns = as.integer)) %>%
as.matrix()
<- input %>%
numerical_cols select(-poisonous) %>%
select(where(function(x) nlevels(x) == 2)) %>%
mutate(across(.fns = as.integer)) %>%
as.matrix()
list(list(torch_tensor(categorical_cols), torch_tensor(numerical_cols)),
torch_tensor(target_col))
} )
As for data storage, there is a field for the target, self$y
, but instead of the expected self$x
we see separate fields for numerical features (self$xnum
) and categorical ones (self$xcat
). This is just for convenience: The latter will be passed into embedding modules, which require its inputs to be of type torch_long()
, as opposed to most other modules that, by default, work with torch_float()
.
Accordingly, then, all prepare_mushroom_data()
does is break apart the data into those three parts.
Indispensable aside: In this dataset, really all features happen to be categorical – it’s just that for some, there are but two types. Technically, we could just have treated them the same as the non-binary features. But since normally in DL, we just leave binary features the way they are, we use this as an occasion to show how to handle a mix of various data types.
Our custom dataset
defined, we create instances for training and validation; each gets its companion dataloader
:
<- sample(1:nrow(mushroom_data), size = floor(0.8 * nrow(mushroom_data)))
train_indices <- setdiff(1:nrow(mushroom_data), train_indices)
valid_indices
<- mushroom_dataset(train_indices)
train_ds <- train_ds %>% dataloader(batch_size = 256, shuffle = TRUE)
train_dl
<- mushroom_dataset(valid_indices)
valid_ds <- valid_ds %>% dataloader(batch_size = 256, shuffle = FALSE) valid_dl
14.3 Model
In torch
, how much you modularize your models is up to you. Often, high degrees of modularization enhance readability and help with troubleshooting.
Here we factor out the embedding functionality. An embedding_module
, to be passed the categorical features only, will call torch
’s nn_embedding()
on each of them:
<- nn_module(
embedding_module
initialize = function(cardinalities) {
$embeddings = nn_module_list(lapply(cardinalities, function(x) nn_embedding(num_embeddings = x, embedding_dim = ceiling(x/2))))
self
},
forward = function(x) {
<- vector(mode = "list", length = length(self$embeddings))
embedded for (i in 1:length(self$embeddings)) {
<- self$embeddings[[i]](x[ , i])
embedded[[i]]
}torch_cat(embedded, dim = 2)
} )
The main model, when called, starts by embedding the categorical features, then appends the numerical input and continues processing:
<- nn_module(
net "mushroom_net",
initialize = function(cardinalities,
num_numerical,
fc1_dim,
fc2_dim) {$embedder <- embedding_module(cardinalities)
self$fc1 <- nn_linear(sum(map(cardinalities, function(x) ceiling(x/2)) %>% unlist()) + num_numerical, fc1_dim)
self$fc2 <- nn_linear(fc1_dim, fc2_dim)
self$output <- nn_linear(fc2_dim, 1)
self
},
forward = function(xcat, xnum) {
<- self$embedder(xcat)
embedded <- torch_cat(list(embedded, xnum$to(dtype = torch_float())), dim = 2)
all %>% self$fc1() %>%
all nnf_relu() %>%
$fc2() %>%
self$output() %>%
selfnnf_sigmoid()
} )
Now instantiate this model, passing in, on the one hand, output sizes for the linear layers, and on the other, feature cardinalities. The latter will be used by the embedding modules to determine their output sizes, following a simple rule “embed into a space of size half the number of input values”:
<- map(
cardinalities 2:ncol(mushroom_data)], compose(nlevels, as.factor)) %>%
mushroom_data[ , keep(function(x) x > 2) %>%
unlist() %>%
unname()
<- ncol(mushroom_data) - length(cardinalities) - 1
num_numerical
<- 16
fc1_dim <- 16
fc2_dim
<- net(
model
cardinalities,
num_numerical,
fc1_dim,
fc2_dim
)
<- if (cuda_is_available()) torch_device("cuda:0") else "cpu"
device
<- model$to(device = device) model
14.4 Training
The training loop now is “business as usual”:
<- optim_adam(model$parameters, lr = 0.1)
optimizer
for (epoch in 1:20) {
$train()
model<- c()
train_losses
for (b in enumerate(train_dl)) {
$zero_grad()
optimizer<- model(b$x[[1]]$to(device = device), b$x[[2]]$to(device = device))
output <- nnf_binary_cross_entropy(output, b$y$to(dtype = torch_float(), device = device))
loss $backward()
loss$step()
optimizer<- c(train_losses, loss$item())
train_losses
}
$eval()
model<- c()
valid_losses
for (b in enumerate(valid_dl)) {
<- model(b$x[[1]]$to(device = device), b$x[[2]]$to(device = device))
output <- nnf_binary_cross_entropy(output, b$y$to(dtype = torch_float(), device = device))
loss <- c(valid_losses, loss$item())
valid_losses
}
cat(sprintf("Loss at epoch %d: training: %3f, validation: %3f\n", epoch, mean(train_losses), mean(valid_losses)))
}
Loss at epoch 1: training: 0.274634, validation: 0.111689
Loss at epoch 2: training: 0.057177, validation: 0.036074
Loss at epoch 3: training: 0.025018, validation: 0.016698
Loss at epoch 4: training: 0.010819, validation: 0.010996
Loss at epoch 5: training: 0.005467, validation: 0.002849
Loss at epoch 6: training: 0.002026, validation: 0.000959
Loss at epoch 7: training: 0.000458, validation: 0.000282
Loss at epoch 8: training: 0.000231, validation: 0.000190
Loss at epoch 9: training: 0.000172, validation: 0.000144
Loss at epoch 10: training: 0.000120, validation: 0.000110
Loss at epoch 11: training: 0.000098, validation: 0.000090
Loss at epoch 12: training: 0.000079, validation: 0.000074
Loss at epoch 13: training: 0.000066, validation: 0.000064
Loss at epoch 14: training: 0.000058, validation: 0.000055
Loss at epoch 15: training: 0.000052, validation: 0.000048
Loss at epoch 16: training: 0.000043, validation: 0.000042
Loss at epoch 17: training: 0.000038, validation: 0.000038
Loss at epoch 18: training: 0.000034, validation: 0.000034
Loss at epoch 19: training: 0.000032, validation: 0.000031
Loss at epoch 20: training: 0.000028, validation: 0.000027
While loss on the validation set is still decreasing, we’ll soon see that the network has learned enough to obtain an accuracy of 100%.
14.5 Evaluation
To check classification accuracy, we re-use the validation set, seeing how we haven’t employed it for tuning anyway.
$eval()
model
<- valid_ds %>% dataloader(batch_size = valid_ds$.length(), shuffle = FALSE)
test_dl <- test_dl$.iter()
iter <- iter$.next()
b
<- model(b$x[[1]]$to(device = device), b$x[[2]]$to(device = device))
output <- output$to(device = "cpu") %>% as.array()
preds <- ifelse(preds > 0.5, 1, 0)
preds
<- data.frame(preds = preds, y = b[[2]] %>% as_array())
comp_df <- sum(comp_df$preds == comp_df$y)
num_correct <- nrow(comp_df)
num_total <- num_correct/num_total
accuracy accuracy
1
Phew. No embarrassing failure for the DL approach on a task where straightforward rules are sufficient. Plus, we’ve really been parsimonious as to network size.
Before concluding with an inspection of the learned embeddings, let’s have some fun obscuring things.
14.6 Making the task harder
The following rules (with accompanying accuracies) are reported in the dataset description.
Disjunctive rules for poisonous mushrooms, from most general
to most specific:
P_1) odor=NOT(almond.OR.anise.OR.none)
120 poisonous cases missed, 98.52% accuracy
P_2) spore-print-color=green
48 cases missed, 99.41% accuracy
P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.
(stalk-color-above-ring=NOT.brown)
8 cases missed, 99.90% accuracy
P_4) habitat=leaves.AND.cap-color=white
100% accuracy
Rule P_4) may also be
P_4') population=clustered.AND.cap_color=white
These rule involve 6 attributes (out of 22).
Evidently, there’s no distinction being made between training and test sets; but we’ll stay with our 80:20 split anyway. We’ll successively remove all mentioned attributes, starting with the three that enabled 100% accuracy, and continuing our way up. Here are the results I obtained seeding the random number generator like so:
torch_manual_seed(777)
without | accuracy |
---|---|
cap-color, population, habitat |
0.9938 |
cap-color, population, habitat, stalk-surface-below-ring, stalk-color-above-ring |
1 |
cap-color, population, habitat, stalk-surface-below-ring, stalk-color-above-ring, spore-print-color |
0.9994 |
cap-color, population, habitat, stalk-surface-below-ring, stalk-color-above-ring, spore-print-color, odor |
0.9526 |
Still 95% correct … While experiments like this are fun, it looks like they can also tell us something serious: Imagine the case of so-called “debiasing” by removing features like race, gender, or income. How many proxy variables may still be left that allow for inferring the masked attributes?