R Data Types and "Gotchas"
Written by Scott McCoy
Some programming languages have strong data typing. Some have weak data typing. And some languages are R.
An introduction to data types in R
Some of the basic data types in R:
- Boolean (logical)
- Numeric
- Double
- Integer
- Character (string)
Some common R data structures, loosely called complex types here (not to be confused with R's basic complex-number type):
- Factor
- Vector
- Data frame
- Dates
Conveniently, everything in R is an object, so we can use functions to get info about variables/data and learn more about their types.
- typeof()
- class()
- length()
- attributes()
For example:
> numeric_var <- 1.5
> typeof(numeric_var)
[1] "double"
> class(numeric_var)
[1] "numeric"
> length(numeric_var)
[1] 1
> attributes(numeric_var)
NULL
Another example, this time using a factor (the %>% pipe comes from the magrittr package, and dplyr re-exports it):
> factor(levels = c("a","b","c")) %>% attributes()
$levels
[1] "a" "b" "c"
$class
[1] "factor"
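As a rough sketch, the same inspection functions can be run across several of the types listed earlier. The list name `values` and its contents are illustrative choices, not part of the original example. The point is that typeof() reports the underlying storage type while class() reports what R treats the object as, and the two differ for factors and dates.

```r
# Illustrative values, one per type discussed above
values <- list(
  logical   = TRUE,
  integer   = 1L,                     # the L suffix makes an integer literal
  double    = 1.5,
  character = "a",
  factor    = factor("a"),
  date      = as.Date("2022-01-01")
)

# typeof() reports the underlying storage type
types <- vapply(values, typeof, character(1))

# class() reports the class used for method dispatch (first element only)
classes <- vapply(values, function(v) class(v)[1], character(1))

print(data.frame(typeof = types, class = classes))
```

A factor is stored as an integer vector with a levels attribute, and a Date is stored as a double, which is why their typeof() and class() results disagree.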
The R type system has a number of interesting properties:
- R is interpreted
- R is dynamically typed
- R uses lazy evaluation
Because R is interpreted and dynamically typed, types are checked at runtime rather than in a separate compilation step. You're not going to generate an exe file with R; you just hand someone a script. This means a type error only surfaces if the offending code actually runs.
For example, this doesn't throw an error because the else clause is never run:
# at top level, else must sit on the same line as the closing brace
if (TRUE) { 1 + 1 } else { "a" + 1 }
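Lazy evaluation can be sketched in a similar way. In the snippet below, `first_only` is a hypothetical helper invented for illustration: it never uses its second argument, so R never evaluates the expression passed for it, even though that expression would raise an error.

```r
# Hypothetical helper: returns its first argument and never touches the second
first_only <- function(x, y) {
  x
}

# stop() would raise an error if evaluated, but y is never used,
# so the argument is never forced and the call succeeds
result <- first_only(1 + 1, stop("this is never evaluated"))
print(result)

# The same kind of bad code does fail as soon as it actually runs;
# tryCatch() captures the error message instead of halting the script
outcome <- tryCatch("a" + 1, error = function(e) conditionMessage(e))
print(outcome)
```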
Unlike languages that require a conversion function or an explicit cast (the new type name in parentheses) to change a value's type, R uses implicit coercion: the conversion happens automatically at runtime whenever it is possible (it usually is). R also allows explicit coercion when you want to tell it exactly what to do, usually through the as.<class_name> functions, like as.integer() or as.list().
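A minimal sketch of both kinds of coercion, using only base R. The variable names are illustrative. Implicit coercion promotes values toward the more general type, while the as.* functions convert on request and return NA with a warning when the conversion is not possible.

```r
# Implicit coercion: values are promoted toward the more general type
t_int <- typeof(TRUE + 1L)    # logical promoted to integer
t_dbl <- typeof(1L + 0.5)     # integer promoted to double
t_chr <- typeof(c(1, "a"))    # combining with a string yields character
print(c(t_int, t_dbl, t_chr))

# Explicit coercion with the as.<class_name> functions
n <- as.integer("42")
l <- as.list(1:3)
print(n)
print(length(l))

# A failed explicit coercion gives NA plus a warning, not an error;
# suppressWarnings() is used here only to keep the output quiet
bad <- suppressWarnings(as.integer("abc"))
print(bad)
```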
A type coercion "gotcha". First we create a data frame explicitly with dates...
library(tibble)  # as_tibble()
library(dplyr)   # %>% and select(), used below
example_data <-
  as_tibble(data.frame(StartDate = c(as.Date("2022-01-01"), as.Date("2022-01-31"), as.Date("2022-03-01")),
                       EndDate = c(as.Date("2022-02-01"), as.Date("2022-02-28"), as.Date("2022-03-31")),
                       Month = c("January","February","March")))
Then we create a function that adds 1 to the start date...
add_dates <- function(row) {
  row[1] + 1 # this should increment the date by 1, right?
  typeof(row[1])
}
We apply the function to the tibble and...
> example_data %>% apply(MARGIN = 1, FUN = add_dates)
Error in row[1] + 1 : non-numeric argument to binary operator
# gotcha
This happens because apply() passes each row to our add_dates() function as a vector, and vectors are homogeneous: they can only contain one type. In our tibble, the last column contains strings, so the values we explicitly created as dates get converted to strings (without any kind of notification). Then, attempting "2022-01-01" + 1 causes a type error.
Okay, so this will work better, right...? We're explicitly only passing the dates from the tibble.
> example_data %>% select(StartDate, EndDate) %>% apply(MARGIN = 1, FUN = add_dates)
Error in row[1] + 1 : non-numeric argument to binary operator
# gotcha again
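apply() is built for matrices, so it first converts the data frame to a matrix with as.matrix(). In that conversion, R reduces complex data types (like dates, factors, etc.) to basic data types (character, numeric, etc.), and dates become character strings. So once again, we've tried to add a number to a character string.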
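One possible way around this, sketched with base R only, is to skip apply() and work on the column directly, since date arithmetic is already vectorized. The data frame name `dates` and the variable `next_day` are illustrative; the values mirror the StartDate and EndDate columns above.

```r
# Rebuild the two date columns as a plain data frame (base R only)
dates <- data.frame(
  StartDate = as.Date(c("2022-01-01", "2022-01-31", "2022-03-01")),
  EndDate   = as.Date(c("2022-02-01", "2022-02-28", "2022-03-31"))
)

# The matrix conversion that apply() relies on turns the dates into strings
matrix_type <- typeof(as.matrix(dates))
print(matrix_type)

# Working on the column itself keeps the Date class, so + 1 adds one day
next_day <- dates$StartDate + 1
print(class(next_day))
print(next_day)
```

Because the column never passes through a matrix, no coercion happens and the result is still a Date vector.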
Another example, this time with vectors:
> example_vec <- c(1,2,3)
> sum(example_vec) # nothing unusual here
[1] 6
...But adding a string to the vector coerces the whole thing to strings and ruins our summing operation.
> sum(c(example_vec, "*"))
Error in sum(c(example_vec, "*")) :
invalid 'type' (character) of argument
> typeof(c(example_vec, "*"))
[1] "character"
Something similar happens with NA values (Not Available, a special value in R that means “no data here”). NA doesn't change the vector's type, but it propagates through aggregate functions and obliterates the result.
> sum(c(example_vec, NA))
[1] NA
> typeof(c(example_vec, NA))
[1] "double"
# still numeric though
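A short sketch of the usual remedy, assuming the missing values really should be ignored: sum() and many other base aggregate functions accept na.rm = TRUE, and is.na() can be used to check how much data is missing first. The names `with_na`, `total`, and `n_missing` are illustrative.

```r
example_vec <- c(1, 2, 3)
with_na <- c(example_vec, NA)

# Count the missing values before deciding to drop them
n_missing <- sum(is.na(with_na))
print(n_missing)

# na.rm = TRUE removes NA values before the sum is computed
total <- sum(with_na, na.rm = TRUE)
print(total)
```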
As you can see, mixing types in a row or vector can silently coerce your data. Worse, if you're not aware of what's going on, the resulting bugs can take forever to track down. Fortunately, as long as you keep in mind R's particular way of handling data types, a solution usually isn't too far away.