This vignette provides a comparison of r2r with the
same-purpose CRAN package {hash}
,
which also offers an implementation of hash tables based on R
environments. We first describe the features offered by both packages,
and then perform some benchmark timing comparisons. The package versions
referred to in this vignette are:
library(hash)
library(r2r)
packageVersion("hash")
#> [1] '2.2.6.3'
packageVersion("r2r")
#> [1] '0.1.2'
Both r2r and hash hash tables are built
on top of the R built-in environment
data structure, and
have thus a similar API. In particular, hash table objects have
reference semantics for both packages. r2r
hashtable
s are S3 class objects, whereas in
hash the data structure is implemented as an S4
class.
Hash tables provided by r2r
support arbitrary type keys
and values, arbitrary key comparison and hash functions, and have
customizable behaviour (either throw an exception or return a default
value) upon query of a missing key.
In contrast, hash tables in hash
currently support only
string keys, with basic identity comparison (the hashing is performed
automatically by the underlying environment
objects);
values can be arbitrary R objects. Querying missing keys through
non-vectorized [[
-subsetting returns the default value
NULL
, whereas queries through vectorized
[
-subsetting result in an error. On the other hand,
hash
also offers support for inverting hash tables (an
experimental feature at the time of writing).
The table below summarizes the features of the two packages
Feature | r2r | hash |
---|---|---|
Basic data structure | R environment | R environment |
Arbitrary type keys | X | |
Arbitrary type values | X | X |
Arbitrary hash function | X | |
Arbitrary key comparison function | X | |
Throw or return default on missing keys | X | |
Hash table inversion | X |
We will perform our benchmark tests using the CRAN package microbenchmark
.
We start by timing the insertion of:
N <- 1e4
random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict to string (i.e. length one character) keys. We can generate random keys as follows:
chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE)
)
set.seed(840)
keys <- random_keys(N)
values <- rnorm(N)
We test both the non-vectorized ([[<-
) and vectorized
([<-
) operators:
microbenchmark(
`r2r_[[<-` = {
for (i in seq_along(keys))
m_r2r[[ keys[[i]] ]] <- values[[i]]
},
`r2r_[<-` = { m_r2r[keys] <- values },
`hash_[[<-` = {
for (i in seq_along(keys))
m_hash[[ keys[[i]] ]] <- values[[i]]
},
`hash_[<-` = m_hash[keys] <- values,
times = 30,
setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[<- 73.15497 101.19601 123.23291 120.70340 149.92083 235.91341 30
#> r2r_[<- 67.17624 85.44836 103.92107 107.57778 113.90151 163.21633 30
#> hash_[[<- 66.88788 90.67790 106.40589 100.23766 124.09627 155.29444 30
#> hash_[<- 34.70616 40.14854 55.36985 54.41418 67.05011 80.97422 30
As it is seen, r2r
and hash
have comparable
performances at the insertion of key-value pairs, with both vectorized
and non-vectorized insertions, hash
being somewhat more
efficient in both cases.
We now test key query, again both in non-vectorized and vectorized form:
microbenchmark(
`r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
`r2r_[` = { m_r2r[ keys ] },
`hash_[[` = { for (key in keys) m_hash[[ key ]] },
`hash_[` = { m_hash[ keys ] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[ 88.506934 107.90197 129.94920 126.92241 143.20243 235.45419 30
#> r2r_[ 80.246759 96.68337 124.55897 127.78043 150.00866 192.47604 30
#> hash_[[ 9.485611 10.41136 13.41568 11.43595 16.60224 19.10863 30
#> hash_[ 54.773653 60.96693 75.73965 77.62732 84.81879 142.22554 30
For non-vectorized queries, hash
is significantly faster
(by one order of magnitude) than r2r
. This is likely due to
the fact that the [[
method dispatch is handled natively by
R in hash
(i.e. the default [[
method
for environment
s is used ), whereas r2r
suffers the overhead of S3 method dispatch. This is confirmed by the
result for vectorized queries, which is comparable for the two packages;
notice that here a single (rather than N
) S3 method
dispatch occurs in the r2r
timed expression.
As an additional test, we perform the benchmarks for non-vectorized expressions with a new set of keys:
set.seed(841)
new_keys <- random_keys(N)
microbenchmark(
`r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
`hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[_bis 63.246906 72.27537 98.67727 106.38751 113.51598 144.8277 30
#> hash_[[_bis 9.225976 10.33559 14.19306 13.33145 17.08293 20.8617 30
The results are similar to the ones already commented. Finally, we
test the performances of the two packages in checking the existence of
keys (notice that here has_key
refers to
r2r::has_key
, whereas has.key
is
hash::has.key
):
set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)
microbenchmark(
r2r_has_key = { for (key in mixed_keys) has_key(m_r2r, key) },
hash_has_key = { for (key in new_keys) has.key(key, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_has_key 61.66642 64.32739 85.44097 81.99933 90.27575 220.5858 30
#> hash_has_key 182.94590 206.08397 241.14628 231.47002 276.57920 335.6451 30
The results are comparable for the two packages, r2r
being slightly more performant in this particular case.
Finally, we test key deletion. In order to handle name collisions, we
will use delete()
(which refers to
r2r::delete()
) and del()
(which refers to
hash::del()
).
microbenchmark(
r2r_delete = { for (key in keys) delete(m_r2r, key) },
hash_delete = { for (key in keys) del(key, m_hash) },
hash_vectorized_delete = { del(keys, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq
#> r2r_delete 115.392996 147.61843 177.144172 172.255813 197.328868
#> hash_delete 64.489795 69.71294 91.796102 88.709191 110.996443
#> hash_vectorized_delete 1.643938 1.94541 2.616518 2.576448 3.044892
#> max neval
#> 277.499105 30
#> 138.975424 30
#> 4.311987 30
The vectorized version of hash
significantly outperforms
the non-vectorized versions (by roughly two orders of magnitude in
speed). Currently, r2r
does not support vectorized key
deletion 1.
The two R packages r2r
and hash
offer hash
table implementations with different advantages and drawbacks.
r2r
focuses on flexibility, and has a richer set of
features. hash
is more minimal, but offers superior
performance in some important tasks. Finally, as a positive note for
both parties, the two packages share a similar API, making it relatively
easy to switch between the two, according to the particular use case
needs.
This is due to complications introduced by the internal
hash collision handling system of r2r
.↩︎