Comparison with `{hash}`

This vignette provides a comparison of r2r with the same-purpose CRAN package {hash}, which also offers an implementation of hash tables based on R environments. We first describe the features offered by both packages, and then perform some benchmark timing comparisons. The package versions referred to in this vignette are:

library(hash)
library(r2r)
packageVersion("hash")
#> [1] '2.2.6.3'
packageVersion("r2r")
#> [1] '0.1.2'

Features

Both r2r and hash hash tables are built on top of the R built-in environment data structure, and have thus a similar API. In particular, hash table objects have reference semantics for both packages. r2r hashtables are S3 class objects, whereas in hash the data structure is implemented as an S4 class.

Hash tables provided by r2r support arbitrary type keys and values, arbitrary key comparison and hash functions, and have customizable behaviour (either throw an exception or return a default value) upon query of a missing key.

In contrast, hash tables in hash currently support only string keys, with basic identity comparison (the hashing is performed automatically by the underlying environment objects); values can be arbitrary R objects. Querying missing keys through non-vectorized [[-subsetting returns the default value NULL, whereas queries through vectorized [-subsetting result in an error. On the other hand, hash also offers support for inverting hash tables (an experimental feature at the time of writing).

The table below summarizes the features of the two packages

Features supported by {r2r} and {hash}
Feature	r2r	hash
Basic data structure	R environment	R environment
Arbitrary type keys	X
Arbitrary type values	X	X
Arbitrary hash function	X
Arbitrary key comparison function	X
Throw or return default on missing keys	X
Hash table inversion		X

Performance tests

We will perform our benchmark tests using the CRAN package microbenchmark.

library(microbenchmark)

Key insertion

We start by timing the insertion of:

N <- 1e4

random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict to string (i.e. length one character) keys. We can generate random keys as follows:

chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE),
    sample(chars, n, replace = TRUE)
    )

set.seed(840)
keys <- random_keys(N)
values <- rnorm(N)

We test both the non-vectorized ([[<-) and vectorized ([<-) operators:

microbenchmark(
    `r2r_[[<-` = {
        for (i in seq_along(keys))
            m_r2r[[ keys[[i]] ]] <- values[[i]]
    },
    `r2r_[<-` = { m_r2r[keys] <- values },
    `hash_[[<-` = { 
        for (i in seq_along(keys))
            m_hash[[ keys[[i]] ]] <- values[[i]]
    },
    `hash_[<-` = m_hash[keys] <- values,
    
    times = 30, 
    setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
#> Unit: milliseconds
#>       expr      min        lq      mean    median        uq       max neval
#>   r2r_[[<- 73.15497 101.19601 123.23291 120.70340 149.92083 235.91341    30
#>    r2r_[<- 67.17624  85.44836 103.92107 107.57778 113.90151 163.21633    30
#>  hash_[[<- 66.88788  90.67790 106.40589 100.23766 124.09627 155.29444    30
#>   hash_[<- 34.70616  40.14854  55.36985  54.41418  67.05011  80.97422    30

As it is seen, r2r and hash have comparable performances at the insertion of key-value pairs, with both vectorized and non-vectorized insertions, hash being somewhat more efficient in both cases.

Key query

We now test key query, again both in non-vectorized and vectorized form:

microbenchmark(
    `r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
    `r2r_[` = { m_r2r[ keys ] },
    `hash_[[` = { for (key in keys) m_hash[[ key ]] },
    `hash_[` = { m_hash[ keys ] },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>     expr       min        lq      mean    median        uq       max neval
#>   r2r_[[ 88.506934 107.90197 129.94920 126.92241 143.20243 235.45419    30
#>    r2r_[ 80.246759  96.68337 124.55897 127.78043 150.00866 192.47604    30
#>  hash_[[  9.485611  10.41136  13.41568  11.43595  16.60224  19.10863    30
#>   hash_[ 54.773653  60.96693  75.73965  77.62732  84.81879 142.22554    30

For non-vectorized queries, hash is significantly faster (by one order of magnitude) than r2r. This is likely due to the fact that the [[ method dispatch is handled natively by R in hash (i.e. the default [[ method for environments is used ), whereas r2r suffers the overhead of S3 method dispatch. This is confirmed by the result for vectorized queries, which is comparable for the two packages; notice that here a single (rather than N) S3 method dispatch occurs in the r2r timed expression.

As an additional test, we perform the benchmarks for non-vectorized expressions with a new set of keys:

set.seed(841)
new_keys <- random_keys(N)
microbenchmark(
    `r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
    `hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>         expr       min       lq     mean    median        uq      max neval
#>   r2r_[[_bis 63.246906 72.27537 98.67727 106.38751 113.51598 144.8277    30
#>  hash_[[_bis  9.225976 10.33559 14.19306  13.33145  17.08293  20.8617    30

The results are similar to the ones already commented. Finally, we test the performances of the two packages in checking the existence of keys (notice that here has_key refers to r2r::has_key, whereas has.key is hash::has.key):

set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)
microbenchmark(
    r2r_has_key = { for (key in mixed_keys) has_key(m_r2r, key) },
    hash_has_key = { for (key in new_keys) has.key(key, m_hash) },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>          expr       min        lq      mean    median        uq      max neval
#>   r2r_has_key  61.66642  64.32739  85.44097  81.99933  90.27575 220.5858    30
#>  hash_has_key 182.94590 206.08397 241.14628 231.47002 276.57920 335.6451    30

The results are comparable for the two packages, r2r being slightly more performant in this particular case.

Key deletion

Finally, we test key deletion. In order to handle name collisions, we will use delete() (which refers to r2r::delete()) and del() (which refers to hash::del()).

microbenchmark(
    r2r_delete = { for (key in keys) delete(m_r2r, key) },
    hash_delete = { for (key in keys) del(key, m_hash) },
    hash_vectorized_delete = { del(keys, m_hash) },
    
    times = 30,
    setup = { 
        m_r2r <- hashmap(); m_r2r[keys] <- values
        m_hash <- hash(); m_hash[keys] <- values
    }
)
#> Unit: milliseconds
#>                    expr        min        lq       mean     median         uq
#>              r2r_delete 115.392996 147.61843 177.144172 172.255813 197.328868
#>             hash_delete  64.489795  69.71294  91.796102  88.709191 110.996443
#>  hash_vectorized_delete   1.643938   1.94541   2.616518   2.576448   3.044892
#>         max neval
#>  277.499105    30
#>  138.975424    30
#>    4.311987    30

The vectorized version of hash significantly outperforms the non-vectorized versions (by roughly two orders of magnitude in speed). Currently, r2r does not support vectorized key deletion ¹.

Conclusions

The two R packages r2r and hash offer hash table implementations with different advantages and drawbacks. r2r focuses on flexibility, and has a richer set of features. hash is more minimal, but offers superior performance in some important tasks. Finally, as a positive note for both parties, the two packages share a similar API, making it relatively easy to switch between the two, according to the particular use case needs.