Build a k-nearest-neighbors (KNN) model for classification or regression tasks.

    cuda_ml_knn(x, ...)

    # S3 method for default
    cuda_ml_knn(x, ...)

    # S3 method for data.frame
    cuda_ml_knn(
      x,
      y,
      algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
      metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
        "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon",
        "cosine", "correlation"),
      p = 2,
      neighbors = 5L,
      ...
    )

    # S3 method for matrix
    cuda_ml_knn(
      x,
      y,
      algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
      metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
        "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon",
        "cosine", "correlation"),
      p = 2,
      neighbors = 5L,
      ...
    )

    # S3 method for formula
    cuda_ml_knn(
      formula,
      data,
      algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
      metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
        "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon",
        "cosine", "correlation"),
      p = 2,
      neighbors = 5L,
      ...
    )

    # S3 method for recipe
    cuda_ml_knn(
      x,
      data,
      algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
      metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
        "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon",
        "cosine", "correlation"),
      p = 2,
      neighbors = 5L,
      ...
    )

Argument | Description |
---|---|
x | Depending on the context: * A __data frame__ of predictors. * A __matrix__ of predictors. * A __recipe__ specifying a set of preprocessing steps created from [recipes::recipe()]. * A __formula__ specifying the predictors and the outcome. |
... | Optional arguments; currently unused. |
y | A numeric vector (for regression) or factor (for classification) of desired responses. |
algo | The query algorithm to use. Must be one of "brute", "ivfflat", "ivfpq", "ivfsq", or a KNN algorithm specification object. Descriptions of supported algorithms: - "brute": brute-force search; slow, but produces exact results. - "ivfflat": inverted file; divides the dataset into partitions and searches only the relevant partitions. - "ivfpq": inverted file with product quantization (vectors are divided into sub-vectors, and each sub-vector is encoded using intermediary k-means clusterings to provide partial information). - "ivfsq": inverted file with scalar quantization (vector components are quantized into a reduced binary representation, allowing faster distance calculations). Default: "brute". |
metric | Distance metric to use. Must be one of "euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan", "braycurtis", "canberra", "minkowski", "lp", "chebyshev", "linf", "jensenshannon", "cosine", "correlation". Default: "euclidean". |
p | Parameter for the Minkowski metric. If p = 1, the metric is equivalent to Manhattan distance (l1). If p = 2, the metric is equivalent to Euclidean distance (l2). Default: 2. |
neighbors | Number of nearest neighbors to query. Default: 5L. |
formula | A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side. |
data | When a __recipe__ or __formula__ is used, `data` is the data frame containing the predictors and (if applicable) the outcome. |
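
To make the relationship between `p` and the "l1"/"l2" metrics concrete, the following base R sketch computes a Minkowski distance by hand and compares it with `stats::dist()`. The helper `minkowski_dist()` is hypothetical and is not part of cuda.ml; it only illustrates the formula and says nothing about how the GPU implementation is written.

```r
# Hypothetical helper, not part of cuda.ml: Minkowski distance of order p
# between two numeric vectors of equal length.
minkowski_dist <- function(a, b, p = 2) {
  stopifnot(length(a) == length(b), p >= 1)
  sum(abs(a - b)^p)^(1 / p)
}

a <- c(0, 0)
b <- c(3, 4)

d1 <- minkowski_dist(a, b, p = 1)  # p = 1: sum of absolute differences
d2 <- minkowski_dist(a, b, p = 2)  # p = 2: square root of summed squares

# Cross-check against the Manhattan and Euclidean methods of stats::dist().
m <- rbind(a, b)
same_as_manhattan <- isTRUE(all.equal(d1, as.numeric(dist(m, method = "manhattan"))))
same_as_euclidean <- isTRUE(all.equal(d2, as.numeric(dist(m, method = "euclidean"))))
```

Both comparisons should come out `TRUE`, which is the equivalence stated in the description of `p`.
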

A KNN model that can be used with the 'predict' S3 generic to make predictions on new data points. The model object contains the following:

- "knn_index": a GPU pointer to the KNN index.
- "algo": enum value of the algorithm being used for the KNN query.
- "metric": enum value of the distance metric used in KNN computations.
- "p": parameter for the Minkowski metric.
- "n_samples": number of input data points.
- "n_dims": dimension of each input data point.

    library(cuda.ml)
    library(MASS)
    library(magrittr)
    library(purrr)

    set.seed(0L)

    centers <- list(c(3, 3), c(-3, -3), c(-3, 3))

    gen_pts <- function(cluster_sz) {
      pts <- centers %>%
        map(~ mvrnorm(cluster_sz, mu = .x, Sigma = diag(2)))

      rlang::exec(rbind, !!!pts) %>% as.matrix()
    }

    gen_labels <- function(cluster_sz) {
      seq_along(centers) %>%
        sapply(function(x) rep(x, cluster_sz)) %>%
        factor()
    }

    sample_cluster_sz <- 1000
    sample_pts <- cbind(
      gen_pts(sample_cluster_sz) %>% as.data.frame(),
      label = gen_labels(sample_cluster_sz)
    )

    model <- cuda_ml_knn(label ~ ., sample_pts, algo = "ivfflat", metric = "euclidean")

    test_cluster_sz <- 10
    test_pts <- gen_pts(test_cluster_sz) %>% as.data.frame()

    predictions <- predict(model, test_pts)
    print(predictions, n = 30)
    #> # A tibble: 30 × 1
    #>    .pred_class
    #>    <fct>
    #>  1 1
    #>  2 1
    #>  3 1
    #>  4 1
    #>  5 1
    #>  6 1
    #>  7 1
    #>  8 1
    #>  9 1
    #> 10 1
    #> 11 1
    #> 12 1
    #> 13 1
    #> 14 1
    #> 15 1
    #> 16 1
    #> 17 1
    #> 18 1
    #> 19 1
    #> 20 1
    #> 21 1
    #> 22 1
    #> 23 1
    #> 24 1
    #> 25 1
    #> 26 1
    #> 27 1
    #> 28 1
    #> 29 1
    #> 30 1
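
For intuition about what an exact ("brute") query computes, here is a minimal CPU-only sketch in base R. It is an illustration under stated assumptions, not the cuda.ml implementation: the function `brute_knn_classify()` is hypothetical, it uses Euclidean distance only, and it resolves voting ties by factor level order, which may differ from what the GPU backend does.

```r
# Hypothetical CPU reference, not part of cuda.ml: exact k-nearest-neighbor
# classification by scanning every training point (Euclidean distance).
brute_knn_classify <- function(train_x, train_y, test_x, neighbors = 5L) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  train_y <- as.factor(train_y)
  stopifnot(
    nrow(train_x) == length(train_y),
    ncol(train_x) == ncol(test_x),
    neighbors <= nrow(train_x)
  )

  preds <- vapply(seq_len(nrow(test_x)), function(i) {
    # Squared Euclidean distance from test point i to every training point.
    diffs <- sweep(train_x, 2, test_x[i, ])
    sq_dist <- rowSums(diffs^2)
    # Indices of the `neighbors` closest training points.
    nn <- order(sq_dist)[seq_len(neighbors)]
    # Majority vote; ties go to the first level in factor order.
    votes <- table(train_y[nn])
    names(votes)[which.max(votes)]
  }, character(1))

  factor(preds, levels = levels(train_y))
}

# Tiny two-cluster example: three points near the origin, three near (5, 5).
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- factor(c("a", "a", "a", "b", "b", "b"))
test_x <- rbind(c(0.2, 0.1), c(5.5, 5.5))

pred <- brute_knn_classify(train_x, train_y, test_x, neighbors = 3L)
print(pred)
```

Every query scans all training points, which is why the brute-force approach is exact but slow on large datasets; the inverted-file algorithms trade some accuracy for speed by searching only selected partitions.
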