Build a k-nearest-neighbors (KNN) model for classification or regression tasks.

cuda_ml_knn(x, ...)

# S3 method for default
cuda_ml_knn(x, ...)

# S3 method for data.frame
cuda_ml_knn(
  x,
  y,
  algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
  metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
    "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
    "correlation"),
  p = 2,
  neighbors = 5L,
  ...
)

# S3 method for matrix
cuda_ml_knn(
  x,
  y,
  algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
  metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
    "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
    "correlation"),
  p = 2,
  neighbors = 5L,
  ...
)

# S3 method for formula
cuda_ml_knn(
  formula,
  data,
  algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
  metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
    "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
    "correlation"),
  p = 2,
  neighbors = 5L,
  ...
)

# S3 method for recipe
cuda_ml_knn(
  x,
  data,
  algo = c("brute", "ivfflat", "ivfpq", "ivfsq"),
  metric = c("euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan",
    "braycurtis", "canberra", "minkowski", "chebyshev", "jensenshannon", "cosine",
    "correlation"),
  p = 2,
  neighbors = 5L,
  ...
)

Arguments

x

Depending on the context:

* A __data frame__ of predictors.
* A __matrix__ of predictors.
* A __recipe__ specifying a set of preprocessing steps created from [recipes::recipe()].
* A __formula__ specifying the predictors and the outcome.

...

Optional arguments; currently unused.

y

A numeric vector (for regression) or factor (for classification) of desired responses.
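To illustrate how a numeric response leads to a regression-style prediction, here is a minimal base-R sketch. It assumes the common convention of averaging the responses of the nearest neighbors; `knn_regress` is a hypothetical helper written for this illustration and is not part of cuda.ml.

```r
# Illustrative only: KNN regression by averaging neighbor responses.
# `knn_regress` is a hypothetical helper, not a cuda.ml function.
knn_regress <- function(train_x, train_y, query, neighbors = 5L) {
  stopifnot(is.numeric(train_y))
  # Euclidean distance from the query to every training point.
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, query)^2))
  # Average the responses of the `neighbors` closest points.
  mean(train_y[order(dists)[seq_len(neighbors)]])
}

train_x <- matrix(c(0, 1, 2, 3, 10, 11), ncol = 1)
train_y <- c(0, 1, 2, 3, 10, 11)

# The three closest points to 1.4 are 1, 2 and 0, so the mean response is 1.
pred <- knn_regress(train_x, train_y, query = 1.4, neighbors = 3L)
```

With a factor response, the same neighbor lookup would typically be followed by a majority vote over the neighbors' labels instead of a mean.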

algo

The query algorithm to use. Must be one of "brute", "ivfflat", "ivfpq", "ivfsq" or a KNN algorithm specification constructed using the cuda_ml_knn_algo_* family of functions. If the algorithm is specified by one of the cuda_ml_knn_algo_* functions, then values of all required parameters of the algorithm will need to be specified explicitly. If the algorithm is specified by a character vector, then parameters for the algorithm are generated automatically.

Descriptions of supported algorithms:

- "brute": brute-force search; slow, but produces exact results.
- "ivfflat": inverted file; divides the dataset into partitions and searches only the relevant partitions.
- "ivfpq": inverted file with product quantization (vectors are divided into sub-vectors, and each sub-vector is encoded using intermediary k-means clusterings to provide partial information).
- "ivfsq": inverted file with scalar quantization (vector components are quantized into a reduced binary representation, allowing faster distance calculations).

Default: "brute".
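The difference between exact brute-force search and partition-based (inverted-file) search can be sketched in base R. This is a conceptual illustration under simplifying assumptions, not how cuda.ml builds its GPU indexes: the helpers `dist_to`, `knn_brute_idx` and `knn_ivf_idx`, and the `nprobe` argument name, are hypothetical and chosen only for this example.

```r
# Illustrative base-R sketch; all helper names are hypothetical and this is
# not cuda.ml's GPU implementation.

# Euclidean distance from one query point to every row of a matrix.
dist_to <- function(x, query) sqrt(rowSums(sweep(x, 2, query)^2))

# Brute-force search: rank every training point (exact result).
knn_brute_idx <- function(train_x, query, neighbors) {
  order(dist_to(train_x, query))[seq_len(neighbors)]
}

# Inverted-file-style search: scan only the `nprobe` partitions whose
# centers are closest to the query.
knn_ivf_idx <- function(train_x, query, neighbors, parts, nprobe) {
  probe <- order(dist_to(parts$centers, query))[seq_len(nprobe)]
  cand <- which(parts$cluster %in% probe)
  d <- dist_to(train_x[cand, , drop = FALSE], query)
  cand[order(d)[seq_len(min(neighbors, length(cand)))]]
}

set.seed(0L)
centers <- rbind(c(3, 3), c(-3, -3), c(-3, 3))
train_x <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(rnorm(100, centers[i, 1]), rnorm(100, centers[i, 2]))
}))
train_y <- factor(rep(seq_len(nrow(centers)), each = 100))

# Partition the training set with k-means, seeded at the known centers.
parts <- kmeans(train_x, centers = centers)

query <- c(2.5, 3.5)
exact <- knn_brute_idx(train_x, query, neighbors = 5L)
# Probing every partition scans all points, so it matches the exact search.
approx_all <- knn_ivf_idx(train_x, query, 5L, parts, nprobe = 3L)

# Classification by majority vote among the neighbors' labels.
votes <- table(train_y[exact])
pred <- names(votes)[which.max(votes)]
```

Probing fewer partitions than exist reduces the number of distance computations, at the cost of possibly missing a true neighbor that falls in an unprobed partition. That is the trade-off the approximate algorithms make.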

metric

Distance metric to use. Must be one of "euclidean", "l2", "l1", "cityblock", "taxicab", "manhattan", "braycurtis", "canberra", "minkowski", "lp", "chebyshev", "linf", "jensenshannon", "cosine", "correlation". Default: "euclidean".

p

Parameter for the Minkowski metric. If p = 1, the metric is equivalent to Manhattan distance (l1). If p = 2, the metric is equivalent to Euclidean distance (l2). Default: 2.
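The two equivalences can be checked with a small base-R sketch. The `minkowski` helper below is a hypothetical function written for illustration only, using the standard definition of the Minkowski distance; it is not part of cuda.ml.

```r
# Standard Minkowski distance between two numeric vectors.
# `minkowski` is a hypothetical helper for illustration.
minkowski <- function(a, b, p = 2) sum(abs(a - b)^p)^(1 / p)

a <- c(1, 2, 3)
b <- c(4, 0, 3)

d1 <- minkowski(a, b, p = 1)  # 3 + 2 + 0 = 5
d2 <- minkowski(a, b, p = 2)  # sqrt(9 + 4) = sqrt(13)

# Reference values: Manhattan (l1) and Euclidean (l2) distances.
manhattan <- sum(abs(a - b))
euclidean <- sqrt(sum((a - b)^2))
```

Base R's `dist()` also accepts `method = "minkowski"` with a `p` argument, which gives another way to cross-check these values.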

neighbors

Number of nearest neighbors to query. Default: 5L.

formula

A formula specifying the outcome terms on the left-hand side, and the predictor terms on the right-hand side.

data

When a __recipe__ or __formula__ is used, data is specified as a __data frame__ containing the predictors and (if applicable) the outcome.

Value

A KNN model that can be used with the 'predict' S3 generic to make predictions on new data points. The model object contains the following:

- "knn_index": a GPU pointer to the KNN index.
- "algo": enum value of the algorithm being used for the KNN query.
- "metric": enum value of the distance metric used in KNN computations.
- "p": parameter for the Minkowski metric.
- "n_samples": number of input data points.
- "n_dims": dimension of each input data point.

Examples

library(cuda.ml)
library(MASS)
library(magrittr)
library(purrr)

set.seed(0L)

centers <- list(c(3, 3), c(-3, -3), c(-3, 3))

gen_pts <- function(cluster_sz) {
  pts <- centers %>%
    map(~ mvrnorm(cluster_sz, mu = .x, Sigma = diag(2)))

  rlang::exec(rbind, !!!pts) %>% as.matrix()
}

gen_labels <- function(cluster_sz) {
  seq_along(centers) %>%
    sapply(function(x) rep(x, cluster_sz)) %>%
    factor()
}

sample_cluster_sz <- 1000
sample_pts <- cbind(
  gen_pts(sample_cluster_sz) %>% as.data.frame(),
  label = gen_labels(sample_cluster_sz)
)

model <- cuda_ml_knn(label ~ ., sample_pts, algo = "ivfflat", metric = "euclidean")

test_cluster_sz <- 10
test_pts <- gen_pts(test_cluster_sz) %>% as.data.frame()

predictions <- predict(model, test_pts)
print(predictions, n = 30)
#> # A tibble: 30 × 1
#>    .pred_class
#>    <fct>
#>  1 1
#>  2 1
#>  3 1
#>  4 1
#>  5 1
#>  6 1
#>  7 1
#>  8 1
#>  9 1
#> 10 1
#> 11 1
#> 12 1
#> 13 1
#> 14 1
#> 15 1
#> 16 1
#> 17 1
#> 18 1
#> 19 1
#> 20 1
#> 21 1
#> 22 1
#> 23 1
#> 24 1
#> 25 1
#> 26 1
#> 27 1
#> 28 1
#> 29 1
#> 30 1