cuML API Reference
Preprocessing
Model Selection and Data Splitting
cuml.preprocessing.model_selection.train_test_split(X: cudf.dataframe.dataframe.DataFrame, y: Union[str, cudf.dataframe.series.Series], train_size: Union[float, int] = 0.8, shuffle: bool = True, seed: int = None) → Tuple[cudf.dataframe.dataframe.DataFrame, cudf.dataframe.dataframe.DataFrame, cudf.dataframe.dataframe.DataFrame, cudf.dataframe.dataframe.DataFrame]
Partitions the data into four collated dataframes, mimicking sklearn’s train_test_split.
 Parameters
 X: cudf.DataFrame
Data to split, has shape (n_samples, n_features)
 y: str or cudf.Series
Set of labels for the data, either a series of shape (n_samples) or the string label of a column in X containing the labels
 train_size: float or int, optional
If a float, represents the proportion [0, 1] of the data to be assigned to the training set. If an int, represents the number of instances to be assigned to the training set. Defaults to 0.8
 shuffle: bool, optional
Whether or not to shuffle inputs before splitting
 seed: int, optional
If shuffle is true, seeds the generator. Unseeded by default
 Returns
 X_train, X_test, y_train, y_test: cudf.DataFrame
Partitioned dataframes. If y was provided as a column name, that column is dropped from X_train and X_test
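The train_size rules above can be illustrated without a GPU. The following is a minimal sketch in plain Python, not the cuML implementation: the helper name split_indices is hypothetical, and rounding the float case down is an assumption, since the reference does not state how a fractional row count is resolved.

```python
import random

def split_indices(n_samples, train_size=0.8, shuffle=True, seed=None):
    """Return (train_idx, test_idx) following the documented train_size rules."""
    if isinstance(train_size, float):
        # A float is a proportion of the data in [0, 1].
        # Assumption: a fractional row count is rounded down.
        n_train = int(n_samples * train_size)
    else:
        # An int is an absolute number of training rows.
        n_train = train_size
    indices = list(range(n_samples))
    if shuffle:
        # Seeding makes the shuffle reproducible; seed=None leaves it unseeded.
        random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(10, train_size=0.8, seed=0)
print(len(train_idx), len(test_idx))
```

Splitting on shuffled row indices, rather than on the data itself, keeps X and y aligned: the same index lists select the rows of both.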
Label Encoding
class cuml.preprocessing.LabelEncoder(*args, **kwargs)
An nvcategory-based implementation of ordinal label encoding
Examples
Converting a categorical implementation to a numerical one
Output:
Methods
inverse_transform
fit(self, y: cudf.dataframe.series.Series) → 'LabelEncoder'
Fit a LabelEncoder (nvcategory) instance to a set of categories
 Parameters
 y: cudf.Series
Series containing the categories to be encoded. Its elements may or may not be unique
 Returns
 self: LabelEncoder
A fitted instance of itself to allow method chaining
fit_transform(self, y: cudf.dataframe.series.Series) → cudf.dataframe.series.Series
Simultaneously fit and transform an input
This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)
transform(self, y: cudf.dataframe.series.Series) → cudf.dataframe.series.Series
Transform an input into its categorical keys.
This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.
 Parameters
 y: cudf.Series
Input keys to be transformed. Its values should match the categories given to fit
 Returns
 encoded: cudf.Series
The ordinally encoded input series
 Raises
 KeyError
If a category appears that was not seen in fit
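The fit, transform and KeyError behaviour described above can be sketched in plain Python. This is an illustration only, not the nvcategory implementation: the class name SimpleLabelEncoder is hypothetical, and assigning codes in sorted order of the unique categories is an assumption.

```python
class SimpleLabelEncoder:
    """Pure-Python stand-in for ordinal label encoding (not the cuML class)."""

    def fit(self, y):
        # Assumption: codes follow the sorted order of the unique categories.
        self.classes_ = sorted(set(y))
        self._codes = {label: code for code, label in enumerate(self.classes_)}
        return self  # returning self allows method chaining

    def transform(self, y):
        # A label that was not seen in fit raises KeyError, as documented.
        return [self._codes[label] for label in y]

    def fit_transform(self, y):
        return self.fit(y).transform(y)

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]

encoder = SimpleLabelEncoder()
print(encoder.fit_transform(["b", "a", "c", "a"]))  # [1, 0, 2, 0]
```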
Regression and Classification
Linear Regression

class cuml.LinearRegression
LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides two algorithms, SVD and Eig, to fit a linear model. SVD is more stable, but Eig (the default) is much faster.
 Parameters
 algorithm: ‘eig’ or ‘svd’ (default = ‘eig’)
Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable.
 fit_intercept: boolean (default = True)
If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalize: boolean (default = False)
If True, each predictor in X is normalized by dividing it by its L2 norm. If False, no scaling will be done.
Notes
LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to address multicollinearity, and consider removing outliers first, for example with DBSCAN or with a statistical filter.
Applications of LinearRegression
LinearRegression is used in regression tasks such as predicting sales or house prices. It is also used in extrapolation and time series tasks, dynamic systems modelling and many other machine learning tasks. It should be the first model tried when the problem is a regression task (predicting a continuous variable).
For additional docs, see scikit-learn’s OLS.
For an additional example see the OLS notebook.
Examples
    import numpy as np
    import cudf

    # Both import methods supported
    from cuml import LinearRegression
    from cuml.linear_model import LinearRegression

    lr = LinearRegression(fit_intercept = True, normalize = False, algorithm = "eig")

    X = cudf.DataFrame()
    X['col1'] = np.array([1,1,2,2], dtype = np.float32)
    X['col2'] = np.array([1,2,2,3], dtype = np.float32)

    y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

    reg = lr.fit(X,y)
    print("Coefficients:")
    print(reg.coef_)
    print("Intercept:")
    print(reg.intercept_)

    X_new = cudf.DataFrame()
    X_new['col1'] = np.array([3,2], dtype = np.float32)
    X_new['col2'] = np.array([5,5], dtype = np.float32)
    preds = lr.predict(X_new)

    print("Predictions:")
    print(preds)
Output:
    Coefficients:
    0    1.0000001
    1    1.9999998
    Intercept:
    3.0
    Predictions:
    0    15.999999
    1    14.999999
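The coefficients in the output above can be checked by hand, because the example data satisfies y = 3 + 1 * col1 + 2 * col2 exactly. The sketch below solves the least-squares normal equations for two predictors in plain Python. It is not how cuML computes the fit (cuML uses the eigendecomposition or SVD described above), and the helper name fit_ols_two_features is hypothetical.

```python
def fit_ols_two_features(rows, targets):
    """Least-squares fit with an intercept for exactly two predictors."""
    n = len(rows)
    mean_x1 = sum(r[0] for r in rows) / n
    mean_x2 = sum(r[1] for r in rows) / n
    mean_y = sum(targets) / n
    # Centering the data is how the intercept is handled (fit_intercept=True).
    c1 = [r[0] - mean_x1 for r in rows]
    c2 = [r[1] - mean_x2 for r in rows]
    cy = [t - mean_y for t in targets]
    # Entries of the 2x2 system (X^T X) w = X^T y on the centered data.
    s11 = sum(a * a for a in c1)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s22 = sum(b * b for b in c2)
    t1 = sum(a * t for a, t in zip(c1, cy))
    t2 = sum(b * t for b, t in zip(c2, cy))
    # Solve the 2x2 system with Cramer's rule.
    det = s11 * s22 - s12 * s12
    w1 = (t1 * s22 - t2 * s12) / det
    w2 = (t2 * s11 - t1 * s12) / det
    intercept = mean_y - w1 * mean_x1 - w2 * mean_x2
    return (w1, w2), intercept

rows = [(1, 1), (1, 2), (2, 2), (2, 3)]
targets = [6.0, 8.0, 9.0, 11.0]
(w1, w2), intercept = fit_ols_two_features(rows, targets)
print(w1, w2, intercept)
print([intercept + w1 * a + w2 * b for a, b in [(3, 5), (2, 5)]])
```

The exact solution is coefficients 1 and 2 with intercept 3, giving predictions 16 and 15; the float32 output above differs from these only by rounding error.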
 Attributes
 coef_: array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_: array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y): Fit the model with X and y.
get_params(self[, deep]): Sklearn style return parameter state
predict(self, X): Predicts the y for X.
set_params(self, **params): Sklearn style set parameter state to dictionary of params.

fit(self, X, y)
Fit the model with X and y.
 Parameters
 X: cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

get_params(self, deep=True)
Sklearn style return parameter state
 Parameters
 deep: boolean (default = True)

predict(self, X)
Predicts the y for X.
 Parameters
 X: cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
 Returns
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)
Sklearn style set parameter state to dictionary of params.
 Parameters
 params: dict of new params
Logistic Regression

class cuml.LogisticRegression
LogisticRegression is a linear model that is used to model the probability of occurrence of certain events, for example the probability of success or failure of an event.
cuML’s LogisticRegression can take array-like objects, either on the host as NumPy arrays or on the device (as Numba or __cuda_array_interface__ compliant arrays). It provides both single-class (using sigmoid loss) and multiple-class (using softmax loss) variants, depending on the input variables.
Only one solver option is currently available: Quasi-Newton (QN) algorithms. Even though it is presented as a single option, this solver resolves to two different algorithms underneath:
Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization
Limited Memory BFGS (L-BFGS) otherwise
Note that, just like in scikit-learn, the bias will not be regularized.
 Parameters
 penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)
Used to specify the norm used in the penalization. If ‘none’ or ‘l2’ is selected, the L-BFGS solver will be used. If ‘l1’ is selected, the OWL-QN solver will be used. If ‘elasticnet’ is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used.
 tol: float (default = 1e-4)
The training process will stop if current_loss > previous_loss - tol
 C: float (default = 1.0)
Inverse of regularization strength; must be a positive float.
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 class_weight: None
Custom class weights are currently not supported.
 max_iter: int (default = 1000)
Maximum number of iterations taken for the solvers to converge.
 verbose: bool (optional, default False)
Controls verbosity of logging.
 l1_ratio: float or None, optional (default = None)
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1
 solver: ‘qn’, ‘lbfgs’, ‘owl’ (default = ‘qn’)
Algorithm to use in the optimization problem. Currently only qn is supported, which automatically selects either L-BFGS or OWL-QN depending on the conditions of the l1 regularization described above. Options ‘lbfgs’ and ‘owl’ are just convenience values that end up using the same solver following the same rules.
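The selection rule stated under penalty can be written as a small lookup. This is only a restatement of the documented rule in plain Python: the function name select_qn_algorithm is hypothetical, and treating l1_ratio = None like 0 is an assumption.

```python
def select_qn_algorithm(penalty="l2", l1_ratio=None):
    """Return 'L-BFGS' or 'OWL-QN' following the rules in the penalty description."""
    if penalty in ("none", "l2"):
        return "L-BFGS"
    if penalty == "l1":
        return "OWL-QN"
    if penalty == "elasticnet":
        # OWL-QN is only needed when the L1 part of the penalty is active.
        # Assumption: l1_ratio=None is treated like 0 here.
        return "OWL-QN" if (l1_ratio or 0.0) > 0 else "L-BFGS"
    raise ValueError(f"unknown penalty: {penalty!r}")

for penalty, ratio in [("none", None), ("l2", None), ("l1", None),
                       ("elasticnet", 0.0), ("elasticnet", 0.5)]:
    print(penalty, ratio, select_qn_algorithm(penalty, ratio))
```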
Notes
cuML’s LogisticRegression uses a different solver than the equivalent scikit-learn estimator, except when there is no penalty and solver=lbfgs is chosen in scikit-learn. This can cause small differences in the coefficients and predictions of the model, similar to the differences seen when using different solvers in scikit-learn.
For additional docs, see scikit-learn’s LogisticRegression <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>.
Examples
    import cudf
    import numpy as np

    # Both import methods supported
    # from cuml import LogisticRegression
    from cuml.linear_model import LogisticRegression

    X = cudf.DataFrame()
    X['col1'] = np.array([1,1,2,2], dtype = np.float32)
    X['col2'] = np.array([1,2,2,3], dtype = np.float32)
    y = cudf.Series( np.array([0.0, 0.0, 1.0, 1.0], dtype = np.float32) )

    reg = LogisticRegression()
    reg.fit(X,y)

    print("Coefficients:")
    print(reg.coef_.copy_to_host())
    print("Intercept:")
    print(reg.intercept_.copy_to_host())

    X_new = cudf.DataFrame()
    X_new['col1'] = np.array([1,5], dtype = np.float32)
    X_new['col2'] = np.array([2,5], dtype = np.float32)

    preds = reg.predict(X_new)

    print("Predictions:")
    print(preds)
Output:
 Attributes
 coef_: device array, shape (n_classes, n_features)
The estimated coefficients for the logistic regression model.
 intercept_: device array, shape (n_classes, 1)
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y): Fit the model with X and y.
predict(self, X): Predicts the y for X.

fit(self, X, y)
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 y: array-like (device or host), shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy

predict(self, X)
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 Returns
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)
Ridge Regression

class cuml.Ridge
Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.
cuML’s Ridge expects an array-like object or cuDF DataFrame, and provides three algorithms to fit a linear model: SVD, Eig and CD. SVD is more stable, but Eig (the default) is much faster. CD uses Coordinate Descent and can be faster when the data is large.
 Parameters
 alpha: float or double
Regularization strength; must be a positive float. Larger values specify stronger regularization. Array input will be supported later.
 solver: ‘eig’ or ‘svd’ or ‘cd’ (default = ‘eig’)
Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable. CD, or Coordinate Descent, is very fast and is suitable for large problems.
 fit_intercept: boolean (default = True)
If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalize: boolean (default = False)
If True, each predictor in X is normalized by dividing it by its L2 norm. If False, no scaling will be done.
Notes
Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not exactly zero. This can make the coefficients harder to interpret. Consider using Lasso, or thresholding small coefficients to zero.
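The shrinking behaviour can be seen in the simplest possible case. The sketch below assumes one predictor, no intercept, and the objective sum((y - w * x)^2) + alpha * w^2, whose minimizer has the closed form w = sum(x * y) / (sum(x * x) + alpha). The function name ridge_coefficient is hypothetical and the code is unrelated to cuML’s solvers.

```python
def ridge_coefficient(x, y, alpha):
    """Closed-form ridge solution for one predictor and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    # alpha only enlarges the denominator, so w shrinks but never reaches zero.
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
for alpha in (0.0, 1.0, 100.0):
    print(alpha, ridge_coefficient(x, y, alpha))
```

With alpha = 0 the coefficient is the ordinary least-squares value of 2. Larger alpha values pull it towards zero, yet for any finite alpha it stays strictly positive here.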
Applications of Ridge
Ridge Regression is used in the same way as LinearRegression, and is often preferred because it does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.
For additional docs, see scikit-learn’s Ridge.
Examples
    import numpy as np
    import cudf

    # Both import methods supported
    from cuml import Ridge
    from cuml.linear_model import Ridge

    alpha = np.array([1e-5])
    ridge = Ridge(alpha = alpha, fit_intercept = True, normalize = False, solver = "eig")

    X = cudf.DataFrame()
    X['col1'] = np.array([1,1,2,2], dtype = np.float32)
    X['col2'] = np.array([1,2,2,3], dtype = np.float32)

    y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

    result_ridge = ridge.fit(X, y)
    print("Coefficients:")
    print(result_ridge.coef_)
    print("Intercept:")
    print(result_ridge.intercept_)

    X_new = cudf.DataFrame()
    X_new['col1'] = np.array([3,2], dtype = np.float32)
    X_new['col2'] = np.array([5,5], dtype = np.float32)
    preds = result_ridge.predict(X_new)

    print("Predictions:")
    print(preds)
Output:
    Coefficients:
    0    1.0000001
    1    1.9999998
    Intercept:
    3.0
    Predictions:
    0    15.999999
    1    14.999999
 Attributes
 coef_: array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_: array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y): Fit the model with X and y.
get_params(self[, deep]): Sklearn style return parameter state
predict(self, X): Predicts the y for X.
set_params(self, **params): Sklearn style set parameter state to dictionary of params.

fit(self, X, y)
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 y: array-like (device or host), shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy

get_params(self, deep=True)
Sklearn style return parameter state
 Parameters
 deep: boolean (default = True)

predict(self, X)
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 Returns
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)
Sklearn style set parameter state to dictionary of params.
 Parameters
 params: dict of new params
Lasso Regression

class cuml.Lasso
Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection, and improves the conditioning of the problem.
cuML’s Lasso expects an array-like object or cuDF DataFrame, and uses coordinate descent to fit a linear model.
 Parameters
 alpha: float or double
Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression class. For numerical reasons, using alpha = 0 with the Lasso class is not advised; use the LinearRegression class instead.
 fit_intercept: boolean (default = True)
If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalize: boolean (default = False)
If True, each predictor in X is normalized by dividing it by its L2 norm. If False, no scaling will be done.
 max_iter: int
The maximum number of iterations
 tol: float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
 selection: str, default ‘cyclic’
If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially, which is the default. Setting this to ‘random’ often leads to significantly faster convergence, especially when tol is higher than 1e-4.
Examples
    import numpy as np
    import cudf
    from cuml.linear_model import Lasso

    ls = Lasso(alpha = 0.1)

    X = cudf.DataFrame()
    X['col1'] = np.array([0, 1, 2], dtype = np.float32)
    X['col2'] = np.array([0, 1, 2], dtype = np.float32)

    y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )

    result_lasso = ls.fit(X, y)
    print("Coefficients:")
    print(result_lasso.coef_)
    print("Intercept:")
    print(result_lasso.intercept_)

    X_new = cudf.DataFrame()
    X_new['col1'] = np.array([3,2], dtype = np.float32)
    X_new['col2'] = np.array([5,5], dtype = np.float32)
    preds = result_lasso.predict(X_new)

    print("Preds:")
    print(preds)
Output:
    Coefficients:
    0    0.85
    1    0.0
    Intercept:
    0.149999
    Preds:
    0    2.7
    1    1.85
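The output above is consistent with cyclic coordinate descent using soft-thresholding. The sketch below is a plain-Python illustration under the assumption that the objective follows the scikit-learn convention, (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1; the reference does not confirm this for cuML. The function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(columns, targets, alpha, n_sweeps=100):
    """Cyclic coordinate descent on centered data; columns is a list of feature columns."""
    n = len(targets)
    col_means = [sum(col) / n for col in columns]
    y_mean = sum(targets) / n
    centered = [[v - m for v in col] for col, m in zip(columns, col_means)]
    residual = [t - y_mean for t in targets]  # all coefficients start at zero
    coef = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j, col in enumerate(centered):
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue  # a constant column carries no information
            # Correlation of feature j with the residual that excludes its own contribution.
            rho = sum(v * (r + coef[j] * v) for v, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, alpha) / scale
            residual = [r - (new_coef - coef[j]) * v for v, r in zip(col, residual)]
            coef[j] = new_coef
    intercept = y_mean - sum(c * m for c, m in zip(coef, col_means))
    return coef, intercept

coef, intercept = lasso_coordinate_descent(
    [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]], [0.0, 1.0, 2.0], alpha=0.1
)
print(coef, intercept)  # approximately [0.85, 0.0] and 0.15
```

Because the two columns are identical, the first coordinate absorbs the whole signal and the second is thresholded to zero, which is the feature-selection effect described for Lasso.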
 Attributes
 coef_: array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_: array
The independent term. If fit_intercept is False, will be 0.
For additional docs, see scikit-learn’s Lasso <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html>.
Methods
fit(self, X, y): Fit the model with X and y.
get_params(self[, deep]): Sklearn style return parameter state
predict(self, X): Predicts the y for X.
set_params(self, **params): Sklearn style set parameter state to dictionary of params.

fit(self, X, y)
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 y: array-like (device or host), shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy

get_params(self, deep=True)
Sklearn style return parameter state
 Parameters
 deep: boolean (default = True)

predict(self, X)
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 Returns
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)
Sklearn style set parameter state to dictionary of params.
 Parameters
 params: dict of new params
ElasticNet Regression

class cuml.ElasticNet
ElasticNet extends LinearRegression with combined L1 and L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improves the conditioning of the problem.
cuML’s ElasticNet expects an array-like object or cuDF DataFrame, and uses coordinate descent to fit a linear model.
 Parameters
 alpha: float or double
Constant that multiplies the penalty terms. Defaults to 1.0. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the ElasticNet object is not advised; use the LinearRegression object instead.
 l1_ratio: float
The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
 fit_intercept: boolean (default = True)
If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 normalize: boolean (default = False)
If True, each predictor in X is normalized by dividing it by its L2 norm. If False, no scaling will be done.
 max_iter: int
The maximum number of iterations
 tol: float, optional
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
 selection: str, default ‘cyclic’
If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially, which is the default. Setting this to ‘random’ often leads to significantly faster convergence, especially when tol is higher than 1e-4.
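How alpha and l1_ratio combine can be made concrete with the penalty term itself. The sketch below assumes the scikit-learn convention, alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2; the reference does not spell out the exact formula, so treat the scaling factors as an assumption. The function name elastic_net_penalty is hypothetical.

```python
def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term in the scikit-learn convention (assumed, not confirmed for cuML)."""
    l1_norm = sum(abs(w) for w in coef)
    l2_norm_squared = sum(w * w for w in coef)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1.0 - l1_ratio) * l2_norm_squared

coef = [1.0, -2.0]
print(elastic_net_penalty(coef, alpha=0.1, l1_ratio=1.0))  # pure L1 penalty
print(elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.0))  # pure L2 penalty
print(elastic_net_penalty(coef, alpha=0.1, l1_ratio=0.5))  # mixture of both
```

At the two extremes the penalty reduces to the Lasso and Ridge penalties respectively, which matches the description of l1_ratio above.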
Examples
    import numpy as np
    import cudf
    from cuml.linear_model import ElasticNet

    enet = ElasticNet(alpha = 0.1, l1_ratio=0.5)

    X = cudf.DataFrame()
    X['col1'] = np.array([0, 1, 2], dtype = np.float32)
    X['col2'] = np.array([0, 1, 2], dtype = np.float32)

    y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )

    result_enet = enet.fit(X, y)
    print("Coefficients:")
    print(result_enet.coef_)
    print("Intercept:")
    print(result_enet.intercept_)

    X_new = cudf.DataFrame()
    X_new['col1'] = np.array([3,2], dtype = np.float32)
    X_new['col2'] = np.array([5,5], dtype = np.float32)
    preds = result_enet.predict(X_new)

    print("Preds:")
    print(preds)
Output:
    Coefficients:
    0    0.448408
    1    0.443341
    Intercept:
    0.1082506
    Preds:
    0    3.67018
    1    3.22177
 Attributes
 coef_: array, shape (n_features)
The estimated coefficients for the linear regression model.
 intercept_: array
The independent term. If fit_intercept is False, will be 0.
For additional docs, see scikit-learn’s ElasticNet <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html>.
Methods
fit(self, X, y): Fit the model with X and y.
get_params(self[, deep]): Sklearn style return parameter state
predict(self, X): Predicts the y for X.
set_params(self, **params): Sklearn style set parameter state to dictionary of params.

fit(self, X, y)
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 y: array-like (device or host), shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy

get_params(self, deep=True)
Sklearn style return parameter state
 Parameters
 deep: boolean (default = True)

predict(self, X)
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 Returns
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)
Sklearn style set parameter state to dictionary of params.
 Parameters
 params: dict of new params
Stochastic Gradient Descent

class cuml.SGD
Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.
cuML’s SGD algorithm accepts a NumPy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.
 Parameters
 loss: ‘hinge’, ‘log’, ‘squared_loss’ (default = ‘squared_loss’)
‘hinge’ uses linear SVM
‘log’ uses logistic regression
‘squared_loss’ uses linear regression
 penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)
‘none’ does not perform any regularization
‘l1’ performs L1 norm (Lasso) regularization, which minimizes the sum of the absolute values of the coefficients
‘l2’ performs L2 norm (Ridge) regularization, which minimizes the sum of the squares of the coefficients
‘elasticnet’ performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms
 alpha: float (default = 0.0001)
The constant value which decides the degree of regularization
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 epochs: int (default = 1000)
The number of times the model should iterate through the entire dataset during training
 tol: float (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
 shuffle: boolean (default = True)
If True, shuffles the training data after each epoch. If False, does not shuffle the training data after each epoch
 eta0: float (default = 0.0)
Initial learning rate
 power_t: float (default = 0.5)
The exponent used for calculating the invscaling learning rate
 learning_rate: ‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’ (default = ‘constant’)
‘optimal’ will be supported in the next version
‘constant’ keeps the learning rate constant
‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs. The old learning rate is generally divided by 5
 n_iter_no_change: int (default = 5)
The number of epochs to train without any improvement in the model
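The interplay of eta0, power_t, learning_rate and n_iter_no_change can be sketched in plain Python. This is an illustration, not cuML’s code: the invscaling formula eta0 / step ** power_t is borrowed from scikit-learn and assumed to apply here, the bookkeeping in adaptive_rates is one plausible reading of the ‘adaptive’ description, and both function names are hypothetical.

```python
def learning_rate_at(step, eta0, schedule="constant", power_t=0.5):
    """Learning rate for a 1-based update step under two of the documented schedules."""
    if schedule == "constant":
        return eta0
    if schedule == "invscaling":
        # Assumption: same formula as scikit-learn, eta = eta0 / step ** power_t.
        return eta0 / step ** power_t
    raise ValueError(f"unsupported schedule: {schedule!r}")

def adaptive_rates(epoch_losses, eta0, tol=1e-3, n_iter_no_change=5):
    """Return the learning rate used in each epoch under the 'adaptive' rule."""
    eta = eta0
    best_loss = float("inf")
    stalled = 0
    rates = []
    for loss in epoch_losses:
        rates.append(eta)
        if loss > best_loss - tol:
            stalled += 1  # no sufficient improvement in this epoch
        else:
            stalled = 0
        best_loss = min(best_loss, loss)
        if stalled >= n_iter_no_change:
            eta /= 5.0  # the documented "divided by 5" step
            stalled = 0
    return rates

print([learning_rate_at(t, 0.1, "invscaling") for t in (1, 4, 16)])
print(adaptive_rates([1.0, 1.0, 1.0, 1.0], eta0=0.5, n_iter_no_change=2))
```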
Notes
For additional docs, see scikit-learn’s SGDClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html>.
Examples
Output:
    cuML intercept : 0.004561662673950195
    cuML coef : 0 0.9834546
    1 0.010128272
    dtype: float32
    cuML predictions : [3.0055666 2.0221121]
Methods
fit(self, X, y): Fit the model with X and y.
predict(self, X): Predicts the y for X.
predictClass(self, X): Predicts the y for X.

fit(self, X, y)
Fit the model with X and y.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 y: array-like (device or host), shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy

predict(self, X)
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 Returns
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

predictClass(self, X)
Predicts the y for X.
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 Returns
 y: cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)
Random Forest

class cuml.ensemble.RandomForestClassifier
Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.
Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a histogram-based algorithm to determine splits, rather than an exact count. You can tune the size of the histograms with the n_bins parameter.
Known Limitations: This is an initial preview release of the cuML Random Forest code. It contains a number of known limitations:
Only classification is supported. Regression support is planned for the next release.
The implementation relies on limited CUDA shared memory for scratch space, so models with a very large number of features or bins will generate a memory limit exception. This limitation will be lifted in the next release.
Inference/prediction takes place on the CPU. A GPU-based inference solution is planned for a near-future release.
Instances of RandomForestClassifier cannot be pickled currently.
The code is under heavy development, so users who need these features may wish to pull from nightly builds of cuML. (See https://rapids.ai/start.html for instructions to download nightly packages via conda.)
 Parameters
 n_estimators: int (default = 10)
Number of trees in the forest.
 handle: cuml.Handle
If it is None, a new one is created just for this class.
 split_algo: 0 for HIST and 1 for GLOBAL_QUANTILE (default = 0)
The algorithm to determine how nodes are split in the tree.
 bootstrap: boolean (default = True)
Controls bootstrapping. If True, each tree in the forest is built on a bootstrapped sample with replacement. If False, sampling without replacement is done.
 bootstrap_features: boolean (default = False)
Controls bootstrapping for features, that is, whether features are drawn with or without replacement.
 rows_sample: float (default = 1.0)
Ratio of dataset rows used while fitting each tree.
 max_depth: int (default = -1)
Maximum tree depth. Unlimited (i.e., until leaves are pure) if -1.
 max_leaves: int (default = -1)
Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.
 max_features: float (default = 1.0)
Ratio of number of features (columns) to consider per node split.
 n_bins: int (default = 8)
Number of bins used by the split algorithm.
 min_rows_per_node: int (default = 2)
The minimum number of samples (rows) needed to split a node.
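The idea behind a histogram-based split and the n_bins parameter can be sketched for a single feature. The code below is a conceptual illustration only: it assumes equal-width bins and Gini impurity, neither of which the reference confirms for cuML, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of integer class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_histogram_split(values, labels, n_bins=8):
    """Pick the bin edge with the lowest weighted Gini impurity.

    Only the n_bins - 1 interior edges of equal-width bins are tried,
    instead of every distinct feature value.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    best_threshold, best_score = None, float("inf")
    for k in range(1, n_bins):
        threshold = lo + k * width
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # an empty side is not a valid split
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_histogram_split(values, labels, n_bins=8))
```

A larger n_bins gives more candidate thresholds and a result closer to an exact search, at the cost of more work and more scratch memory per node.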
Examples
    import numpy as np
    from cuml.ensemble import RandomForestClassifier as cuRFC

    X = np.random.normal(size=(10,4)).astype(np.float32)
    y = np.asarray([0,1]*5, dtype=np.int32)

    cuml_model = cuRFC(max_features=1.0, n_bins=8, n_estimators=40)
    cuml_model.fit(X,y)
    cuml_predict = cuml_model.predict(X)

    print("Predicted labels : ", cuml_predict)
Output:
Predicted labels : [0 1 0 1 0 1 0 1 0 1]
Methods
fit(self, X, y): Perform Random Forest Classification on the input data
get_params(self[, deep]): Returns the value of all parameters required to configure this estimator as a dictionary.
predict(self, X): Predicts the labels for X.
score(self, X, y): Calculates the accuracy of the model for X.
set_params(self, **params): Sets the value of parameters required to configure this estimator; it functions similarly to the sklearn set_params.

fit(self, X, y)
Perform Random Forest Classification on the input data
 Parameters
 X: array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 y: array-like (device or host), shape = (n_samples, 1)
Dense vector (int32) of shape (n_samples, 1). Acceptable formats: NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy. These labels should be contiguous integers from 0 to n_classes.

get_params(self, deep=True)
Returns the value of all parameters required to configure this estimator as a dictionary.
 Parameters
 deep: boolean (default = True)

predict(self, X)
Predicts the labels for X.
 Parameters
 X: array-like (host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: NumPy ndarray, Numba device ndarray
 Returns
 y: NumPy ndarray
Dense vector (int) of shape (n_samples, 1)

score(self, X, y)
Calculates the accuracy of the model for X.
 Parameters
 X: array-like (host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: NumPy ndarray, Numba device ndarray
 y: NumPy ndarray
Dense vector (int) of shape (n_samples, 1)
 Returns
 accuracy: float

set_params(self, **params)
Sets the value of parameters required to configure this estimator; it functions similarly to the sklearn set_params.
 Parameters
 params: dict of new params
Quasi-Newton¶

class
cuml.
QN
¶ Quasi-Newton methods are used to find zeroes or local maxima and minima of functions; this class uses them to optimize a cost function.
Two algorithms are implemented underneath cuML’s QN class, and which one is executed depends on the following rule:
Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization
Limited Memory BFGS (L-BFGS) otherwise.
cuML’s QN class can take array-like objects, either on the host as NumPy arrays or on the device (as Numba or __cuda_array_interface__ compliant arrays).
 Parameters
 loss: ‘sigmoid’, ‘softmax’, ‘squared_loss’ (default = ‘squared_loss’)
‘sigmoid’: loss used for single-class logistic regression. ‘softmax’: loss used for multiclass logistic regression. ‘squared_loss’: used for normal/square loss.
 fit_intercept: boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
 l1_strength: float (default = 0.0)
l1 regularization strength (if non-zero, will run OWL-QN, else L-BFGS). Note that, as in scikit-learn, the bias will not be regularized.
 l2_strength: float (default = 0.0)
l2 regularization strength. Note that, as in scikit-learn, the bias will not be regularized.
 max_iter: int (default = 1000)
Maximum number of iterations taken for the solvers to converge.
 tol: float (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol
 linesearch_max_iter: int (default = 50)
Max number of linesearch iterations per outer iteration of the algorithm.
 lbfgs_memory: int (default = 5)
Rank of the L-BFGS inverse-Hessian approximation. Method will use O(lbfgs_memory * D) memory.
 verbose: bool (optional, default False)
Controls verbosity of logging.
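The solver-selection rule and the tol stopping test described above can be written out as two small functions. This is only an illustrative sketch: choose_solver and should_stop are hypothetical names, and the real logic lives inside cuML's solver.

```python
def choose_solver(l1_strength):
    """Selection rule from the class description: OWL-QN when l1
    regularization is active, L-BFGS otherwise."""
    return "OWL-QN" if l1_strength != 0.0 else "L-BFGS"


def should_stop(previous_loss, current_loss, tol=1e-3):
    """Stopping test as stated for tol: stop once the loss has improved
    by less than tol (or has gotten worse)."""
    return current_loss > previous_loss - tol


print(choose_solver(0.0))         # L-BFGS
print(choose_solver(0.5))         # OWL-QN
print(should_stop(1.0, 0.9995))   # True: improvement of 0.0005 is below tol
print(should_stop(1.0, 0.9))      # False: improvement of 0.1 exceeds tol
```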
Notes
This class contains implementations of two popular Quasi-Newton methods:
Limited-memory Broyden Fletcher Goldfarb Shanno (L-BFGS) [Nocedal, Wright - Numerical Optimization (1999)]
Orthant-wise limited-memory quasi-Newton (OWL-QN) [Andrew, Gao - ICML 2007] <https://www.microsoft.com/en-us/research/publication/scalable-training-of-l1-regularized-log-linear-models/>
Examples
import cudf
import numpy as np

# Both import methods supported
# from cuml import QN
from cuml.solvers import QN

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series( np.array([0.0, 0.0, 1.0, 1.0], dtype = np.float32) )

solver = QN()
solver.fit(X,y)

# Note: for now, the coefficients also include the intercept in the
# last position if fit_intercept=True
print("Coefficients:")
print(solver.coef_.copy_to_host())
print("Intercept:")
print(solver.intercept_.copy_to_host())

X_new = cudf.DataFrame()
X_new['col1'] = np.array([1,5], dtype = np.float32)
X_new['col2'] = np.array([2,5], dtype = np.float32)

preds = solver.predict(X_new)

print("Predictions:")
print(preds)
Output:
 Attributes
 coef_ : array, shape (n_classes, n_features)
The estimated coefficients for the linear regression model. Note: shape is (n_classes, n_features + 1) if fit_intercept = True.
 intercept_ : array, shape (n_classes, 1)
The independent term. If fit_intercept is False, will be 0.
Methods
fit
(self, X, y)Fit the model with X and y.
predict
(self, X)Predicts the y for X.

fit
(self, X, y)¶ Fit the model with X and y.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 y : array-like (device or host), shape = (n_samples, 1)
Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

predict
(self, X)¶ Predicts the y for X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)
Clustering¶
KMeans Clustering¶

class
cuml.
KMeans
¶ KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.
cuML’s KMeans expects an array-like object or cuDF DataFrame, and supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.
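The assign-then-average loop described above can be sketched in plain Python. This is a textbook Lloyd iteration for illustration, not cuML's GPU implementation; the function name lloyd_kmeans, the fixed starting centroids, and the exact form of the tol test are assumptions.

```python
def lloyd_kmeans(points, centroids, max_iter=300, tol=1e-4):
    """Plain-Python Lloyd iterations (illustrative only, not cuML's code)."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        # Assumed stopping rule: no centroid coordinate moved by more than tol.
        shift = max(abs(a - b)
                    for new, old in zip(new_centroids, centroids)
                    for a, b in zip(new, old))
        centroids = new_centroids
        if shift <= tol:
            break
    return labels, centroids


points = [[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]]
labels, centers = lloyd_kmeans(points, centroids=[points[0], points[3]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[1.0, 1.5], [3.5, 2.5]]
```

Starting from two of the data points, the loop settles after two passes: the first pass moves the centroids to the cluster means, and the second confirms that nothing changes.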
 Parameters
 handle : cuml.Handle
If it is None, a new one is created just for this class.
 n_clusters : int (default = 8)
The number of centroids or clusters you want.
 max_iter : int (default = 300)
The more iterations of EM, the more accurate, but slower.
 tol : float (default = 1e-4)
Stopping criterion when centroid means do not change much.
 verbose : boolean (default = 0)
If True, prints diagnostic information.
 random_state : int (default = 1)
If you want results to be the same when you restart Python, select a state.
 precompute_distances : boolean (default = ‘auto’)
Not supported yet.
 init : {‘scalable-k-means++’, ‘k-means||’, ‘random’ or an ndarray} (default = ‘scalable-k-means++’)
‘scalable-k-means++’ or ‘k-means||’: Uses fast and stable scalable k-means++ initialization. ‘random’: Choose ‘n_clusters’ observations (rows) at random from data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
 n_init : int (default = 1)
Number of times initialization is run. More is slower, but can be better.
 algorithm : “auto”
Currently uses full EM, but will support others later.
 n_gpu : int (default = 1)
Number of GPUs to use. Currently uses a single GPU, but will support multiple GPUs later.
Notes
KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, and visualize the resulting clusters with PCA, UMAP or TSNE, and verify that they look appropriate.
Applications of KMeans
The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners’ first choice of a clustering algorithm. KMeans has been extensively used when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.
For additional docs, see scikit-learn’s KMeans.
Examples
# Both import methods supported
from cuml import KMeans
from cuml.cluster import KMeans

import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c,column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]], dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=1)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)
Output:
input:

     0    1
0  1.0  1.0
1  1.0  2.0
2  3.0  2.0
3  4.0  3.0

Calling fit

labels:

0    0
1    0
2    1
3    1

cluster_centers:

     0    1
0  1.0  1.5
1  3.5  2.5
 Attributes
 cluster_centers_ : array
The coordinates of the final clusters. These represent the “mean” of each data cluster.
 labels_ : array
Which cluster each datapoint belongs to.
Methods
fit
(self, X)Compute k-means clustering with X.
fit_predict
(self, X)Compute cluster centers and predict cluster index for each sample.
fit_transform
(self, X)Compute clustering and transform X to cluster-distance space.
get_params
(self[, deep])Scikit-learn style return parameter state
predict
(self, X)Predict the closest cluster each sample in X belongs to.
set_params
(self, **params)Scikit-learn style set parameter state to dictionary of params.
transform
(self, X)Transform X to a cluster-distance space.

fit
(self, X)¶ Compute k-means clustering with X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

fit_predict
(self, X)¶ Compute cluster centers and predict cluster index for each sample.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

fit_transform
(self, X)¶ Compute clustering and transform X to cluster-distance space.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

get_params
(self, deep=True)¶ Scikit-learn style return parameter state
 Parameters
 deep : boolean (default = True)

predict
(self, X)¶ Predict the closest cluster each sample in X belongs to.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

set_params
(self, **params)¶ Scikit-learn style set parameter state to dictionary of params.
 Parameters
 params : dict of new params

transform
(self, X)¶ Transform X to a cluster-distance space.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
DBSCAN¶

class
cuml.
DBSCAN
¶ DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems if the datapoints tend to congregate in larger groups.
cuML’s DBSCAN expects an array-like object or cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.
 Parameters
 eps : float (default = 0.5)
The maximum distance between 2 points such that they reside in the same neighborhood.
 handle : cuml.Handle
If it is None, a new one is created just for this class
 min_samples : int (default = 5)
The number of samples in a neighborhood (including the point itself) required for a point to be considered a core point.
 verbose : bool
Whether to print debug output
 max_bytes_per_batch : (optional) int64
Calculate batch size using no more than this number of bytes for the pairwise distance computation. This enables a trade-off between runtime and memory usage, making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out-of-memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not cap the total memory used in the DBSCAN computation, so it cannot be set to the total memory available on the device.
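To make the roles of eps and min_samples concrete, here is a textbook DBSCAN in plain Python. It is a sketch under stated assumptions, not cuML's batched GPU algorithm: the function name dbscan is illustrative, and treating a distance exactly equal to eps as "within the neighborhood" is an assumption.

```python
import math


def dbscan(points, eps=0.5, min_samples=5):
    """Textbook DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(points)
    # Neighborhood of i: every point within eps of it, including i itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from an unlabeled core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # only core points expand the cluster
        cluster += 1
    return labels


points = [[1.0, 4.0, 4.0], [2.0, 2.0, 2.0], [5.0, 1.0, 1.0]]
print(dbscan(points, eps=1.0, min_samples=1))  # [0, 1, 2]
print(dbscan(points, eps=1.0, min_samples=2))  # [-1, -1, -1]
print(dbscan(points, eps=3.5, min_samples=2))  # [0, 0, 0]
```

With min_samples = 1 every point is its own core point, so three well-separated points form three clusters. Raising min_samples to 2 turns them all into noise, and widening eps to 3.5 chains them into one cluster.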
Notes
DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.
Applications of DBSCAN
DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.
For an additional example, see the DBSCAN notebook. For additional docs, see scikit-learn’s DBSCAN.
Examples
# Both import methods supported
from cuml import DBSCAN
from cuml.cluster import DBSCAN

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
dbscan_float.fit(gdf_float)
print(dbscan_float.labels_)
Output:
0    0
1    1
2    2
 Attributes
 labels_ : array
Which cluster each datapoint belongs to. Noisy samples are labeled as -1.
Methods
fit
(self, X)Perform DBSCAN clustering from features.
fit_predict
(self, X)Performs clustering on X and returns cluster labels.
get_param_names
(self)
fit
(self, X)¶ Perform DBSCAN clustering from features.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

fit_predict
(self, X)¶ Performs clustering on X and returns cluster labels.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 y : cuDF Series, shape (n_samples)
Cluster labels

get_param_names
(self)¶
Dimensionality Reduction and Manifold Learning¶
Principal Component Analysis¶

class
cuml.
PCA
¶ PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X in linear combinations such that each new component captures the most information or variance of the data. n_components is usually small, say 3, so that the result can be used for data visualization, data compression and exploratory analysis.
cuML’s PCA expects an array-like object or cuDF DataFrame, and provides 2 algorithms: Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K eigenvectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K eigenvectors, but might be less accurate.
 Parameters
 copy : boolean (default = True)
If True, copies the data and then removes the mean from the copy. False might cause the data to be overwritten with its mean-centered version.
 handle : cuml.Handle
If it is None, a new one is created just for this class
 iterated_power : int (default = 15)
Used in Jacobi solver. The more iterations, the more accurate, but slower.
 n_components : int (default = 1)
The number of top K singular vectors / values you want. Must be <= number(columns).
 random_state : int / None (default = None)
If you want results to be the same when you restart Python, select a state.
 svd_solver : ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)
Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.
 tol : float (default = 1e-7)
Used if algorithm = “jacobi”. Smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
 verbose : bool
Whether to print debug output
 whiten : boolean (default = False)
If True, decorrelates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multicollinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.
Notes
PCA considers linear combinations of features, specifically those that maximise global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or TSNE for a locally important embedding.
Applications of PCA
PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to visualize large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.
For an additional example see the PCA notebook. For additional docs, see scikit-learn’s PCA.
Examples
# Both import methods supported
from cuml import PCA
from cuml.decomposition import PCA

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

pca_float = PCA(n_components = 2)
pca_float.fit(gdf_float)

print(f'components: {pca_float.components_}')
print(f'explained variance: {pca_float.explained_variance_}')
exp_var = pca_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')

print(f'singular values: {pca_float.singular_values_}')
print(f'mean: {pca_float.mean_}')
print(f'noise variance: {pca_float.noise_variance_}')

trans_gdf_float = pca_float.transform(gdf_float)
print(f'transformed matrix: {trans_gdf_float}')

input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
print(f'Input Matrix: {input_gdf_float}')
Output:
components:
            0           1           2
0  0.69225764   0.5102837  0.51028395
1  0.72165036  0.48949987   0.4895003

explained variance:

0    8.510402
1  0.48959687

explained variance ratio:

0    0.9456003
1  0.054399658

singular values:

0  4.1256275
1  0.9895422

mean:

0  2.6666667
1  2.3333333
2  2.3333333

noise variance:

0  0.0

transformed matrix:
             0           1
0    2.8547091  0.42891636
1  0.121316016  0.80743366
2    2.9760244  0.37851727

Input Matrix:
           0          1          2
0  1.0000001  3.9999993        4.0
1        2.0  2.0000002  1.9999999
2  4.9999995  1.0000006        1.0
 Attributes
 components_ : array
The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
 explained_variance_ : array
How much each component explains the variance in the data given by S**2
 explained_variance_ratio_ : array
How much in % the variance is explained given by S**2/sum(S**2)
 singular_values_ : array
The top K singular values. Remember all singular values >= 0
 mean_ : array
The column-wise mean of X. Used to mean-center the data first.
 noise_variance_ : float
From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.
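The relationship between these attributes can be checked by hand against the example output above. The sketch below is plain arithmetic on the printed singular values. Dividing S**2 by n_samples - 1 is an inference from those printed numbers (it reproduces them), not a statement about cuML's internals.

```python
# Singular values printed in the example output above (3 samples).
singular_values = [4.1256275, 0.9895422]
n_samples = 3

squared = [s ** 2 for s in singular_values]

# Dividing by n_samples - 1 reproduces the printed explained variance.
explained_variance = [sq / (n_samples - 1) for sq in squared]

# Ratio as described for explained_variance_ratio_: S**2 / sum(S**2).
explained_variance_ratio = [sq / sum(squared) for sq in squared]

print([round(v, 4) for v in explained_variance])        # [8.5104, 0.4896]
print([round(r, 4) for r in explained_variance_ratio])  # [0.9456, 0.0544]
```

Because two of the three input columns are identical, the centered data has rank 2, so the two retained components account for all of the variance and the ratios sum to 1.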
Methods
fit
(self, X[, _transform])Fit the model with X.
fit_transform
(self, X[, y])Fit the model with X and apply the dimensionality reduction on X.
get_param_names
(self)
inverse_transform
(self, X)Transform data back to its original space.
transform
(self, X)Apply dimensionality reduction to X.

fit
(self, X, _transform=False)¶ Fit the model with X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 cluster labels

fit_transform
(self, X, y=None)¶ Fit the model with X and apply the dimensionality reduction on X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Training data (floats or doubles), where n_samples is the number of samples, and n_features is the number of features. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 y : ignored
 Returns
 X_new : cuDF DataFrame, shape (n_samples, n_components)

get_param_names
(self)¶

inverse_transform
(self, X)¶ Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_components)
New data (floats or doubles), where n_samples is the number of samples and n_components is the number of components. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 X_original : cuDF DataFrame, shape (n_samples, n_features)

transform
(self, X)¶ Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
New data (floats or doubles), where n_samples is the number of samples and n_features is the number of features. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 X_new : cuDF DataFrame, shape (n_samples, n_components)
Truncated SVD¶

class
cuml.
TruncatedSVD
¶ TruncatedSVD is used to compute the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, such as in the use of PCA when 3 components are used for 3D visualization.
cuML’s TruncatedSVD expects an array-like object or cuDF DataFrame, and provides 2 algorithms: Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K singular vectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K singular vectors, but might be less accurate.
 Parameters
 algorithm : ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)
Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.
 handle : cuml.Handle
If it is None, a new one is created just for this class
 n_components : int (default = 1)
The number of top K singular vectors / values you want. Must be <= number(columns).
 n_iter : int (default = 15)
Used in Jacobi solver. The more iterations, the more accurate, but slower.
 random_state : int / None (default = None)
If you want results to be the same when you restart Python, select a state.
 tol : float (default = 1e-7)
Used if algorithm = “jacobi”. Smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.
 verbose : bool
Whether to print debug output
Notes
TruncatedSVD (the randomized version [Jacobi]) works well when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust; however, this method loses a lot of accuracy when you want a large number of components.
Applications of TruncatedSVD
TruncatedSVD is also known as Latent Semantic Indexing (LSI) which tries to find topics of a word count matrix. If X previously was centered with mean removal, TruncatedSVD is the same as TruncatedPCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.
For additional examples, see the Truncated SVD notebook. For additional documentation, see scikit-learn’s TruncatedSVD docs.
Examples
# Both import methods supported
from cuml import TruncatedSVD
from cuml.decomposition import TruncatedSVD

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

tsvd_float = TruncatedSVD(n_components = 2, algorithm = "jacobi", n_iter = 20, tol = 1e-9)
tsvd_float.fit(gdf_float)

print(f'components: {tsvd_float.components_}')
print(f'explained variance: {tsvd_float.explained_variance_}')
exp_var = tsvd_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')
print(f'singular values: {tsvd_float.singular_values_}')

trans_gdf_float = tsvd_float.transform(gdf_float)
print(f'Transformed matrix: {trans_gdf_float}')

input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')
Output:
components:
            0           1           2
0  0.58725953  0.57233137   0.5723314
1  0.80939883  0.41525528   0.4152552

explained variance:

0   55.33908
1  16.660923

explained variance ratio:

0   0.7685983
1  0.23140171

singular values:

0   7.439024
1  4.0817795

Transformed matrix:
           0            1
0  5.1659107     2.512643
1  3.4638448  0.042223275
2  4.0809603    3.2164836

Input matrix:
           0          1          2
0        1.0   4.000001   4.000001
1  2.0000005  2.0000005  2.0000007
2   5.000001  0.9999999  1.0000004
 Attributes
 components_ : array
The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)
 explained_variance_ : array
How much each component explains the variance in the data given by S**2
 explained_variance_ratio_ : array
How much in % the variance is explained given by S**2/sum(S**2)
 singular_values_ : array
The top K singular values. Remember all singular values >= 0
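As a sanity check on these definitions, the printed values from the example can be reproduced with plain arithmetic. The sketch below only uses numbers shown in the example output. The comparison with the squared Frobenius norm is a standard linear-algebra fact about singular values, and it applies fully here because the example matrix has two identical columns and therefore rank 2.

```python
# Example input matrix (rows are samples) and the printed singular values.
X = [[1.0, 4.0, 4.0],
     [2.0, 2.0, 2.0],
     [5.0, 1.0, 1.0]]
singular_values = [7.439024, 4.0817795]

squared = [s ** 2 for s in singular_values]          # matches explained_variance_
ratio = [sq / sum(squared) for sq in squared]        # matches explained_variance_ratio_
frobenius_sq = sum(v * v for row in X for v in row)  # sum of all squared entries

print([round(sq, 3) for sq in squared])  # [55.339, 16.661]
print([round(r, 4) for r in ratio])      # [0.7686, 0.2314]
print(frobenius_sq)                      # 72.0
```

The two squared singular values add up to the squared Frobenius norm of X (72), which is why inverse_transform recovers the input almost exactly in the example.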
Methods
fit
(self, X[, _transform])Fit LSI model on training cudf DataFrame X.
fit_transform
(self, X)Fit LSI model to X and perform dimensionality reduction on X.
get_param_names
(self)
inverse_transform
(self, X)Transform X back to its original space.
transform
(self, X)Perform dimensionality reduction on X.

fit
(self, X, _transform=True)¶ Fit LSI model on training cuDF DataFrame X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

fit_transform
(self, X)¶ Fit LSI model to X and perform dimensionality reduction on X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 X_new : cuDF DataFrame, shape (n_samples, n_components)
Reduced version of X as a dense cuDF DataFrame

get_param_names
(self)¶

inverse_transform
(self, X)¶ Transform X back to its original space.
Returns a cuDF DataFrame X_original whose transform would be X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_components)
Dense matrix (floats or doubles) of shape (n_samples, n_components). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 X_original : cuDF DataFrame, shape (n_samples, n_features)
Note that this is always a dense cuDF DataFrame.

transform
(self, X)¶ Perform dimensionality reduction on X.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 X_new : cuDF DataFrame, shape (n_samples, n_components)
Reduced version of X. This will always be a dense DataFrame.
UMAP¶

class
cuml.
UMAP
¶ Uniform Manifold Approximation and Projection. Finds a low-dimensional embedding of the data that approximates an underlying manifold.
Adapted from https://github.com/lmcinnes/umap/blob/master/umap/umap.py
 Parameters
 n_neighbors: float (optional, default 15)
The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.
 n_components: int (optional, default 2)
The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any
 n_epochs: int (optional, default None)
The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
 learning_rate: float (optional, default 1.0)
The initial learning rate for the embedding optimization.
 init: string (optional, default ‘spectral’)
 How to initialize the low dimensional embedding. Options are:
‘spectral’: use a spectral embedding of the fuzzy 1-skeleton
‘random’: assign initial embedding positions at random.
 min_dist: float (optional, default 0.1)
The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
 spread: float (optional, default 1.0)
The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
 set_op_mix_ratio: float (optional, default 1.0)
Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
 local_connectivity: int (optional, default 1)
The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.
 repulsion_strength: float (optional, default 1.0)
Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
 negative_sample_rate: int (optional, default 5)
The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
 transform_queue_size: float (optional, default 4.0)
For transform operations (embedding new points using a trained model) this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.
 a: float (optional, default None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
 b: float (optional, default None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
 verbose: bool (optional, default False)
Controls verbosity of logging.
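The effect of set_op_mix_ratio can be illustrated on a tiny directed weight matrix. The sketch below follows the formulation used in the reference UMAP package, where the product t-norm gives the fuzzy intersection a*b and the matching union a + b - a*b; the function name combine_fuzzy_sets is hypothetical, and cuML's internal implementation is not shown in this reference.

```python
def combine_fuzzy_sets(weights, set_op_mix_ratio=1.0):
    """Symmetrize directed membership weights by blending fuzzy union and
    fuzzy intersection (product t-norm). Illustrative sketch only."""
    n = len(weights)
    combined = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = weights[i][j], weights[j][i]
            intersection = a * b           # product t-norm
            union = a + b - intersection   # corresponding fuzzy union
            combined[i][j] = (set_op_mix_ratio * union
                              + (1.0 - set_op_mix_ratio) * intersection)
    return combined


# Point 0 considers point 1 a neighbor with weight 0.8; the reverse weight is 0.5.
weights = [[0.0, 0.8],
           [0.5, 0.0]]
union_only = combine_fuzzy_sets(weights, set_op_mix_ratio=1.0)
intersection_only = combine_fuzzy_sets(weights, set_op_mix_ratio=0.0)
print(round(union_only[0][1], 6))         # 0.9
print(round(intersection_only[0][1], 6))  # 0.4
```

Either way the result is symmetric; values between 0.0 and 1.0 interpolate linearly between the two extremes.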
Notes
This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:
Specifying the random seed
Using a non-Euclidean distance metric (support for a fixed set of non-Euclidean metrics is planned for an upcoming release).
Using a precomputed pairwise distance matrix (under consideration for future releases)
Manual initialization of initial embedding positions
In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP. In particular, the reference UMAP uses an approximate kNN algorithm for large data sizes while cuml.umap always uses exact kNN.
References
Leland McInnes, John Healy, James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426
Methods
fit
(self, X[, y])Fit X into an embedded space.
fit_transform
(self, X[, y])Fit X into an embedded space and return that transformed output.
transform
(self, X)Transform X into the existing embedded space and return that transformed output.

fit
(self, X, y=None)¶ Fit X into an embedded space.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 y : array-like (device or host), shape = (n_samples, 1)
y contains a label per row. Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.

fit_transform
(self, X, y=None)¶ Fit X into an embedded space and return that transformed output.
 Parameters
 X : array-like (device or host), shape = (n_samples, n_features)
X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or any array compliant with the CUDA array interface, such as CuPy.
 Returns
 X_new : array, shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.

transform
(self, X)¶ Transform X into the existing embedded space and return that transformed output.
Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit() followed by transform().
Specifically, the transform() function is stochastic: https://github.com/lmcinnes/umap/issues/158
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
New data to be transformed. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 Returns
 X_new: array, shape (n_samples, n_components)
Embedding of the new data in low-dimensional space.
Random Projections¶

class
cuml.random_projection.
GaussianRandomProjection
¶ Gaussian Random Projection method derived from the BaseRandomProjection class.
Random projection is a dimensionality reduction technique. Random projection methods are known for their simplicity, computational efficiency, and restricted model size. This algorithm also preserves the distances between any two samples well and is thus suitable for methods that have this requirement.
The components of the random matrix are drawn from N(0, 1 / n_components).
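The sampling rule above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not cuML's implementation: the helper names (gaussian_random_matrix, project, distance) and the chosen sizes are made up for illustration. It draws each entry from N(0, 1 / n_components) and compares the distance between two samples before and after projection.

```python
import math
import random


def gaussian_random_matrix(n_components, n_features, seed=None):
    """Return an (n_components, n_features) matrix with N(0, 1/n_components) entries."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(n_components)  # standard deviation for variance 1/n_components
    return [[rng.gauss(0.0, std) for _ in range(n_features)]
            for _ in range(n_components)]


def project(matrix, x):
    """Project one sample x (length n_features) down to n_components values."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))


# Hypothetical sizes chosen only to keep the example fast.
n_features, n_components = 1000, 300
data_rng = random.Random(0)
a = [data_rng.random() for _ in range(n_features)]
b = [data_rng.random() for _ in range(n_features)]

R = gaussian_random_matrix(n_components, n_features, seed=42)
ratio = distance(project(R, a), project(R, b)) / distance(a, b)
print(f"projected / original distance: {ratio:.3f}")
```

The printed ratio should be close to 1, which is the distance-preserving behaviour the class description refers to. Larger n_components values tighten the ratio around 1.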
 Parameters
 handle: cuml.Handle
If it is None, a new one is created just for this class
 n_components: int (default = ‘auto’)
Dimensionality of the target projection space. If set to ‘auto’, the value is deduced from the Johnson–Lindenstrauss lemma, using the number of samples and the eps parameter.
The Johnson–Lindenstrauss lemma can produce a very conservative n_components value because it makes no assumption about the dataset structure.
 eps: float (default = 0.1)
Error tolerance during projection. Used by the Johnson–Lindenstrauss automatic deduction when n_components is set to ‘auto’.
 random_state: int (default = None)
Seed used to initialize the random generator
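To give a feel for how conservative the ‘auto’ setting can be, here is a small sketch of the Johnson–Lindenstrauss bound in the form used by scikit-learn's johnson_lindenstrauss_min_dim: n_components >= 4 ln(n_samples) / (eps^2 / 2 - eps^3 / 3). That cuML applies exactly this formula is an assumption. The function name jl_min_dim is hypothetical.

```python
import math


def jl_min_dim(n_samples, eps=0.1):
    """Smallest n_components satisfying the Johnson-Lindenstrauss bound (assumed form)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    denominator = (eps ** 2) / 2.0 - (eps ** 3) / 3.0
    # Truncate to an integer, as scikit-learn's helper does.
    return int(4.0 * math.log(n_samples) / denominator)


for eps in (0.5, 0.1):
    print(f"eps={eps}: n_components={jl_min_dim(1_000_000, eps)}")
```

For one million samples this gives 663 components at eps = 0.5 and 11841 at eps = 0.1. The bound depends only on n_samples and eps, never on n_features, so it can exceed the original dimensionality of the data.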
Notes
Inspired by sklearn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html
 Attributes
 gaussian_method: boolean
To be passed to base class in order to determine random matrix generation method

class
cuml.random_projection.
SparseRandomProjection
¶ Sparse Random Projection method derived from the BaseRandomProjection class.
Random projection is a dimensionality reduction technique. Random projection methods are known for their simplicity, computational efficiency, and restricted model size. This algorithm also preserves the distances between any two samples well and is thus suitable for methods that have this requirement.
A sparse random matrix is an alternative to a dense random projection matrix (e.g. Gaussian) that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data (with sparse enough matrices). If we denote s = 1 / density, the components of the random matrix are drawn from:
-sqrt(s) / sqrt(n_components) with probability 1 / (2s)
0 with probability 1 - 1 / s
+sqrt(s) / sqrt(n_components) with probability 1 / (2s)
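The three-way distribution above can be sampled directly. The sketch below is plain Python for illustration only and says nothing about how cuML generates the matrix on the GPU. The function name sparse_random_matrix and the sizes are hypothetical. The density used is 1 / sqrt(n_features), the value the density parameter takes when set to ‘auto’.

```python
import math
import random


def sparse_random_matrix(n_components, n_features, density, seed=None):
    """Sample entries from {-v, 0, +v} with v = sqrt(s) / sqrt(n_components), s = 1/density."""
    rng = random.Random(seed)
    s = 1.0 / density
    value = math.sqrt(s) / math.sqrt(n_components)
    matrix = []
    for _ in range(n_components):
        row = []
        for _ in range(n_features):
            u = rng.random()
            if u < 1.0 / (2.0 * s):      # probability 1 / (2s)
                row.append(-value)
            elif u < 1.0 / s:            # another 1 / (2s)
                row.append(value)
            else:                        # remaining 1 - 1 / s
                row.append(0.0)
        matrix.append(row)
    return matrix


n_components, n_features = 50, 400
density = 1.0 / math.sqrt(n_features)
R = sparse_random_matrix(n_components, n_features, density, seed=0)

nonzero = sum(1 for row in R for v in row if v != 0.0)
fraction = nonzero / (n_components * n_features)
print(f"requested density {density:.3f}, observed {fraction:.3f}")
```

The observed fraction of non-zero entries should land near the requested density, which is where the memory savings over a dense Gaussian matrix come from.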
 Parameters
 handle: cuml.Handle
If it is None, a new one is created just for this class
 n_components: int (default = ‘auto’)
Dimensionality of the target projection space. If set to ‘auto’, the value is deduced from the Johnson–Lindenstrauss lemma, using the number of samples and the eps parameter.
The Johnson–Lindenstrauss lemma can produce a very conservative n_components value because it makes no assumption about the dataset structure.
 density: float in range (0, 1] (default = ‘auto’)
Ratio of non-zero components in the random projection matrix.
If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).
 eps: float (default = 0.1)
Error tolerance during projection. Used by the Johnson–Lindenstrauss automatic deduction when n_components is set to ‘auto’.
 dense_output: boolean (default = True)
If set to True, the transformed matrix will be dense; otherwise it will be sparse.
 random_state: int (default = None)
Seed used to initialize the random generator
Notes
Inspired by sklearn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html
 Attributes
 gaussian_method: boolean
To be passed to base class in order to determine random matrix generation method
Neighbors¶
Nearest Neighbors¶

class
cuml.
NearestNeighbors
¶ NearestNeighbors is an unsupervised algorithm for finding the “closest” datapoint(s) to new, unseen data: it calculates a suitable “distance” between the query and every stored point and returns the top K datapoints with the smallest distance to it.
cuML’s KNN expects an array-like object or cuDF DataFrame (where automatic chunking will be done into a NumPy array in a future release), and first fits a special data structure to approximate the distance calculations, allowing querying times of O(p log n) rather than the brute-force O(np), where p is the number of features.
 Parameters
 n_neighbors: int (default = 5)
The top K closest datapoints you want the algorithm to return. If this number is large, then expect the algorithm to run slower.
 should_downcast: bool (default = False)
Currently only single precision is supported in the underlying index. Setting this to True allows double-precision input arrays to be automatically downcast to single precision.
Notes
NearestNeighbors is an instance-based (non-parametric) model. This means the data X has to be stored in order for inference to occur.
Applications of NearestNeighbors
Applications of NearestNeighbors include recommendation systems where content or collaborative filtering is used. Since NearestNeighbors is a relatively simple model, it is also used in data visualization and regression / classification tasks.
For an additional example see the NearestNeighbors notebook.
For additional docs, see scikit-learn’s NearestNeighbors.
Examples
import cudf
from cuml.neighbors import NearestNeighbors
import numpy as np

np_float = np.array([
    [1, 2, 3],  # Point 1
    [1, 2, 4],  # Point 2
    [2, 2, 4]   # Point 3
]).astype('float32')

gdf_float = cudf.DataFrame()
gdf_float['dim_0'] = np.ascontiguousarray(np_float[:, 0])
gdf_float['dim_1'] = np.ascontiguousarray(np_float[:, 1])
gdf_float['dim_2'] = np.ascontiguousarray(np_float[:, 2])

print('n_samples = 3, n_dims = 3')
print(gdf_float)

nn_float = NearestNeighbors()
nn_float.fit(gdf_float)
# get 3 nearest neighbors
distances, indices = nn_float.kneighbors(gdf_float, k=3)

print(indices)
print(distances)
Output:
import cudf

# Both import methods supported
# from cuml.neighbors import NearestNeighbors
from cuml import NearestNeighbors

n_samples = 3, n_dims = 3

   dim_0  dim_1  dim_2
0    1.0    2.0    3.0
1    1.0    2.0    4.0
2    2.0    2.0    4.0

# indices:
   index_neighbor_0  index_neighbor_1  index_neighbor_2
0                 0                 1                 2
1                 1                 0                 2
2                 2                 1                 0

# distances:
   distance_neighbor_0  distance_neighbor_1  distance_neighbor_2
0                  0.0                  1.0                  2.0
1                  0.0                  1.0                  1.0
2                  0.0                  1.0                  2.0
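The distances in the output above are consistent with squared Euclidean distances: points 1 and 3 differ by 1 in two coordinates, and the reported value is 2.0 rather than sqrt(2). The brute-force sketch below reproduces the same tables in plain Python. It is an explanatory aid only, not how cuML's GPU index works. The function name kneighbors_brute_force is hypothetical, and breaking ties by the lower index is an assumption that happens to match the output shown.

```python
def kneighbors_brute_force(points, queries, k):
    """Return (distances, indices) of the k nearest points for each query.

    Distances are squared Euclidean; ties are broken by the lower point index.
    """
    all_distances, all_indices = [], []
    for query in queries:
        scored = [
            (sum((q_j - p_j) ** 2 for q_j, p_j in zip(query, point)), index)
            for index, point in enumerate(points)
        ]
        nearest = sorted(scored)[:k]  # sorts by distance, then by index
        all_distances.append([dist for dist, _ in nearest])
        all_indices.append([index for _, index in nearest])
    return all_distances, all_indices


points = [
    [1.0, 2.0, 3.0],  # Point 1
    [1.0, 2.0, 4.0],  # Point 2
    [2.0, 2.0, 4.0],  # Point 3
]

distances, indices = kneighbors_brute_force(points, points, k=3)
for row_indices, row_distances in zip(indices, distances):
    print(row_indices, row_distances)
```

This approach costs O(np) per query, which is the brute-force baseline the class description compares against.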
Methods
fit
(self, X)Fit GPU index for performing nearest neighbor queries.
kneighbors
(self, X[, k])Query the GPU index for the k nearest neighbors of column vectors in X.

fit
(self, X)¶ Fit GPU index for performing nearest neighbor queries.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy

kneighbors
(self, X, k=None)¶ Query the GPU index for the k nearest neighbors of column vectors in X.
 Parameters
 X: array-like (device or host) shape = (n_samples, n_features)
Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, or a CUDA array interface compliant array such as CuPy
 k: Integer
Number of neighbors to search
 Returns
 distances: cuDF DataFrame or numpy ndarray
The distances of the k-nearest neighbors for each column vector in X
 indices: cuDF DataFrame or numpy ndarray
The indices of the k-nearest neighbors for each column vector in X
Time Series¶
Kalman Filter¶

class
cuml.
KalmanFilter
¶ Implements a Kalman filter. You are responsible for setting the various state variables to reasonable values; defaults will not give you a functional filter. After construction the filter will have default matrices created for you, but you must specify the values for each.
 Parameters
 dim_x: int
Number of state variables for the Kalman filter. This is used to set the default size of P, Q, and u
 dim_z: int
Number of measurement inputs.
Examples
from cuml import KalmanFilter
import numpy as np

f = KalmanFilter(dim_x=2, dim_z=1)
f.x = np.array([[2.],   # position
                [0.]])  # velocity
f.F = np.array([[1., 1.], [0., 1.]])
f.H = np.array([[1., 0.]])
f.P = np.array([[1000., 0.], [0., 1000.]])
f.R = 5
Now just perform the standard predict/update loop:
import numba.cuda

while some_condition_is_true:
    # i is a placeholder for the current measurement value
    z = numba.cuda.to_device(np.array([i]))
    f.predict()
    f.update(z)
 Attributes
 x: numba device array, numpy array or cuDF series (dim_x, 1)
Current state estimate. Any call to update() or predict() updates this variable.
 P: numba device array, numpy array or cuDF dataframe (dim_x, dim_x)
Current state covariance matrix. Any call to update() or predict() updates this variable.
 x_prior: numba device array, numpy array or cuDF series (dim_x, 1)
Prior (predicted) state estimate. The *_prior and *_post attributes are for convenience; they store the prior and posterior of the current epoch. Read only.
 P_prior: numba device array, numpy array or cuDF dataframe (dim_x, dim_x)
Prior (predicted) state covariance matrix. Read only.
 x_post: numba device array, numpy array or cuDF series (dim_x, 1)
Posterior (updated) state estimate. Read only.
 P_post: numba device array, numpy array or cuDF dataframe (dim_x, dim_x)
Posterior (updated) state covariance matrix. Read only.
 z: numba device array or cuDF series (dim_z, 1)
Last measurement used in update(). Read only.
 R: numba device array (dim_z, dim_z)
Measurement noise matrix
 Q: numba device array (dim_x, dim_x)
Process noise matrix
 F: numba device array (dim_x, dim_x)
State transition matrix
 H: numba device array (dim_z, dim_x)
Measurement function
 y: numba device array
Residual of the update step. Read only.
 K: numba device array (dim_x, dim_z)
Kalman gain of the update step. Read only.
 precision: ‘single’ or ‘double’
Whether the Kalman Filter uses single or double precision
Methods
predict
(self[, B, F, Q])Predict next state (prior) using the Kalman filter state propagation equations.
update
(self, z[, R, H])Add a new measurement (z) to the Kalman filter.

predict
(self, B=None, F=None, Q=None)¶ Predict next state (prior) using the Kalman filter state propagation equations.
 Parameters
 u: np.array
Optional control vector. If not None, it is multiplied by B to create the control input into the system.
 B: np.array(dim_x, dim_z), or None
Optional control transition matrix; a value of None will cause the filter to use self.B.
 F: np.array(dim_x, dim_x), or None
Optional state transition matrix; a value of None will cause the filter to use self.F.
 Q: np.array(dim_x, dim_x), scalar, or None
Optional process noise matrix; a value of None will cause the filter to use self.Q.

update
(self, z, R=None, H=None)¶ Add a new measurement (z) to the Kalman filter. If z is None, nothing is computed. However, x_post and P_post are updated with the prior (x_prior, P_prior), and self.z is set to None.
 Parameters
 z: array_like, shape (dim_z, 1)
Measurement for this update. z can be a scalar if dim_z is 1, otherwise it must be convertible to a column vector.
 R: np.array, scalar, or None
Optionally provide R to override the measurement noise for this one call, otherwise self.R will be used.
 H: np.array, or None
Optionally provide H to override the measurement function for this one call, otherwise self.H will be used.
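For readers who want to see what a predict/update cycle computes, the following pure-Python sketch applies the textbook Kalman equations to the position/velocity setup from the class example. It is not cuML's GPU code. The helper names are hypothetical, the control input is omitted, Q is assumed to be zero (the example above does not set it), and the measurements 3, 4, 5 are invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def predict(x, P, F, Q):
    """Prior: x = F x and P = F P F^T + Q (no control input in this sketch)."""
    x_prior = matmul(F, x)
    FPFt = matmul(matmul(F, P), transpose(F))
    P_prior = [[FPFt[i][j] + Q[i][j] for j in range(len(Q))]
               for i in range(len(Q))]
    return x_prior, P_prior


def update(x, P, z, H, R):
    """Posterior for a scalar measurement z (dim_z = 1), so S is a plain number."""
    y = z - matmul(H, x)[0][0]           # residual
    PHt = matmul(P, transpose(H))        # shape (dim_x, 1)
    S = matmul(H, PHt)[0][0] + R         # innovation variance
    K = [[row[0] / S] for row in PHt]    # Kalman gain, shape (dim_x, 1)
    x_post = [[x_i[0] + k_i[0] * y] for x_i, k_i in zip(x, K)]
    KH = matmul(K, H)
    n = len(x)
    I_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(n)]
            for i in range(n)]
    P_post = matmul(I_KH, P)             # P = (I - K H) P
    return x_post, P_post


x = [[2.0], [0.0]]                       # position, velocity
F = [[1.0, 1.0], [0.0, 1.0]]
H = [[1.0, 0.0]]
P = [[1000.0, 0.0], [0.0, 1000.0]]
R = 5.0
Q = [[0.0, 0.0], [0.0, 0.0]]             # assumption: no process noise

history = []
for z in (3.0, 4.0, 5.0):                # made-up position measurements
    x, P = predict(x, P, F, Q)
    x, P = update(x, P, z, H, R)
    history.append((x[0][0], x[1][0]))
    print(f"z={z}: position={x[0][0]:.3f}, velocity={x[1][0]:.3f}")
```

Because the initial covariance is large, the filter trusts the measurements almost entirely, and the velocity estimate moves from 0 toward the slope of the measured positions within a few steps.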