cuML API Reference

Preprocessing

Model Selection and Data Splitting

cuml.preprocessing.model_selection.train_test_split(X:cudf.dataframe.dataframe.DataFrame, y:Union[str, cudf.dataframe.series.Series], train_size:Union[float, int]=0.8, shuffle:bool=True, seed:int=None) → Tuple[cudf.dataframe.dataframe.DataFrame, cudf.dataframe.dataframe.DataFrame, cudf.dataframe.dataframe.DataFrame, cudf.dataframe.dataframe.DataFrame]

Partitions the data into four collated dataframes, mimicking sklearn’s train_test_split

Parameters
X : cudf.DataFrame

Data to split; has shape (n_samples, n_features)

y : str or cudf.Series

Set of labels for the data: either a series of shape (n_samples,) or the string name of a column in X containing the labels

train_size : float or int, optional

If a float, represents the proportion [0, 1] of the data to be assigned to the training set. If an int, represents the number of instances to be assigned to the training set. Defaults to 0.8

shuffle : bool, optional

Whether or not to shuffle the inputs before splitting

seed : int, optional

If shuffle is True, seeds the generator. Unseeded by default

Returns
X_train, X_test, y_train, y_test : cudf.DataFrame

Partitioned dataframes. If y was provided as a column name, that column is dropped from the X partitions
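
A minimal usage sketch (the column names and seed here are hypothetical illustrations, not part of the API):

import cudf
import numpy as np
from cuml.preprocessing.model_selection import train_test_split

df = cudf.DataFrame()
df['feature'] = np.arange(10, dtype=np.float32)
df['label'] = np.array([0, 1] * 5, dtype=np.float32)

# y given as a column name: 'label' is dropped from X_train and X_test
X_train, X_test, y_train, y_test = train_test_split(df, 'label',
                                                    train_size=0.8,
                                                    shuffle=True, seed=42)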

Label Encoding

class cuml.preprocessing.LabelEncoder(*args, **kwargs)

An nvcategory based implementation of ordinal label encoding

Examples

Converting a categorical column to a numerical (ordinal) one:
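A minimal sketch (the category values are hypothetical), using the fit_transform and inverse_transform methods documented below:

import cudf
from cuml.preprocessing import LabelEncoder

data = cudf.Series(['a', 'b', 'c', 'b', 'a'])

le = LabelEncoder()
encoded = le.fit_transform(data)         # each category mapped to an ordinal code
decoded = le.inverse_transform(encoded)  # codes mapped back to the original strings

print(encoded)
print(decoded)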

Methods

fit

fit_transform

transform

inverse_transform

fit(self, y:cudf.dataframe.series.Series) → 'LabelEncoder'

Fit a LabelEncoder (nvcategory) instance to a set of categories

y : cudf.Series

Series containing the categories to be encoded. Its elements may or may not be unique

Returns
self : LabelEncoder

A fitted instance of itself to allow method chaining

fit_transform(self, y:cudf.dataframe.series.Series) → cudf.dataframe.series.Series

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) LabelEncoder().fit(y).transform(y)

transform(self, y:cudf.dataframe.series.Series) → cudf.dataframe.series.Series

Transform an input into its categorical keys.

This is intended for use with small inputs relative to the size of the dataset. For fitting and transforming an entire dataset, prefer fit_transform.

Parameters
y : cudf.Series

Input keys to be transformed. Its values should match the categories given to fit

Returns
encoded : cudf.Series

The ordinally encoded input series

Raises
KeyError

If a category appears that was not seen during fit
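
A short sketch of this failure mode (hypothetical categories):

import cudf
from cuml.preprocessing import LabelEncoder

le = LabelEncoder().fit(cudf.Series(['a', 'b']))
try:
    le.transform(cudf.Series(['c']))  # 'c' was never seen during fit
except KeyError:
    print("category 'c' was not seen during fit")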

Regression and Classification

Linear Regression

class cuml.LinearRegression

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides two algorithms, SVD and Eig, to fit a linear model. SVD is more stable, but Eig (default) is much faster.

Parameters
algorithm : ‘eig’ or ‘svd’ (default = ‘eig’)

Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable.

fit_intercept : boolean (default = True)

If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalize : boolean (default = False)

If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

Notes

LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to address the multicollinearity problem, and consider running DBSCAN first to remove the outliers, or statistical analysis to filter possible outliers.

Applications of LinearRegression

LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation and time series tasks, dynamic systems modelling and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).

For additional docs, see scikit-learn’s OLS.

For an additional example see the OLS notebook.

Examples

import numpy as np
import cudf

# Both import methods supported
from cuml import LinearRegression
from cuml.linear_model import LinearRegression

lr = LinearRegression(fit_intercept = True, normalize = False,
                      algorithm = "eig")

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

reg = lr.fit(X,y)
print("Coefficients:")
print(reg.coef_)
print("Intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = lr.predict(X_new)

print("Predictions:")
print(preds)

Output:

Coefficients:

            0 1.0000001
            1 1.9999998

Intercept:
            3.0

Predictions:

            0 15.999999
            1 14.999999
Attributes
coef_ : array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_ : array

The independent term. If fit_intercept_ is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

get_params(self[, deep])

Sklearn style return parameter state

predict(self, X)

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predicts the y for X.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params

Logistic Regression

class cuml.LogisticRegression

LogisticRegression is a linear model used to model the probability of occurrence of certain events, for example the probability of success or failure of an event.

cuML’s LogisticRegression can take array-like objects, either on host as NumPy arrays or on device (as Numba or __cuda_array_interface__-compliant arrays). It provides both single-class (using sigmoid loss) and multiple-class (using softmax loss) variants, depending on the input variables.

Only one solver option is currently available: Quasi-Newton (QN) algorithms. Even though it is presented as a single option, this solver resolves to two different algorithms underneath:

  • Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

  • Limited Memory BFGS (L-BFGS) otherwise.

Note that, just like in Scikit-learn, the bias will not be regularized.

Parameters
penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘l2’)

Used to specify the norm used in the penalization. If ‘none’ or ‘l2’ are selected, then L-BFGS solver will be used. If ‘l1’ is selected, solver OWL-QN will be used. If ‘elasticnet’ is selected, OWL-QN will be used if l1_ratio > 0, otherwise L-BFGS will be used.

tol: float (default = 1e-4)

The training process will stop if current_loss > previous_loss - tol

C: float (default = 1.0)

Inverse of regularization strength; must be a positive float.

fit_intercept: boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

class_weight: None

Custom class weights are currently not supported.

max_iter: int (default = 1000)

Maximum number of iterations taken for the solvers to converge.

verbose: bool (optional, default False)

Controls verbosity of logging.

l1_ratio: float or None, optional (default=None)

The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1

solver: ‘qn’, ‘lbfgs’, ‘owl’ (default = ‘qn’)

Algorithm to use in the optimization problem. Currently only ‘qn’ is supported, which automatically selects either L-BFGS or OWL-QN depending on the conditions of the l1 regularization described above. Options ‘lbfgs’ and ‘owl’ are just convenience values that end up using the same solver following the same rules (see the sketch after this parameter list).
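
As a concrete illustration of the solver-selection rule described above, a minimal sketch (the comments restate the rule; they are not additional API):

from cuml.linear_model import LogisticRegression

clf_l2 = LogisticRegression(penalty='l2')   # no l1 term: L-BFGS is used
clf_l1 = LogisticRegression(penalty='l1')   # l1 term present: OWL-QN is used
clf_en = LogisticRegression(penalty='elasticnet', l1_ratio=0.2)  # l1_ratio > 0: OWL-QN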

Notes

cuML’s LogisticRegression uses a different solver than the equivalent Scikit-learn class, except when there is no penalty and solver=lbfgs is chosen in Scikit-learn. This can cause (smaller) differences in the coefficients and predictions of the model, similar to the differences seen when using different solvers in Scikit-learn.

For additional docs, see Scikit-learn’s LogisticRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

Examples

import cudf
import numpy as np

# Both import methods supported
# from cuml import LogisticRegression
from cuml.linear_model import LogisticRegression

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series( np.array([0.0, 0.0, 1.0, 1.0], dtype = np.float32) )

reg = LogisticRegression()
reg.fit(X,y)

print("Coefficients:")
print(reg.coef_.copy_to_host())
print("Intercept:")
print(reg.intercept_.copy_to_host())

X_new = cudf.DataFrame()
X_new['col1'] = np.array([1,5], dtype = np.float32)
X_new['col2'] = np.array([2,5], dtype = np.float32)

preds = reg.predict(X_new)

print("Predictions:")
print(preds)

Attributes
coef_: device array, shape (n_classes, n_features)

The estimated coefficients for the linear regression model.

intercept_: device array (n_classes, 1)

The independent term. If fit_intercept_ is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X)

Predicts the y for X.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : array-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

predict(self, X)

Predicts the y for X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

Ridge Regression

class cuml.Ridge

Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors and improve the conditioning of the problem.

cuML’s Ridge expects an array-like object or cuDF DataFrame, and provides 3 algorithms: SVD, Eig and CD, to fit a linear model. SVD is more stable, but Eig (default) is much faster. CD uses Coordinate Descent and can be faster when the data is large.

Parameters
alpha : float or double

Regularization strength; must be a positive float. Larger values specify stronger regularization. Array input will be supported later.

solver : ‘eig’ or ‘svd’ or ‘cd’ (default = ‘eig’)

Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but guaranteed to be stable. CD (Coordinate Descent) is very fast and is suitable for large problems.

fit_intercept : boolean (default = True)

If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalize : boolean (default = False)

If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

Notes

Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability on the coefficients. Consider using Lasso, or thresholding small coefficients to zero.

Applications of Ridge

Ridge Regression is used in the same way as LinearRegression, but is used frequently as it does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.

For additional docs, see scikit-learn’s Ridge.

Examples

import numpy as np
import cudf

# Both import methods supported
from cuml import Ridge
from cuml.linear_model import Ridge

alpha = np.array([1e-5])
ridge = Ridge(alpha = alpha, fit_intercept = True, normalize = False,
              solver = "eig")

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

result_ridge = ridge.fit(X, y)
print("Coefficients:")
print(result_ridge.coef_)
print("Intercept:")
print(result_ridge.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = result_ridge.predict(X_new)

print("Predictions:")
print(preds)

Output:

Coefficients:

            0 1.0000001
            1 1.9999998

Intercept:
            3.0

Predictions:

            0 15.999999
            1 14.999999
Attributes
coef_ : array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_ : array

The independent term. If fit_intercept_ is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

get_params(self[, deep])

Sklearn style return parameter state

predict(self, X)

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : array-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predicts the y for X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params

Lasso Regression

class cuml.Lasso

Lasso extends LinearRegression by providing L1 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can zero some of the coefficients for feature selection, and improves the conditioning of the problem.

cuML’s Lasso expects an array-like object or cuDF DataFrame, and uses coordinate descent to fit a linear model.

Parameters
alpha : float or double

Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression class. For numerical reasons, using alpha = 0 with the Lasso class is not advised; in that case, you should use the LinearRegression class.

fit_intercept : boolean (default = True)

If True, Lasso tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalize : boolean (default = False)

If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

max_iter : int

The maximum number of iterations

tol : float, optional

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

selection : str, default ‘cyclic’

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially. This often leads to significantly faster convergence, especially when tol is higher than 1e-4.

Examples

import numpy as np
import cudf
from cuml.linear_model import Lasso

ls = Lasso(alpha = 0.1)

X = cudf.DataFrame()
X['col1'] = np.array([0, 1, 2], dtype = np.float32)
X['col2'] = np.array([0, 1, 2], dtype = np.float32)

y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )

result_lasso = ls.fit(X, y)
print("Coefficients:")
print(result_lasso.coef_)
print("intercept:")
print(result_lasso.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = result_lasso.predict(X_new)

print(preds)

Output:

Coefficients:

            0 0.85
            1 0.0

Intercept:
            0.149999

Preds:

            0 2.7
            1 1.85
Attributes
coef_ : array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_ : array

The independent term. If fit_intercept_ is False, will be 0.

For additional docs, see scikit-learn’s Lasso (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html).

Methods

fit(self, X, y)

Fit the model with X and y.

get_params(self[, deep])

Sklearn style return parameter state

predict(self, X)

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : array-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predicts the y for X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params

ElasticNet Regression

class cuml.ElasticNet

ElasticNet extends LinearRegression with combined L1 and L2 regularizations on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, force some coefficients to be small, and improve the conditioning of the problem.

cuML’s ElasticNet expects an array-like object or cuDF DataFrame, and uses coordinate descent to fit a linear model.

Parameters
alpha : float or double

Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised; in that case, you should use the LinearRegression object.

l1_ratio : float

The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2 (see the formula after this parameter list).

fit_intercept : boolean (default = True)

If True, ElasticNet tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalize : boolean (default = False)

If True, the predictors in X will be normalized by dividing by their L2 norms. If False, no scaling will be done.

max_iter : int

The maximum number of iterations

tol : float, optional

The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.

selection : str, default ‘cyclic’

If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially. This often leads to significantly faster convergence, especially when tol is higher than 1e-4.
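
The combined penalty these parameters control can be written out explicitly; this mirrors scikit-learn’s documented ElasticNet objective, which the parameters above follow:

    alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2

so l1_ratio = 1 recovers the pure L1 (Lasso) penalty and l1_ratio = 0 the pure L2 (Ridge) penalty.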

Examples

import numpy as np
import cudf
from cuml.linear_model import ElasticNet

enet = ElasticNet(alpha = 0.1, l1_ratio=0.5)

X = cudf.DataFrame()
X['col1'] = np.array([0, 1, 2], dtype = np.float32)
X['col2'] = np.array([0, 1, 2], dtype = np.float32)

y = cudf.Series( np.array([0.0, 1.0, 2.0], dtype = np.float32) )

result_enet = enet.fit(X, y)
print("Coefficients:")
print(result_enet.coef_)
print("intercept:")
print(result_enet.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = result_enet.predict(X_new)

print(preds)

Output:

Coefficients:

            0 0.448408
            1 0.443341

Intercept:
            0.1082506

Preds:

            0 3.67018
            1 3.22177
Attributes
coef_ : array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_ : array

The independent term. If fit_intercept_ is False, will be 0.

For additional docs, see scikit-learn’s ElasticNet (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html).

Methods

fit(self, X, y)

Fit the model with X and y.

get_params(self[, deep])

Sklearn style return parameter state

predict(self, X)

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : array-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predicts the y for X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params

Stochastic Gradient Descent

class cuml.SGD

Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.

cuML’s SGD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.

Parameters
loss : ‘hinge’, ‘log’, ‘squared_loss’ (default = ‘squared_loss’)

‘hinge’ uses linear SVM; ‘log’ uses logistic regression; ‘squared_loss’ uses linear regression

penalty : ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)

‘none’ does not perform any regularization; ‘l1’ performs the L1 norm (Lasso), which minimizes the sum of the absolute values of the coefficients; ‘l2’ performs the L2 norm (Ridge), which minimizes the sum of the squares of the coefficients; ‘elasticnet’ performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms

alpha : float (default = 0.0001)

The constant value which decides the degree of regularization

fit_intercept : boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochs : int (default = 1000)

The number of times the model should iterate through the entire dataset during training

tol : float (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffle : boolean (default = True)

If True, shuffles the training data after each epoch; if False, does not shuffle the training data after each epoch

eta0 : float (default = 0.0)

Initial learning rate

power_t : float (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate : ‘optimal’, ‘constant’, ‘invscaling’, ‘adaptive’ (default = ‘constant’)

‘optimal’ will be supported in the next version; ‘constant’ keeps the learning rate constant; ‘adaptive’ changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs, in which case the old learning rate is generally divided by 5

n_iter_no_change : int (default = 5)

The number of epochs to train without any improvement in the model

Notes

For additional docs, see scikit-learn’s SGDClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).

Examples
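
The listing below is a minimal sketch consistent with the printed output that follows; the training data, the hyperparameter values, and the coef_/intercept_ attribute names are assumptions, not the original example:

import numpy as np
import cudf
from cuml import SGD

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([1.0, 1.0, 2.0, 2.0], dtype=np.float32))

sgd = SGD(loss='squared_loss', penalty='none', learning_rate='constant',
          eta0=0.005, epochs=2000, tol=0.0, fit_intercept=True)
sgd.fit(X, y)

print("cuML intercept : ", sgd.intercept_)
print("cuML coef : ", sgd.coef_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3, 2], dtype=np.float32)
X_new['col2'] = np.array([5, 5], dtype=np.float32)
print("cuML predictions : ", sgd.predict(X_new))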

Output:

cuML intercept : 0.004561662673950195
cuML coef :
0   0.9834546
1 0.010128272
dtype: float32
cuML predictions : [3.0055666 2.0221121]

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X)

Predicts the y for X.

predictClass(self, X)

Predicts the y for X.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : array-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

predict(self, X)

Predicts the y for X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y : cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

predictClass(self, X)

Predicts the y for X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y : cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

Random Forest

class cuml.ensemble.RandomForestClassifier

Implements a Random Forest classifier model which fits multiple decision tree classifiers in an ensemble.

Note that the underlying algorithm for tree node splits differs from that used in scikit-learn. By default, the cuML Random Forest uses a histogram-based algorithm to determine splits, rather than an exact count. You can tune the size of the histograms with the n_bins parameter.

Known Limitations: This is an initial preview release of the cuML Random Forest code. It contains a number of known limitations:

  • Only classification is supported. Regression support is planned for the next release.

  • The implementation relies on limited CUDA shared memory for scratch space, so models with a very large number of features or bins will generate a memory limit exception. This limitation will be lifted in the next release.

  • Inference/prediction takes place on the CPU. A GPU-based inference solution is planned for a near-future release.

  • Instances of RandomForestClassifier cannot be pickled currently.

The code is under heavy development, so users who need these features may wish to pull from nightly builds of cuML. (See https://rapids.ai/start.html for instructions to download nightly packages via conda.)

Parameters
n_estimators : int (default = 10)

Number of trees in the forest.

handle : cuml.Handle

If it is None, a new one is created just for this class.

split_algo : int (default = 0)

The algorithm used to determine how nodes are split in the tree: 0 for HIST and 1 for GLOBAL_QUANTILE.

bootstrap : boolean (default = True)

Controls bootstrapping. If True, each tree in the forest is built on a bootstrapped sample with replacement. If False, sampling without replacement is done.

bootstrap_features : boolean (default = False)

Controls bootstrapping for features: whether features are drawn with or without replacement.

rows_sample : float (default = 1.0)

Ratio of dataset rows used while fitting each tree.

max_depth : int (default = -1)

Maximum tree depth. Unlimited (i.e., until leaves are pure) if -1.

max_leaves : int (default = -1)

Maximum leaf nodes per tree. Soft constraint. Unlimited if -1.

max_features : float (default = 1.0)

Ratio of the number of features (columns) to consider per node split.

n_bins : int (default = 8)

Number of bins used by the split algorithm.

min_rows_per_node : int (default = 2)

The minimum number of samples (rows) needed to split a node.

Examples

import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRFC

X = np.random.normal(size=(10,4)).astype(np.float32)
y = np.asarray([0,1]*5, dtype=np.int32)

cuml_model = cuRFC(max_features=1.0,
                   n_bins=8,
                   n_estimators=40)
cuml_model.fit(X,y)
cuml_predict = cuml_model.predict(X)

print("Predicted labels : ", cuml_predict)

Output:

Predicted labels :  [0 1 0 1 0 1 0 1 0 1]
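
Continuing the example, a short sketch of the score method documented below:

# fraction of rows whose predicted label matches y
accuracy = cuml_model.score(X, y)
print("Accuracy : ", accuracy)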

Methods

fit(self, X, y)

Perform Random Forest Classification on the input data

get_params(self[, deep])

Returns the value of all parameters required to configure this estimator as a dictionary.

predict(self, X)

Predicts the labels for X.

score(self, X, y)

Predicts the accuracy of the model for X.

set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to the sklearn set_params.

fit(self, X, y)

Perform Random Forest Classification on the input data

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : array-like (device or host) shape = (n_samples, 1)

Dense vector (int32) of shape (n_samples, 1). Acceptable formats: NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy. These labels should be contiguous integers from 0 to n_classes.

get_params(self, deep=True)

Returns the value of all parameters required to configure this estimator as a dictionary.

Parameters
deep : boolean (default = True)

predict(self, X)

Predicts the labels for X.

Parameters
X : array-like (host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: NumPy ndarray, Numba device ndarray

Returns
y : NumPy array

Dense vector (int) of shape (n_samples, 1)

score(self, X, y)

Predicts the accuracy of the model for X.

Parameters
X : array-like (host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: NumPy ndarray, Numba device ndarray

y : NumPy array

Dense vector (int) of shape (n_samples, 1)

Returns
accuracy : float
set_params(self, **params)

Sets the value of parameters required to configure this estimator; it functions similarly to the sklearn set_params.

Parameters
params : dict of new params

Quasi-Newton

class cuml.QN

Quasi-Newton methods are used to find zeroes or local maxima and minima of functions; this class uses them to optimize a cost function.

Two algorithms are implemented underneath cuML’s QN class, and which one is executed depends on the following rule:

  • Orthant-Wise Limited Memory Quasi-Newton (OWL-QN) if there is l1 regularization

  • Limited Memory BFGS (L-BFGS) otherwise.

cuML’s QN class can take array-like objects, either on host as NumPy arrays or on device (as Numba or __cuda_array_interface__-compliant arrays).

Parameters
loss: ‘sigmoid’, ‘softmax’, ‘squared_loss’ (default = ‘squared_loss’)

‘sigmoid’ loss is used for single-class logistic regression; ‘softmax’ loss is used for multiclass logistic regression; ‘squared_loss’ is used for normal/squared loss

fit_intercept: boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

l1_strength: float (default = 0.0)

l1 regularization strength (if non-zero, will run OWL-QN, else L-BFGS). Note, that as in Scikit-learn, the bias will not be regularized.

l2_strength: float (default = 0.0)

l2 regularization strength. Note, that as in Scikit-learn, the bias will not be regularized.

max_iter: int (default = 1000)

Maximum number of iterations taken for the solvers to converge.

tol: float (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

linesearch_max_iter: int (default = 50)

Max number of linesearch iterations per outer iteration of the algorithm.

lbfgs_memory: int (default = 5)

Rank of the lbfgs inverse-Hessian approximation. Method will use O(lbfgs_memory * D) memory.

verbose: bool (optional, default False)

Controls verbosity of logging.

Notes

This class contains implementations of two popular Quasi-Newton methods:

  • Limited-memory Broyden Fletcher Goldfarb Shanno (L-BFGS) [Nocedal, Wright - Numerical Optimization (1999)]

  • Orthant-wise limited-memory quasi-newton (OWL-QN) [Andrew, Gao - ICML 2007] (https://www.microsoft.com/en-us/research/publication/scalable-training-of-l1-regularized-log-linear-models/)

Examples

import cudf
import numpy as np

# Both import methods supported
# from cuml import QN
from cuml.solvers import QN

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series( np.array([0.0, 0.0, 1.0, 1.0], dtype = np.float32) )

solver = QN()
solver.fit(X,y)

# Note: for now, the coefficients also include the intercept in the
# last position if fit_intercept=True
print("Coefficients:")
print(solver.coef_.copy_to_host())
print("Intercept:")
print(solver.intercept_.copy_to_host())

X_new = cudf.DataFrame()
X_new['col1'] = np.array([1,5], dtype = np.float32)
X_new['col2'] = np.array([2,5], dtype = np.float32)

preds = solver.predict(X_new)

print("Predictions:")
print(preds)

Attributes
coef_ : array, shape (n_classes, n_features)

The estimated coefficients for the linear regression model. Note: shape is (n_classes, n_features + 1) if fit_intercept = True.

intercept_ : array (n_classes, 1)

The independent term. If fit_intercept_ is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X)

Predicts the y for X.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : array-like (device or host) shape = (n_samples, 1)

Dense vector (floats or doubles) of shape (n_samples, 1). Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

predict(self, X)

Predicts the y for X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y : cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

Clustering

K-Means Clustering

class cuml.KMeans

KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.

cuML’s KMeans expects an array-like object or cuDF DataFrame, and supports the scalable KMeans++ initialization method. This method is more stable than randomly selecting K points.

Parameters
handle : cuml.Handle

If it is None, a new one is created just for this class.

n_clusters : int (default = 8)

The number of centroids or clusters you want.

max_iter : int (default = 300)

The more iterations of EM, the more accurate, but slower.

tol : float (default = 1e-4)

Stopping criterion when centroid means do not change much.

verbose : boolean (default = 0)

If True, prints diagnostic information.

random_state : int (default = 1)

If you want results to be the same when you restart Python, select a state.

precompute_distances : boolean (default = ‘auto’)

Not supported yet.

init : {‘scalable-kmeans++’, ‘k-means||’, ‘random’ or an ndarray} (default = ‘scalable-k-means++’)

‘scalable-k-means++’ or ‘k-means||’: uses the fast and stable scalable kmeans++ initialization. ‘random’: chooses ‘n_clusters’ observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init : int (default = 1)

Number of times initialization is run. More is slower, but can be better.

algorithm : “auto”

Currently uses full EM, but will support others later.

n_gpu : int (default = 1)

Number of GPUs to use. Currently uses a single GPU, but will support multiple GPUs later.

Notes

KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, and visualize the resulting clusters with PCA, UMAP or T-SNE, and verify that they look appropriate.

Applications of KMeans

The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners’ first choice of clustering algorithm. KMeans has been extensively used when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.

For additional docs, see scikit-learn’s KMeans.

Examples

# Both import methods supported
from cuml import KMeans
from cuml.cluster import KMeans

import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c,column in enumerate(df):
      pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)

Output:

input:

     0    1
 0  1.0  1.0
 1  1.0  2.0
 2  3.0  2.0
 3  4.0  3.0

Calling fit

labels:

   0    0
   1    0
   2    1
   3    1

cluster_centers:

   0    1
0  1.0  1.5
1  3.5  2.5
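
Continuing the example, a minimal sketch of assigning new (hypothetical) points to the learned clusters with predict:

new_points = np2cudf(np.asarray([[1.0, 1.0], [4.0, 3.0]], dtype=np.float32))
# each row is assigned to its closest centroid; here the first point
# matches cluster 0 and the second matches cluster 1
print(kmeans_float.predict(new_points))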
Attributes
cluster_centers_ : array

The coordinates of the final clusters. This represents the “mean” of each data cluster.

labels_ : array

Which cluster each datapoint belongs to.

Methods

fit(self, X)

Compute k-means clustering with X.

fit_predict(self, X)

Compute cluster centers and predict cluster index for each sample.

fit_transform(self, X)

Compute clustering and transform X to cluster-distance space.

get_params(self[, deep])

Scikit-learn style return parameter state

predict(self, X)

Predict the closest cluster each sample in X belongs to.

set_params(self, **params)

Scikit-learn style set parameter state to dictionary of params.

transform(self, X)

Transform X to a cluster-distance space.

fit(self, X)

Compute k-means clustering with X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

fit_predict(self, X)

Compute cluster centers and predict cluster index for each sample.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

fit_transform(self, X)

Compute clustering and transform X to cluster-distance space.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

get_params(self, deep=True)

Scikit-learn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predict the closest cluster each sample in X belongs to.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

set_params(self, **params)

Scikit-learn style set parameter state to dictionary of params.

Parameters
params : dict of new params
transform(self, X)

Transform X to a cluster-distance space.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

DBSCAN

class cuml.DBSCAN

DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems if the datapoints tend to congregate in larger groups.

cuML’s DBSCAN expects an array-like object or cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.

Parameters
eps : float (default = 0.5)

The maximum distance between 2 points such that they reside in the same neighborhood.

handle : cuml.Handle

If it is None, a new one is created just for this class

min_samples : int (default = 5)

The number of samples in a neighborhood such that this group can be considered as an important core point (including the point itself).

verbose : bool

Whether to print debug output

max_bytes_per_batch : (optional) int64

Calculate batch size using no more than this number of bytes for the pairwise distance computation. This enables the trade-off between runtime and memory usage for making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out of memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not set the maximum total memory used in the DBSCAN computation, so this value cannot be set to the total memory available on the device.

Notes

DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.

Applications of DBSCAN

DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.

For an additional example, see the DBSCAN notebook. For additional docs, see scikit-learn’s DBSCAN.

Examples

# Both import methods supported
from cuml import DBSCAN
from cuml.cluster import DBSCAN

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
dbscan_float.fit(gdf_float)
print(dbscan_float.labels_)

Output:

0    0
1    1
2    2
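
Equivalently, the clustering and label retrieval can be done in one call with fit_predict (a minimal sketch reusing the data above):

labels = DBSCAN(eps = 1.0, min_samples = 1).fit_predict(gdf_float)
print(labels)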
Attributes
labels_ : array

Which cluster each datapoint belongs to. Noisy samples are labeled as -1.

Methods

fit(self, X)

Perform DBSCAN clustering from features.

fit_predict(self, X)

Performs clustering on input_gdf and returns cluster labels.

get_param_names(self)

fit(self, X)

Perform DBSCAN clustering from features.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

fit_predict(self, X)

Performs clustering on input_gdf and returns cluster labels.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
y : cuDF Series, shape (n_samples)

cluster labels

get_param_names(self)

Dimensionality Reduction and Manifold Learning

Principal Component Analysis

class cuml.PCA

PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X in linear combinations such that each new component captures the most information or variance of the data. n_components is usually small, say 3, where it can be used for data visualization, data compression and exploratory analysis.

cuML’s PCA expects an array-like object or cuDF DataFrame, and provides two algorithms, Full and Jacobi. Full (default) uses a full eigendecomposition and then selects the top K eigenvectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K eigenvectors, but might be less accurate.

Parameters
copy : boolean (default = True)

If True, the data is copied, and the mean is then removed from the copy. False might cause the data to be overwritten with its mean-centered version.

handle : cuml.Handle

If it is None, a new one is created just for this class

iterated_power : int (default = 15)

Used in the Jacobi solver. The more iterations, the more accurate, but slower.

n_components : int (default = 1)

The number of top K singular vectors / values you want. Must be <= number(columns).

random_state : int / None (default = None)

If you want results to be the same when you restart Python, select a state.

svd_solver : ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)

Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.

tol : float (default = 1e-7)

Used if algorithm = “jacobi”. A smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.

verbose : bool

Whether to print debug output

whiten : boolean (default = False)

If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems (a one-line usage sketch follows this parameter list).
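
A one-line sketch of enabling whitening (hypothetical usage; imports as in the Examples below):

pca_white = PCA(n_components = 2, whiten = True)  # components rescaled to unit variance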

Notes

PCA considers linear combinations of features, specifically those that maximise global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or T-SNE for a locally important embedding.

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to explore large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.

For an additional example, see the PCA notebook. For additional docs, see scikit-learn’s PCA.

Examples

# Both import methods supported
from cuml import PCA
from cuml.decomposition import PCA

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

pca_float = PCA(n_components = 2)
pca_float.fit(gdf_float)

print(f'components: {pca_float.components_}')
print(f'explained variance: {pca_float.explained_variance_}')
exp_var = pca_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')

print(f'singular values: {pca_float.singular_values_}')
print(f'mean: {pca_float.mean_}')
print(f'noise variance: {pca_float.noise_variance_}')

trans_gdf_float = pca_float.transform(gdf_float)
print(f'transformed matrix: {trans_gdf_float}')

input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')

Output:

components:
            0           1           2
            0  0.69225764  -0.5102837 -0.51028395
            1 -0.72165036 -0.48949987  -0.4895003

explained variance:

            0   8.510402
            1 0.48959687

explained variance ratio:

             0   0.9456003
             1 0.054399658

singular values:

           0 4.1256275
           1 0.9895422

mean:

          0 2.6666667
          1 2.3333333
          2 2.3333333

noise variance:

      0  0.0

transformed matrix:
             0           1
             0   -2.8547091 -0.42891636
             1 -0.121316016  0.80743366
             2    2.9760244 -0.37851727

Input Matrix:
          0         1         2
          0 1.0000001 3.9999993       4.0
          1       2.0 2.0000002 1.9999999
          2 4.9999995 1.0000006       1.0
Attributes
components_ : array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_ : array

How much each component explains the variance in the data, given by S**2

explained_variance_ratio_ : array

How much in % the variance is explained, given by S**2 / sum(S**2)

singular_values_ : array

The top K singular values. Remember all singular values >= 0

mean_ : array

The column-wise mean of X. Used to mean-center the data first.

noise_variance_ : float

From Bishop’s 1999 textbook. Used in later tasks like calculating the estimated covariance of X.

Methods

fit(self, X[, _transform])

Fit the model with X.

fit_transform(self, X[, y])

Fit the model with X and apply the dimensionality reduction on X.

get_param_names(self)

inverse_transform(self, X)

Transform data back to its original space.

transform(self, X)

Apply dimensionality reduction to X.

fit(self, X, _transform=False)

Fit the model with X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
self : PCA

A fitted instance of itself to allow method chaining
fit_transform(self, X, y=None)

Fit the model with X and apply the dimensionality reduction on X.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

Training data (floats or doubles), where n_samples is the number of samples, and n_features is the number of features. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

y : ignored

Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)
get_param_names(self)
inverse_transform(self, X)

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters
X : array-like (device or host) shape = (n_samples, n_components)

New data (floats or doubles), where n_samples is the number of samples and n_components is the number of components. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
X_original : cuDF DataFrame, shape (n_samples, n_features)
transform(self, X)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters
X : array-like (device or host) shape = (n_samples, n_features)

New data (floats or doubles), where n_samples is the number of samples and n_features is the number of features. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)

Truncated SVD

class cuml.TruncatedSVD

TruncatedSVD is used to compute the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, such as when using PCA with 3 components for 3D visualization.

cuML’s TruncatedSVD expects an array-like object or cuDF DataFrame, and provides two algorithms, Full and Jacobi. Full (default) uses a full eigendecomposition and then selects the top K singular vectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K singular vectors, but might be less accurate.

Parameters
algorithm : ‘full’ or ‘jacobi’ or ‘auto’ (default = ‘full’)

Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.

handle : cuml.Handle

If it is None, a new one is created just for this class

n_components : int (default = 1)

The number of top K singular vectors / values you want. Must be <= number(columns).

n_iter : int (default = 15)

Used in the Jacobi solver. The more iterations, the more accurate, but slower.

random_state : int / None (default = None)

If you want results to be the same when you restart Python, select a state.

tol : float (default = 1e-7)

Used if algorithm = “jacobi”. A smaller tolerance can increase accuracy, but will slow down the algorithm’s convergence.

verbose : bool

Whether to print debug output

Notes

TruncatedSVD (the randomized version [Jacobi]) is fantastic when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust; however, this method loses a lot of accuracy when you want very many components.

Applications of TruncatedSVD

TruncatedSVD is also known as Latent Semantic Indexing (LSI) which tries to find topics of a word count matrix. If X previously was centered with mean removal, TruncatedSVD is the same as TruncatedPCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.

For additional examples, see the Truncated SVD notebook. For additional documentation, see scikit-learn’s TruncatedSVD docs.

Examples

# Both import methods supported
from cuml import TruncatedSVD
from cuml.decomposition import TruncatedSVD

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

tsvd_float = TruncatedSVD(n_components = 2, algorithm = "jacobi",
                          n_iter = 20, tol = 1e-9)
tsvd_float.fit(gdf_float)

print(f'components: {tsvd_float.components_}')
print(f'explained variance: {tsvd_float.explained_variance_}')
exp_var = tsvd_float.explained_variance_ratio_
print(f'explained variance ratio: {exp_var}')
print(f'singular values: {tsvd_float.singular_values_}')

trans_gdf_float = tsvd_float.transform(gdf_float)
print(f'Transformed matrix: {trans_gdf_float}')

input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')

Output:

components:            0           1          2
0 0.58725953  0.57233137  0.5723314
1 0.80939883 -0.41525528 -0.4152552
explained variance:
0  55.33908
1 16.660923

explained variance ratio:
0  0.7685983
1 0.23140171

singular values:
0  7.439024
1 4.0817795

Transformed matrix:
            0             1
0   5.1659107     -2.512643
1   3.4638448     -0.042223275
2   4.0809603      3.2164836

Input matrix:           0         1         2
0       1.0  4.000001  4.000001
1 2.0000005 2.0000005 2.0000007
2  5.000001 0.9999999 1.0000004
Attributes
components_array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_array

How much each component explains the variance in the data given by S**2

explained_variance_ratio_array

The percentage of variance explained by each component, given by S**2 / sum(S**2)

singular_values_array

The top K singular values. Remember all singular values >= 0
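
As a quick sanity check of the S**2 / sum(S**2) relationship, using the singular values from the example output above:

import numpy as np

s = np.array([7.439024, 4.0817795])   # singular_values_ from the example
print(s**2 / (s**2).sum())            # ≈ [0.7686, 0.2314], matching explained_variance_ratio_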

Methods

fit(self, X[, _transform])

Fit LSI model on training cudf DataFrame X.

fit_transform(self, X)

Fit LSI model to X and perform dimensionality reduction on X.

get_param_names(self)

inverse_transform(self, X)

Transform X back to its original space.

transform(self, X)

Perform dimensionality reduction on X.

fit(self, X, _transform=True)

Fit LSI model on training cudf DataFrame X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

fit_transform(self, X)

Fit LSI model to X and perform dimensionality reduction on X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
X_newcuDF DataFrame, shape (n_samples, n_components)

Reduced version of X as a dense cuDF DataFrame

get_param_names(self)
inverse_transform(self, X)

Transform X back to its original space.

Returns a cuDF DataFrame X_original whose transform would be X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
X_originalcuDF DataFrame, shape (n_samples, n_features)

Note that this is always a dense cuDF DataFrame.

transform(self, X)

Perform dimensionality reduction on X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
X_newcuDF DataFrame, shape (n_samples, n_components)

Reduced version of X. This will always be a dense DataFrame.

UMAP

class cuml.UMAP

Uniform Manifold Approximation and Projection. Finds a low dimensional embedding of the data that approximates an underlying manifold.

Adapted from https://github.com/lmcinnes/umap/blob/master/umap/umap.py

Parameters
n_neighbors: int (optional, default 15)

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components: int (optional, default 2)

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

n_epochs: int (optional, default None)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

learning_rate: float (optional, default 1.0)

The initial learning rate for the embedding optimization.

init: string (optional, default ‘spectral’)
How to initialize the low dimensional embedding. Options are:
  • ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton

  • ‘random’: assign initial embedding positions at random.

min_dist: float (optional, default 0.1)

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

spread: float (optional, default 1.0)

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

set_op_mix_ratio: float (optional, default 1.0)

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity: int (optional, default 1)

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

repulsion_strength: float (optional, default 1.0)

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate: int (optional, default 5)

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

transform_queue_size: float (optional, default 4.0)

For transform operations (embedding new points using a trained model), this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.

a: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

b: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

verbose: bool (optional, default False)

Controls verbosity of logging.

Notes

This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:

  • Specifying the random seed

  • Using a non-euclidean distance metric (support for a fixed set of non-euclidean metrics is planned for an upcoming release).

  • Using a pre-computed pairwise distance matrix (under consideration for future releases)

  • Manual initialization of initial embedding positions

In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP. In particular, the reference UMAP uses an approximate kNN algorithm for large data sizes while cuml.umap always uses exact kNN.
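
Examples

A minimal usage sketch (hedged: it assumes only the constructor parameters and the fit_transform method documented in this section; the data and parameter values are illustrative):

import cudf
import numpy as np
from cuml import UMAP

gdf = cudf.DataFrame()
gdf['0'] = np.asarray([1.0, 2.0, 5.0, 8.0], dtype=np.float32)
gdf['1'] = np.asarray([4.0, 2.0, 1.0, 7.0], dtype=np.float32)
gdf['2'] = np.asarray([4.0, 2.0, 1.0, 3.0], dtype=np.float32)

umap = UMAP(n_neighbors=2, n_components=2)   # small n_neighbors for a tiny dataset
embedding = umap.fit_transform(gdf)          # embedding of shape (4, 2)
print(embedding)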

Methods

fit(self, X[, y])

Fit X into an embedded space.

fit_transform(self, X[, y])

Fit X into an embedded space and return that transformed output.

transform(self, X)

Transform X into the existing embedded space and return that transformed output.

fit(self, X, y=None)

Fit X into an embedded space.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

yarray-like (device or host) shape = (n_samples, 1)

y contains a label per row. Acceptable formats: cuDF Series, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

fit_transform(self, X, y=None)

Fit X into an embedded space and return that transformed output.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

X contains a sample per row. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
X_newarray, shape (n_samples, n_components)

Embedding of the training data in low-dimensional space.

transform(self, X)

Transform X into the existing embedded space and return that transformed output.

Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit() followed by transform().

Specifically, the transform() function is stochastic: https://github.com/lmcinnes/umap/issues/158

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

New data to be transformed. Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

Returns
X_newarray, shape (n_samples, n_components)

Embedding of the new data in low-dimensional space.

Random Projections

class cuml.random_projection.GaussianRandomProjection

Gaussian Random Projection method, derived from the BaseRandomProjection class.

Random projection is a dimensionality reduction technique. Random projection methods are powerful methods known for their simplicity, computational efficiency and restricted model size. This algorithm also has the advantage to preserve distances well between any two samples and is thus suitable for methods having this requirement.

The components of the random matrix are drawn from N(0, 1 / n_components).

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class

n_componentsint (default = ‘auto’)

Dimensionality of the target projection space. If set to ‘auto’, the parameter is deduced using the Johnson–Lindenstrauss lemma. The automatic deduction makes use of the number of samples and the eps parameter.

The Johnson–Lindenstrauss lemma can produce a very conservative value of n_components, as it makes no assumption on the structure of the dataset.

epsfloat (default = 0.1)

Error tolerance during projection. Used by Johnson–Lindenstrauss automatic deduction when n_components is set to ‘auto’.

random_stateint (default = None)

Seed used to initialize the random generator
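
To see how conservative the automatic deduction can be, here is the Johnson–Lindenstrauss bound as implemented in sklearn’s johnson_lindenstrauss_min_dim (a sketch; cuML’s ‘auto’ deduction is assumed to follow the same bound):

import numpy as np

n_samples, eps = 10000, 0.1
n_components = 4 * np.log(n_samples) / (eps**2 / 2 - eps**3 / 3)
print(int(n_components))   # ~7894 components needed for only 10,000 samples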

Notes

Inspired by sklearn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html

Attributes
gaussian_methodboolean

To be passed to base class in order to determine random matrix generation method
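
Examples

A hedged usage sketch (it assumes a fit/transform interface mirrored from sklearn’s random projection classes, which is not spelled out in this reference; the input data is random for illustration):

import numpy as np
from cuml.random_projection import GaussianRandomProjection

X = np.random.rand(1000, 512).astype(np.float32)
grp = GaussianRandomProjection(n_components=16, random_state=42)
grp.fit(X)
X_new = grp.transform(X)   # projected data
print(X_new.shape)         # (1000, 16)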

class cuml.random_projection.SparseRandomProjection

Sparse Random Projection method, derived from the BaseRandomProjection class.

Random projection is a dimensionality reduction technique. Random projection methods are powerful methods known for their simplicity, computational efficiency and restricted model size. This algorithm also has the advantage to preserve distances well between any two samples and is thus suitable for methods having this requirement.

A sparse random matrix is an alternative to a dense random projection matrix (e.g. Gaussian) that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data (with sufficiently sparse matrices). If we denote s = 1 / density, the components of the random matrix are drawn from:

  • -sqrt(s) / sqrt(n_components) with probability 1 / 2s

  • 0 with probability 1 - 1 / s

  • +sqrt(s) / sqrt(n_components) with probability 1 / 2s
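
To make this distribution concrete, here is a host-side NumPy sketch of drawing such a matrix (illustrative only; cuML generates its projection matrix internally on the GPU):

import numpy as np

np.random.seed(0)
n_features, n_components, density = 100, 8, 0.1
s = 1.0 / density
# the three possible entry values, scaled by 1 / sqrt(n_components)
vals = np.array([-np.sqrt(s), 0.0, np.sqrt(s)]) / np.sqrt(n_components)
probs = [1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)]
R = np.random.choice(vals, size=(n_features, n_components), p=probs)
print((R != 0).mean())   # fraction of non-zero entries, ≈ density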

Parameters
handlecuml.Handle

If it is None, a new one is created just for this class

n_componentsint (default = ‘auto’)

Dimensionality of the target projection space. If set to ‘auto’, the parameter is deduced using the Johnson–Lindenstrauss lemma. The automatic deduction makes use of the number of samples and the eps parameter.

The Johnson–Lindenstrauss lemma can produce a very conservative value of n_components, as it makes no assumption on the structure of the dataset.

densityfloat in range (0, 1] (default = ‘auto’)

Ratio of non-zero component in the random projection matrix.

If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).

epsfloat (default = 0.1)

Error tolerance during projection. Used by Johnson–Lindenstrauss automatic deduction when n_components is set to ‘auto’.

dense_outputboolean (default = True)

If set to True, the transformed matrix will be dense; otherwise sparse.

random_stateint (default = None)

Seed used to initialize the random generator

Notes

Inspired by sklearn’s implementation: https://scikit-learn.org/stable/modules/random_projection.html

Attributes
gaussian_methodboolean

To be passed to base class in order to determine random matrix generation method

Neighbors

Nearest Neighbors

class cuml.NearestNeighbors

NearestNeighbors is an unsupervised algorithm: to find the “closest” datapoint(s) to new unseen data, one calculates a suitable “distance” to each and every known point and returns the top K datapoints with the smallest distances.

cuML’s KNN expects an array-like object or cuDF DataFrame (automatic chunking into a NumPy array will be done in a future release), and first fits a special data structure to approximate the distance calculations, allowing query times of O(p log n) rather than the brute-force O(np), where p = number of features.

Parameters
n_neighbors: int (default = 5)

The top K closest datapoints you want the algorithm to return. If this number is large, then expect the algorithm to run slower.

should_downcastbool (default = False)

Currently only single precision is supported in the underlying index. Setting this to true will allow double-precision input arrays to be automatically downcasted to single precision.

Notes

NearestNeighbors is an instance-based model, which means the data X has to be stored in order for inference to occur.

Applications of NearestNeighbors

Applications of NearestNeighbors include recommendation systems where content-based or collaborative filtering is used. Since NearestNeighbors is a relatively simple instance-based model, it is also used in data visualization and regression / classification tasks.

For an additional example see the NearestNeighbors notebook.

For additional docs, see scikit-learn’s NearestNeighbors.

Examples

import cudf
from cuml.neighbors import NearestNeighbors
import numpy as np

np_float = np.array([
  [1,2,3], # Point 1
  [1,2,4], # Point 2
  [2,2,4]  # Point 3
]).astype('float32')

gdf_float = cudf.DataFrame()
gdf_float['dim_0'] = np.ascontiguousarray(np_float[:,0])
gdf_float['dim_1'] = np.ascontiguousarray(np_float[:,1])
gdf_float['dim_2'] = np.ascontiguousarray(np_float[:,2])

print('n_samples = 3, n_dims = 3')
print(gdf_float)

nn_float = NearestNeighbors()
nn_float.fit(gdf_float)
# get 3 nearest neighbors
distances, indices = nn_float.kneighbors(gdf_float, k=3)

print(indices)
print(distances)

Output:

n_samples = 3, n_dims = 3

dim_0 dim_1 dim_2

0   1.0   2.0   3.0
1   1.0   2.0   4.0
2   2.0   2.0   4.0

# indices:

         index_neighbor_0 index_neighbor_1 index_neighbor_2
0                0                1                2
1                1                0                2
2                2                1                0
# distances:

         distance_neighbor_0 distance_neighbor_1 distance_neighbor_2
0                 0.0                 1.0                 2.0
1                 0.0                 1.0                 1.0
2                 0.0                 1.0                 2.0

Methods

fit(self, X)

Fit GPU index for performing nearest neighbor queries.

kneighbors(self, X[, k])

Query the GPU index for the k nearest neighbors of row vectors in X.

fit(self, X)

Fit GPU index for performing nearest neighbor queries.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

kneighbors(self, X, k=None)

Query the GPU index for the k nearest neighbors of row vectors in X.

Parameters
Xarray-like (device or host) shape = (n_samples, n_features)

Dense matrix (floats or doubles) of shape (n_samples, n_features). Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device ndarray, cuda array interface compliant array like CuPy

k: Integer

Number of neighbors to search

Returns
distances: cuDF DataFrame or numpy ndarray

The distances of the k-nearest neighbors for each row vector in X

indices: cuDF DataFrame or numpy ndarray

The indices of the k-nearest neighbors for each row vector in X

Time Series

Kalman Filter

class cuml.KalmanFilter

Implements a Kalman filter. You are responsible for setting the various state variables to reasonable values; defaults will not give you a functional filter. After construction the filter will have default matrices created for you, but you must specify the values for each.

Parameters
dim_xint

Number of state variables for the Kalman filter. This is used to set the default size of P, Q, and u

dim_zint

Number of measurement inputs.

Examples

import numpy as np
from cuml import KalmanFilter

f = KalmanFilter(dim_x=2, dim_z=1)
f.x = np.array([[2.],    # position
                [0.]])   # velocity
f.F = np.array([[1., 1.], [0., 1.]])
f.H = np.array([[1., 0.]])
f.P = np.array([[1000., 0.], [0., 1000.]])
f.R = 5

Now just perform the standard predict/update loop:

from numba import cuda

while some_condition_is_true:           # e.g. while new sensor readings arrive
    z = cuda.to_device(np.array([i]))   # i holds the latest measurement value
    f.predict()
    f.update(z)
Attributes
xnumba device array, numpy array or cuDF series (dim_x, 1)

Current state estimate. Any call to update() or predict() updates this variable.

Pnumba device array, numpy array or cuDF dataframe(dim_x, dim_x)

Current state covariance matrix. Any call to update() or predict() updates this variable.

x_priornumba device array, numpy array or cuDF series(dim_x, 1)

Prior (predicted) state estimate. The *_prior and *_post attributes are for convenience; they store the prior and posterior of the current epoch. Read Only.

P_priornumba device array, numpy array or cuDF dataframe(dim_x, dim_x)

Prior (predicted) state covariance matrix. Read Only.

x_postnumba device array, numpy array or cuDF series(dim_x, 1)

Posterior (updated) state estimate. Read Only.

P_postnumba device array, numpy array or cuDF dataframe(dim_x, dim_x)

Posterior (updated) state covariance matrix. Read Only.

znumba device array or cuDF series (dim_z, 1)

Last measurement used in update(). Read only.

Rnumba device array(dim_z, dim_z)

Measurement noise matrix

Qnumba device array(dim_x, dim_x)

Process noise matrix

Fnumba device array(dim_x, dim_x)

State Transition matrix

Hnumba device array(dim_z, dim_x)

Measurement function

ynumba device array

Residual of the update step. Read only.

Knumba device array(dim_x, dim_z)

Kalman gain of the update step. Read only.

precision: ‘single’ or ‘double’

Whether the Kalman Filter uses single or double precision

Methods

predict(self[, B, F, Q])

Predict next state (prior) using the Kalman filter state propagation equations.

update(self, z[, R, H])

Add a new measurement (z) to the Kalman filter.

predict(self, B=None, F=None, Q=None)

Predict next state (prior) using the Kalman filter state propagation equations.

Parameters
unp.array

Optional control vector. If not None, it is multiplied by B to create the control input into the system.

Bnp.array(dim_x, dim_z), or None

Optional control transition matrix; a value of None will cause the filter to use self.B.

Fnp.array(dim_x, dim_x), or None

Optional state transition matrix; a value of None will cause the filter to use self.F.

Qnp.array(dim_x, dim_x), scalar, or None

Optional process noise matrix; a value of None will cause the filter to use self.Q.
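
For reference, these are the standard Kalman propagation equations that predict() is described as implementing (textbook form, not specific to cuML):

\begin{aligned}
x_{\mathrm{prior}} &= F\,x + B\,u \\
P_{\mathrm{prior}} &= F\,P\,F^{\top} + Q
\end{aligned}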

update(self, z, R=None, H=None)

Add a new measurement (z) to the Kalman filter. If z is None, nothing is computed. However, x_post and P_post are updated with the prior (x_prior, P_prior), and self.z is set to None.

Parameters
zarray_like (dim_z, 1)

Measurement for this update. z can be a scalar if dim_z is 1, otherwise it must be convertible to a column vector.

Rnp.array, scalar, or None

Optionally provide R to override the measurement noise for this one call, otherwise self.R will be used.

Hnp.array, or None

Optionally provide H to override the measurement function for this one call, otherwise self.H will be used.
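
Putting predict() and update() together, a hedged end-to-end sketch (it assumes only the constructor, attributes and methods documented above; the measurement values and noise settings are illustrative):

import numpy as np
from numba import cuda
from cuml import KalmanFilter

f = KalmanFilter(dim_x=2, dim_z=1)
f.x = np.array([[0.], [1.]])             # initial position and velocity
f.F = np.array([[1., 1.], [0., 1.]])     # constant-velocity state transition
f.H = np.array([[1., 0.]])               # we measure position only
f.P = np.array([[1000., 0.], [0., 1000.]])
f.R = 5

for z_value in [1.0, 2.1, 2.9, 4.2]:     # illustrative position readings
    f.predict()                          # propagate: fills x_prior, P_prior
    z = cuda.to_device(np.array([z_value], dtype=np.float32))
    f.update(z)                          # correct: fills x_post, P_post
print(f.x)                               # final posterior state estimate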