cuML API Reference

Linear Regression

class cuml.LinearRegression

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.

cuML's LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides 2 algorithms, SVD and Eig, to fit a linear model. SVD is more stable, but Eig (the default) is much faster.

Parameters
algorithm : 'eig' or 'svd' (default = 'eig')

Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but is guaranteed to be stable.

fit_intercept : boolean (default = True)

If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalize : boolean (default = False)

If True, the predictors in X will be normalized by dividing each by its L2 norm. If False, no scaling will be done.

Notes

LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge Regression to fix the multicollinearity problem, and consider running DBSCAN first to remove the outliers, or using leverage statistics to filter possible outliers.
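
As an illustration of the Ridge suggestion above, here is a minimal hypothetical sketch that fits both models on deliberately correlated columns; the toy data and the alpha value are made up for this example.

# Hypothetical sketch: LinearRegression vs Ridge on nearly collinear columns.
# The toy data and alpha value are made up for illustration only.
import numpy as np
import cudf
from cuml import LinearRegression, Ridge

X = cudf.DataFrame()
X['col1'] = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
X['col2'] = np.array([1.1, 2.1, 2.9, 4.2], dtype=np.float32)  # nearly collinear with col1
y = cudf.Series(np.array([2.0, 4.1, 5.9, 8.2], dtype=np.float32))

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=np.array([1.0])).fit(X, y)

print(ols.coef_)    # OLS coefficients can blow up when columns are correlated
print(ridge.coef_)  # Ridge shrinks the coefficients and stabilizes the fit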

Applications of LinearRegression

LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation or time series tasks, dynamic systems modelling and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).

For additional docs, see scikit-learn's OLS.

Examples

import numpy as np
import cudf

# Both import methods supported
from cuml import LinearRegression
from cuml.linear_model import LinearRegression

lr = LinearRegression(fit_intercept = True, normalize = False, algorithm = "eig")

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

reg = lr.fit(X,y)
print("Coefficients:")
print(reg.coef_)
print("intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = lr.predict(X_new)

print(preds)

Output:

Coefficients:

            0 1.0000001
            1 1.9999998

Intercept:
            3.0

Preds:

            0 15.999999
            1 14.999999
Attributes
coef_ : array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_ : array

The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

get_params(self[, deep])

Sklearn style return parameter state

predict(self, X)

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predicts the y for X.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params
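
As a quick hypothetical sketch of the sklearn-style parameter handling described above (the parameter names follow the constructor arguments listed earlier):

# Hypothetical usage sketch for get_params / set_params.
from cuml import LinearRegression

lr = LinearRegression(algorithm="eig")
print(lr.get_params())           # current parameter state as a dict
lr.set_params(algorithm="svd")   # switch to the more stable SVD solver
print(lr.get_params())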

Ridge Regression

class cuml.Ridge

Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors and improve the conditioning of the problem.

cuML's Ridge expects a cuDF DataFrame, and provides 3 algorithms, SVD, Eig and CD, to fit a linear model. SVD is more stable, but Eig (the default) is much faster. CD uses Coordinate Descent and can be faster if the data is large.

Parameters
alpha : float or double

Regularization strength - must be a positive float. Larger values specify stronger regularization. Array input will be supported later.

solver : 'eig', 'svd' or 'cd' (default = 'eig')

Eig uses an eigendecomposition of the covariance matrix, and is much faster. SVD is slower, but is guaranteed to be stable. CD, or Coordinate Descent, is very fast and is suitable for large problems.

fit_intercept : boolean (default = True)

If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.

normalize : boolean (default = False)

If True, the predictors in X will be normalized by dividing each by its L2 norm. If False, no scaling will be done.

Notes

Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability on the coefficients. Consider using Lasso, or thresholding small coefficients to zero.
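
One hypothetical way to act on the thresholding suggestion above is sketched below; the data and the 1e-3 cutoff are made up, and coef_ is assumed to come back as a cuDF Series (as in the SGD example further below).

# Hypothetical sketch: zero out negligible Ridge coefficients after fitting.
import numpy as np
import cudf
from cuml import Ridge

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([6.0, 8.0, 9.0, 11.0], dtype=np.float32))

model = Ridge(alpha=np.array([1.0]), fit_intercept=True, solver="eig")
model.fit(X, y)

coefs = model.coef_.to_array()         # copy coefficients to a host NumPy array
coefs[np.abs(coefs) < 1e-3] = 0.0      # threshold small coefficients to zero
print(coefs)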

Applications of Ridge

Ridge Regression is used in the same way as LinearRegression, but is used more frequently as it does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.

For additional docs, see scikit-learn's Ridge.

Examples

import numpy as np
import cudf

# Both import methods supported
from cuml import Ridge
from cuml.linear_model import Ridge

alpha = np.array([1.0])
ridge = Ridge(alpha = alpha, fit_intercept = True, normalize = False, solver = "eig")

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)

y = cudf.Series( np.array([6.0, 8.0, 9.0, 11.0], dtype = np.float32) )

result_ridge = ridge.fit(X, y)
print("Coefficients:")
print(result_ridge.coef_)
print("intercept:")
print(result_ridge.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3,2], dtype = np.float32)
X_new['col2'] = np.array([5,5], dtype = np.float32)
preds = result_ridge.predict(X_new)

print(preds)

Output:

Coefficients:

            0 1.0000001
            1 1.9999998

Intercept:
            3.0

Preds:

            0 15.999999
            1 14.999999
Attributes
coef_ : array, shape (n_features)

The estimated coefficients for the linear regression model.

intercept_ : array

The independent term. If fit_intercept is False, will be 0.

Methods

fit(self, X, y)

Fit the model with X and y.

get_params(self[, deep])

Sklearn style return parameter state

predict(self, X)

Predicts the y for X.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predicts the y for X.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params

Stochastic Gradient Descent

class cuml.SGD

Stochastic Gradient Descent is a very common machine learning algorithm where one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems when the exact solution is hard or even impossible to find.

cuML’s SGD algorithm accepts a numpy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.

Parameters
loss : 'hinge', 'log', 'squared_loss' (default = 'squared_loss')

'hinge' uses linear SVM, 'log' uses logistic regression, 'squared_loss' uses linear regression

penalty: ‘none’, ‘l1’, ‘l2’, ‘elasticnet’ (default = ‘none’)

'none' does not perform any regularization; 'l1' performs L1 norm (Lasso), which minimizes the sum of the absolute values of the coefficients; 'l2' performs L2 norm (Ridge), which minimizes the sum of the squares of the coefficients; 'elasticnet' performs Elastic Net regularization, which is a weighted average of the L1 and L2 norms

alpha: float (default = 0.0001)

The constant value which decides the degree of regularization

fit_intercept : boolean (default = True)

If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.

epochs : int (default = 1000)

The number of times the model should iterate through the entire dataset during training.

tol : float (default = 1e-3)

The training process will stop if current_loss > previous_loss - tol

shuffle : boolean (default = True)

If True, shuffles the training data after each epoch; if False, does not shuffle the training data after each epoch

eta0 : float (default = 0.0)

Initial learning rate

power_t : float (default = 0.5)

The exponent used for calculating the invscaling learning rate

learning_rate : 'optimal', 'constant', 'invscaling', 'adaptive' (default = 'constant')

'optimal' will be supported in a future version. 'constant' keeps the learning rate constant. 'adaptive' changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs; the old learning rate is generally divided by 5.

n_iter_no_change : int (default = 5)

The number of epochs to train without any improvement in the model

Notes

For additional docs, see scikit-learn's SGDClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html

Examples

import numpy as np
import cudf
from cuml.solvers import SGD as cumlSGD

X = cudf.DataFrame()
X['col1'] = np.array([1,1,2,2], dtype = np.float32)
X['col2'] = np.array([1,2,2,3], dtype = np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))
pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)

cu_sgd = cumlSGD(learning_rate='constant', eta0=0.005, epochs=2000,
                 fit_intercept=True, batch_size=2,
                 tol=0.0, penalty='none', loss='squared_loss')

cu_sgd.fit(X, y)
cu_pred = cu_sgd.predict(pred_data).to_array()
print(" cuML intercept : ", cu_sgd.intercept_)
print(" cuML coef : ", cu_sgd.coef_)
print("cuML predictions : ", cu_pred)

Output:

cuML intercept :  0.004561662673950195
cuML coef :  0      0.9834546
            1    0.010128272
           dtype: float32
cuML predictions :  [3.0055666 2.0221121]

Methods

fit(self, X, y)

Fit the model with X and y.

predict(self, X)

Predicts the y for X.

predictClass(self, X)

Predicts the y for X.

fit(self, X, y)

Fit the model with X and y.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

predict(self, X)

Predicts the y for X.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)

predictClass(self, X)

Predicts the y for X.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

Returns
y: cuDF DataFrame

Dense vector (floats or doubles) of shape (n_samples, 1)
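
Since the example above only covers the squared loss, here is a minimal hypothetical sketch of SGD with a classification loss and predictClass; the toy labels and hyperparameters are made up, and the return value is assumed to behave like predict's.

# Hypothetical sketch: SGD with logistic loss and predictClass on binary labels.
import numpy as np
import cudf
from cuml.solvers import SGD as cumlSGD

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([0, 0, 1, 1], dtype=np.float32))

clf = cumlSGD(loss='log', eta0=0.005, epochs=2000,
              fit_intercept=True, batch_size=2, tol=0.0)
clf.fit(X, y)
print(clf.predictClass(X))   # predicted class labels for the training points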

Nearest Neighbors

class cuml.NearestNeighbors

NearestNeighbors is an unsupervised algorithm: if one wants to find the "closest" datapoint(s) to new unseen data, one can calculate a suitable "distance" between each and every point, and return the top K datapoints which have the smallest distance to it.

cuML's KNN expects a cuDF DataFrame or a Numpy Array (where automatic chunking into a Numpy Array will be done in a future release), and fits a special data structure first to approximate the distance calculations, allowing the querying time to be O(p log n) rather than the brute force O(np), where p = number of features.

Parameters
n_neighbors: int (default = 5)

The top K closest datapoints you want the algorithm to return. If this number is large, then expect the algorithm to run slower.

should_downcast : bool (default = False)

Currently only single precision is supported in the underlying index. Setting this to True will allow double-precision input arrays to be automatically downcast to single precision.

Notes

NearestNeighbors is a generative model. This means the data X has to be stored in order for inference to occur.

Applications of NearestNeighbors

Applications of NearestNeighbors include recommendation systems where content or collaborative filtering is used. Since NearestNeighbors is a relatively simple generative model, it is also used in data visualization and regression / classification tasks.

For an additional example see the NearestNeighbors notebook.

For additional docs, see scikit-learn's NearestNeighbors.

Examples

import cudf
from cuml.neighbors import NearestNeighbors
import numpy as np

np_float = np.array([
  [1,2,3], # Point 1
  [1,2,4], # Point 2
  [2,2,4]  # Point 3
]).astype('float32')

gdf_float = cudf.DataFrame()
gdf_float['dim_0'] = np.ascontiguousarray(np_float[:,0])
gdf_float['dim_1'] = np.ascontiguousarray(np_float[:,1])
gdf_float['dim_2'] = np.ascontiguousarray(np_float[:,2])

print('n_samples = 3, n_dims = 3')
print(gdf_float)

nn_float = NearestNeighbors()
nn_float.fit(gdf_float)
distances,indices = nn_float.kneighbors(gdf_float,k=3) #get 3 nearest neighbors

print(indices)
print(distances)

Output:

n_samples = 3, n_dims = 3

dim_0 dim_1 dim_2

0   1.0   2.0   3.0
1   1.0   2.0   4.0
2   2.0   2.0   4.0

# indices:

         index_neighbor_0 index_neighbor_1 index_neighbor_2
0                0                1                2
1                1                0                2
2                2                1                0
# distances:

         distance_neighbor_0 distance_neighbor_1 distance_neighbor_2
0                 0.0                 1.0                 2.0
1                 0.0                 1.0                 1.0
2                 0.0                 1.0                 2.0

Methods

fit(self, X)

Fit GPU index for performing nearest neighbor queries.

kneighbors(self, X[, k])

Query the GPU index for the k nearest neighbors of row vectors in X.

fit(self, X)

Fit GPU index for performing nearest neighbor queries.

Parameters
X : cuDF DataFrame or numpy ndarray

Dense matrix (floats or doubles) of shape (n_samples, n_features)

kneighbors(self, X, k=None)

Query the GPU index for the k nearest neighbors of row vectors in X.

Parameters
X : cuDF DataFrame or numpy ndarray

Dense matrix (floats or doubles) of shape (n_samples, n_features)

k: Integer

The number of neighbors

Returns
distances: cuDF DataFrame or numpy ndarray

The distances of the k-nearest neighbors for each row vector in X

indices: cuDF DataFrame or numpy ndarray

The indices of the k-nearest neighbors for each row vector in X
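
The example above queries the index with its own training data; the following hypothetical sketch (with made-up points) shows the same kneighbors call on a separate query DataFrame.

# Hypothetical sketch: query the fitted index with new points instead of the training data.
import numpy as np
import cudf
from cuml.neighbors import NearestNeighbors

train = cudf.DataFrame()
train['dim_0'] = np.array([1.0, 1.0, 2.0], dtype=np.float32)
train['dim_1'] = np.array([2.0, 2.0, 2.0], dtype=np.float32)
train['dim_2'] = np.array([3.0, 4.0, 4.0], dtype=np.float32)

query = cudf.DataFrame()
query['dim_0'] = np.array([1.5], dtype=np.float32)
query['dim_1'] = np.array([2.0], dtype=np.float32)
query['dim_2'] = np.array([3.5], dtype=np.float32)

nn = NearestNeighbors()
nn.fit(train)
distances, indices = nn.kneighbors(query, k=2)   # 2 nearest training rows per query row
print(indices)
print(distances)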

K-Means Clustering

class cuml.KMeans

KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X, and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.

cuML's KMeans expects a cuDF DataFrame, and supports the fast KMeans++ initialization method. This method is more stable than randomly selecting K points.

Parameters
handle : cuml.Handle

If it is None, a new one is created just for this class.

n_clusters : int (default = 8)

The number of centroids or clusters you want.

max_iter : int (default = 300)

The more iterations of EM, the more accurate, but slower.

tol : float (default = 1e-4)

Stopping criterion when centroid means do not change much.

verbose : boolean (default = 0)

If True, prints diagnostic information.

random_state : int (default = 1)

If you want results to be the same when you restart Python, select a state.

precompute_distances : boolean (default = 'auto')

Not supported yet.

init : {'scalable-kmeans++', 'k-means||', 'random' or an ndarray} (default = 'scalable-k-means++')

'scalable-k-means++' or 'k-means||': uses fast and stable scalable kmeans++ initialization. 'random': chooses 'n_clusters' observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

n_init : int (default = 1)

Number of times initialization is run. More is slower, but can be better.

algorithm : "auto"

Currently uses full EM, but will support others later.

n_gpu : int (default = 1)

Number of GPUs to use. Currently uses single GPU, but will support multiple GPUs later.

Notes

KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters a dataset has. If one is not sure, one can start with a small number of clusters, and visualize the resulting clusters with PCA, UMAP or T-SNE, and verify that they look appropriate.
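
As a rough hypothetical illustration of the guidance above (the data and candidate cluster counts are made up), one can fit KMeans for a few values of n_clusters and inspect how the points split across labels before settling on a value.

# Hypothetical sketch: try a few cluster counts and eyeball the resulting labels.
import numpy as np
import cudf
from cuml import KMeans

X = cudf.DataFrame()
X['0'] = np.array([1.0, 1.0, 3.0, 4.0, 8.0, 9.0], dtype=np.float32)
X['1'] = np.array([1.0, 2.0, 2.0, 3.0, 8.0, 9.0], dtype=np.float32)

for k in (2, 3, 4):
    km = KMeans(n_clusters=k)
    km.fit(X)
    print(k, km.labels_)   # how the points split as n_clusters grows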

Applications of KMeans

The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners' first choice of clustering algorithm. KMeans has been extensively used when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.

For additional docs, see scikit-learn's KMeans.

Examples

# Both import methods supported
from cuml import KMeans
from cuml.cluster import KMeans

import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c,column in enumerate(df):
      pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)

Output:

input:

     0    1
 0  1.0  1.0
 1  1.0  2.0
 2  3.0  2.0
 3  4.0  3.0

Calling fit

labels:

   0    0
   1    0
   2    1
   3    1

cluster_centers:

   0    1
0  1.0  1.5
1  3.5  2.5
Attributes
cluster_centers_ : array

The coordinates of the final clusters. This represents the "mean" of each data cluster.

labels_ : array

Which cluster each datapoint belongs to.

Methods

fit(self, X)

Compute k-means clustering with X.

fit_predict(self, X)

Compute cluster centers and predict cluster index for each sample.

fit_transform(self, input_gdf)

Compute clustering and transform input_gdf to cluster-distance space.

get_params(self[, deep])

Sklearn style return parameter state

predict(self, X)

Predict the closest cluster each sample in X belongs to.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

transform(self, X)

Transform X to a cluster-distance space.

fit(self, X)

Compute k-means clustering with X.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

fit_predict(self, X)

Compute cluster centers and predict cluster index for each sample.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

fit_transform(self, input_gdf)

Compute clustering and transform input_gdf to cluster-distance space.

Parameters
input_gdf : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
predict(self, X)

Predict the closest cluster each sample in X belongs to.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params
transform(self, X)

Transform X to a cluster-distance space.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)
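
To illustrate the predict and transform methods documented above, here is a short hypothetical sketch; the training and query points are made up.

# Hypothetical sketch: fit KMeans, then assign new points to clusters and
# compute their distances to every centroid.
import numpy as np
import cudf
from cuml import KMeans

X = cudf.DataFrame()
X['0'] = np.array([1.0, 1.0, 3.0, 4.0], dtype=np.float32)
X['1'] = np.array([1.0, 2.0, 2.0, 3.0], dtype=np.float32)

km = KMeans(n_clusters=2)
km.fit(X)

new_gdf = cudf.DataFrame()
new_gdf['0'] = np.array([1.5, 3.5], dtype=np.float32)
new_gdf['1'] = np.array([1.5, 2.5], dtype=np.float32)

print(km.predict(new_gdf))      # closest cluster index for each new point
print(km.transform(new_gdf))    # distance of each new point to every centroid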

DBSCAN

class cuml.DBSCAN

DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems if the datapoints tend to congregate in larger groups.

cuML’s DBSCAN expects a cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.

Parameters
eps : float (default = 0.5)

The maximum distance between 2 points such that they reside in the same neighborhood.

handle : cuml.Handle

If it is None, a new one is created just for this class

min_samples : int (default = 5)

The number of samples in a neighborhood such that this group can be considered as an important core point (including the point itself).

verbose : bool

Whether to print debug output

max_bytes_per_batch : (optional) int64

Calculate batch size using no more than this number of bytes for the pairwise distance computation. This enables the trade-off between runtime and memory usage for making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out of memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not set the maximum total memory used in the DBSCAN computation and so this value will not be able to be set to the total memory available on the device.

Notes

DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.

Applications of DBSCAN

DBSCAN's main benefit is that the number of clusters is not a hyperparameter, and that it can find non-linearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisions in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.

For an additional example, see the DBSCAN notebook. For additional docs, see scikit-learn's DBSCAN.

Examples

# Both import methods supported
from cuml import DBSCAN
from cuml.cluster import DBSCAN

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

dbscan_float = DBSCAN(eps = 1.0, min_samples = 1)
dbscan_float.fit(gdf_float)
print(dbscan_float.labels_)

Output:

0    0
1    1
2    2
Attributes
labels_ : array

Which cluster each datapoint belongs to. Noisy samples are labeled as -1.

Methods

fit(self, X)

Perform DBSCAN clustering from features.

fit_predict(self, X)

Performs clustering on X and returns cluster labels.

get_param_names(self)

fit(self, X)

Perform DBSCAN clustering from features.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

fit_predict(self, X)

Performs clustering on X and returns cluster labels.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

Returns
y : cuDF Series, shape (n_samples)

cluster labels

get_param_names(self)
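
The fit_predict method above is a one-step alternative to calling fit and then reading labels_; here is a minimal hypothetical sketch (the data and eps/min_samples values are made up, and per the labels_ docs noise points come back as -1).

# Hypothetical sketch: one-step clustering with fit_predict.
import numpy as np
import cudf
from cuml import DBSCAN

gdf = cudf.DataFrame()
gdf['0'] = np.array([1.0, 1.1, 1.2, 8.0], dtype=np.float32)
gdf['1'] = np.array([1.0, 1.0, 1.1, 8.0], dtype=np.float32)

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(gdf)
print(labels)   # the isolated point at (8.0, 8.0) is expected to be labeled -1 (noise)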

Kalman Filter

class cuml.KalmanFilter

Implements a Kalman filter. You are responsible for setting the various state variables to reasonable values; defaults will not give you a functional filter. After construction the filter will have default matrices created for you, but you must specify the values for each.

Parameters
dim_x : int

Number of state variables for the Kalman filter. This is used to set the default size of P, Q, and u

dim_z : int

Number of measurement inputs.

Examples

import numpy as np
import numba.cuda
from cuml import KalmanFilter

f = KalmanFilter(dim_x=2, dim_z=1)
f.x = np.array([[2.],    # position
                [0.]])   # velocity
f.F = np.array([[1.,1.], [0.,1.]])
f.H = np.array([[1.,0.]])
f.P = np.array([[1000., 0.], [   0., 1000.] ])
f.R = 5

Now just perform the standard predict/update loop:

while some_condition_is_true:
    # z_measurement is a placeholder for the latest sensor reading
    z = numba.cuda.to_device(np.array([z_measurement], dtype=np.float32))
    f.predict()
    f.update(z)
Attributes
x : numba device array, numpy array or cuDF series (dim_x, 1)

Current state estimate. Any call to update() or predict() updates this variable.

P : numba device array, numpy array or cuDF dataframe (dim_x, dim_x)

Current state covariance matrix. Any call to update() or predict() updates this variable.

x_prior : numba device array, numpy array or cuDF series (dim_x, 1)

Prior (predicted) state estimate. The *_prior and *_post attributes are for convenience; they store the prior and posterior of the current epoch. Read Only.

P_prior : numba device array, numpy array or cuDF dataframe (dim_x, dim_x)

Prior (predicted) state covariance matrix. Read Only.

x_post : numba device array, numpy array or cuDF series (dim_x, 1)

Posterior (updated) state estimate. Read Only.

P_post : numba device array, numpy array or cuDF dataframe (dim_x, dim_x)

Posterior (updated) state covariance matrix. Read Only.

z : numba device array or cuDF series (dim_z, 1)

Last measurement used in update(). Read only.

R : numba device array (dim_z, dim_z)

Measurement noise matrix

Q : numba device array (dim_x, dim_x)

Process noise matrix

F : numba device array (dim_x, dim_x)

State Transition matrix

H : numba device array (dim_z, dim_x)

Measurement function

y : numba device array

Residual of the update step. Read only.

K : numba device array (dim_x, dim_z)

Kalman gain of the update step. Read only.

precision: ‘single’ or ‘double’

Whether the Kalman Filter uses single or double precision

Methods

predict(self[, B, F, Q])

Predict next state (prior) using the Kalman filter state propagation equations.

update(self, z[, R, H])

Add a new measurement (z) to the Kalman filter.

predict(self, B=None, F=None, Q=None)

Predict next state (prior) using the Kalman filter state propagation equations.

Parameters
u : np.array

Optional control vector. If not None, it is multiplied by B to create the control input into the system.

B : np.array (dim_x, dim_z), or None

Optional control transition matrix; a value of None will cause the filter to use self.B.

F : np.array (dim_x, dim_x), or None

Optional state transition matrix; a value of None will cause the filter to use self.F.

Q : np.array (dim_x, dim_x), scalar, or None

Optional process noise matrix; a value of None will cause the filter to use self.Q.

update(self, z, R=None, H=None)

Add a new measurement (z) to the Kalman filter. If z is None, nothing is computed. However, x_post and P_post are updated with the prior (x_prior, P_prior), and self.z is set to None.

Parameters
z : (dim_z, 1): array_like

measurement for this update. z can be a scalar if dim_z is 1, otherwise it must be convertible to a column vector.

R : np.array, scalar, or None

Optionally provide R to override the measurement noise for this one call, otherwise self.R will be used.

H : np.array, or None

Optionally provide H to override the measurement function for this one call, otherwise self.H will be used.
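
Putting the pieces above together, a complete hypothetical predict/update loop might look like the sketch below; the synthetic measurements, noise values and matrix entries are made up, and numpy inputs are assumed to be accepted as listed in the Attributes section.

# Hypothetical end-to-end sketch of a 1-D position/velocity filter.
import numpy as np
import numba.cuda
from cuml import KalmanFilter

f = KalmanFilter(dim_x=2, dim_z=1)
f.x = np.array([[0.], [0.]], dtype=np.float32)            # initial position and velocity
f.F = np.array([[1., 1.], [0., 1.]], dtype=np.float32)    # constant-velocity transition
f.H = np.array([[1., 0.]], dtype=np.float32)              # only position is observed
f.P = np.eye(2, dtype=np.float32) * 1000.                 # large initial uncertainty
f.Q = np.eye(2, dtype=np.float32) * 0.01                  # small process noise
f.R = 5

for measurement in [1.1, 2.0, 2.9, 4.2, 5.1]:             # made-up sensor readings
    f.predict()
    z = numba.cuda.to_device(np.array([measurement], dtype=np.float32))
    f.update(z)
    print(f.x)                                            # current state estimate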

Principal Component Analysis

class cuml.PCA

PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X in linear combinations such that each new component captures the most information or variance of the data. n_components is usually small, for example 3, where it can be used for data visualization, data compression and exploratory analysis.

cuML’s PCA expects a cuDF DataFrame, and provides 2 algorithms Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K eigenvectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K eigenvectors, but might be less accurate.

Parameters
n_components : int (default = 1)

The number of top K singular vectors / values you want. Must be <= number(columns).

svd_solver : 'full', 'jacobi' or 'auto' (default = 'full')

Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.

iterated_power : int (default = 15)

Used in Jacobi solver. The more iterations, the more accurate, but the slower.

tol : float (default = 1e-7)

Used if algorithm = "jacobi". The smaller the tolerance, the more accurate the result, but the slower the algorithm will be to converge.

random_state : int / None (default = None)

If you want results to be the same when you restart Python, select a state.

copy : boolean (default = True)

If True, the data is copied and then the mean is removed from the copy. If False, the data might be overwritten with its mean-centered version.

whiten : boolean (default = False)

If True, de-correlates the components. This is done by dividing them by the corresponding singular values then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multi-collinearity. It might be beneficial for downstream tasks like LinearRegression where correlated features cause problems.

Notes

PCA considers linear combinations of features, specifically those that maximise global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or T-SNE for a locally important embedding.

Applications of PCA

PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, large datasets of everyday objects and images, and used to distinguish between cancerous cells from healthy cells.

For an additional example see the PCA notebook. For additional docs, see scikit-learn's PCA.

Examples

# Both import methods supported
from cuml import PCA
from cuml.decomposition import PCA

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

pca_float = PCA(n_components = 2)
pca_float.fit(gdf_float)

print(f'components: {pca_float.components_}')
print(f'explained variance: {pca_float.explained_variance_}')
print(f'explained variance ratio: {pca_float.explained_variance_ratio_}')

print(f'singular values: {pca_float.singular_values_}')
print(f'mean: {pca_float.mean_}')
print(f'noise variance: {pca_float.noise_variance_}')

trans_gdf_float = pca_float.transform(gdf_float)
print(f'transformed matrix: {trans_gdf_float}')

input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
print(f'Input Matrix: {input_gdf_float}')

Output:

components:
            0           1           2
            0  0.69225764  -0.5102837 -0.51028395
            1 -0.72165036 -0.48949987  -0.4895003

explained variance:

            0   8.510402
            1 0.48959687

explained variance ratio:

             0   0.9456003
             1 0.054399658

singular values:

           0 4.1256275
           1 0.9895422

mean:

          0 2.6666667
          1 2.3333333
          2 2.3333333

noise variance:

      0  0.0

transformed matrix:
             0           1
             0   -2.8547091 -0.42891636
             1 -0.121316016  0.80743366
             2    2.9760244 -0.37851727

Input Matrix:
          0         1         2
          0 1.0000001 3.9999993       4.0
          1       2.0 2.0000002 1.9999999
          2 4.9999995 1.0000006       1.0
Attributes
components_ : array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_ : array

How much variance in the data each component explains, given by S**2

explained_variance_ratio_ : array

What fraction of the variance each component explains, given by S**2/sum(S**2)

singular_values_ : array

The top K singular values. Remember all singular values >= 0

mean_ : array

The column-wise mean of X. Used to mean-center the data first.

noise_variance_ : float

From Bishop 1999’s Textbook. Used in later tasks like calculating the estimated covariance of X.

Methods

fit(self, X[, _transform])

Fit the model with X.

fit_transform(self, X[, y])

Fit the model with X and apply the dimensionality reduction on X.

get_params(self[, deep])

Sklearn style return parameter state

inverse_transform(self, X)

Transform data back to its original space.

set_params(self, **parameter)

transform(self, X)

Apply dimensionality reduction to X.

fit(self, X, _transform=False)

Fit the model with X.

Parameters
X : cuDF DataFrame

Dense matrix (floats or doubles) of shape (n_samples, n_features)

Returns
self : the fitted PCA object
fit_transform(self, X, y=None)

Fit the model with X and apply the dimensionality reduction on X.

Parameters
X : cuDF DataFrame, shape (n_samples, n_features)

training data (floats or doubles), where n_samples is the number of samples, and n_features is the number of features.

y : ignored
Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)
get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
inverse_transform(self, X)

Transform data back to its original space.

In other words, return an input X_original whose transform would be X.

Parameters
X : cuDF DataFrame, shape (n_samples, n_components)

New data (floats or doubles), where n_samples is the number of samples and n_components is the number of components.

Returns
X_original : cuDF DataFrame, shape (n_samples, n_features)
set_params(self, **parameter)
transform(self, X)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Parameters
X : cuDF DataFrame, shape (n_samples, n_features)

New data (floats or doubles), where n_samples is the number of samples and n_features is the number of features.

Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)

Truncated SVD

class cuml.TruncatedSVD

TruncatedSVD is used to compute the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, such as in the use of PCA when 3 components is used for 3D visualization.

cuML’s TruncatedSVD expects a cuDF DataFrame, and provides 2 algorithms Full and Jacobi. Full (default) uses a full eigendecomposition then selects the top K singular vectors. The Jacobi algorithm is much faster as it iteratively tries to correct the top K singular vectors, but might be less accurate.

Parameters
n_components : int (default = 1)

The number of top K singular vectors / values you want. Must be <= number(columns).

algorithm : 'full', 'jacobi' or 'auto' (default = 'full')

Full uses an eigendecomposition of the covariance matrix then discards components. Jacobi is much faster as it iteratively corrects, but is less accurate.

n_iter : int (default = 15)

Used in Jacobi solver. The more iterations, the more accurate, but the slower.

tol : float (default = 1e-7)

Used if algorithm = "jacobi". The smaller the tolerance, the more accurate the result, but the slower the algorithm will be to converge.

random_state : int / None (default = None)

If you want results to be the same when you restart Python, select a state.

Notes

TruncatedSVD (the randomized version [Jacobi]) is fantastic when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust; however, this method loses a lot of accuracy when you want many components.

Applications of TruncatedSVD

TruncatedSVD is also known as Latent Semantic Indexing (LSI) which tries to find topics of a word count matrix. If X previously was centered with mean removal, TruncatedSVD is the same as TruncatedPCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.
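
To make the LSI interpretation above concrete, here is a hypothetical sketch applying TruncatedSVD to a tiny made-up document-term count matrix (documents as rows, vocabulary terms as columns).

# Hypothetical LSI sketch: reduce a tiny document-term count matrix to 2 "topics".
import numpy as np
import cudf
from cuml import TruncatedSVD

counts = cudf.DataFrame()
counts['gpu']   = np.array([3, 0, 2, 0], dtype=np.float32)
counts['cuda']  = np.array([2, 0, 3, 1], dtype=np.float32)
counts['pasta'] = np.array([0, 4, 0, 3], dtype=np.float32)
counts['sauce'] = np.array([0, 3, 0, 2], dtype=np.float32)

lsi = TruncatedSVD(n_components=2)
topics = lsi.fit_transform(counts)   # each document becomes a point in 2-D "topic" space
print(topics)
print(lsi.components_)               # how strongly each term loads on each topic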

For additional examples, see the Truncated SVD notebook. For additional documentation, see scikit-learn's TruncatedSVD docs.

Examples

# Both import methods supported
from cuml import TruncatedSVD
from cuml.decomposition import TruncatedSVD

import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0,2.0,5.0], dtype = np.float32)
gdf_float['1'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)
gdf_float['2'] = np.asarray([4.0,2.0,1.0], dtype = np.float32)

tsvd_float = TruncatedSVD(n_components = 2, algorithm = "jacobi", n_iter = 20, tol = 1e-9)
tsvd_float.fit(gdf_float)

print(f'components: {tsvd_float.components_}')
print(f'explained variance: {tsvd_float.explained_variance_}')
print(f'explained variance ratio: {tsvd_float.explained_variance_ratio_}')
print(f'singular values: {tsvd_float.singular_values_}')

trans_gdf_float = tsvd_float.transform(gdf_float)
print(f'Transformed matrix: {trans_gdf_float}')

input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')

Output:

components:
           0           1          2
0 0.58725953  0.57233137  0.5723314
1 0.80939883 -0.41525528 -0.4152552
explained variance:
0  55.33908
1 16.660923

explained variance ratio:
0  0.7685983
1 0.23140171

singular values:
0  7.439024
1 4.0817795

Transformed matrix:
            0            1
0   5.1659107    -2.512643
1   3.4638448 -0.042223275
2   4.0809603    3.2164836

Input matrix:
          0         1         2
0       1.0  4.000001  4.000001
1 2.0000005 2.0000005 2.0000007
2  5.000001 0.9999999 1.0000004
Attributes
components_ : array

The top K components (VT.T[:,:n_components]) in U, S, VT = svd(X)

explained_variance_ : array

How much variance in the data each component explains, given by S**2

explained_variance_ratio_ : array

What fraction of the variance each component explains, given by S**2/sum(S**2)

singular_values_ : array

The top K singular values. Remember all singular values >= 0

Methods

fit(self, X[, _transform])

Fit LSI model on training cudf DataFrame X.

fit_transform(self, X)

Fit LSI model to X and perform dimensionality reduction on X.

get_params(self[, deep])

Sklearn style return parameter state

inverse_transform(self, X)

Transform X back to its original space.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

transform(self, X)

Perform dimensionality reduction on X.

fit(self, X, _transform=True)

Fit LSI model on training cudf DataFrame X.

Parameters
X : cuDF DataFrame, dense matrix, shape (n_samples, n_features)

Training data (floats or doubles)

fit_transform(self, X)

Fit LSI model to X and perform dimensionality reduction on X.

Parameters
X : cuDF DataFrame, dense matrix, shape (n_samples, n_features)

Training data (floats or doubles)

Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)

Reduced version of X. This will always be a dense cuDF DataFrame

get_params(self, deep=True)

Sklearn style return parameter state

Parameters
deep : boolean (default = True)
inverse_transform(self, X)

Transform X back to its original space.

Returns a cuDF DataFrame X_original whose transform would be X.

Parameters
X : cuDF DataFrame, shape (n_samples, n_components)

New data.

Returns
X_original : cuDF DataFrame, shape (n_samples, n_features)

Note that this is always a dense cuDF DataFrame.

set_params(self, **params)

Sklearn style set parameter state to dictionary of params.

Parameters
params : dict of new params
transform(self, X)

Perform dimensionality reduction on X.

Parameters
X : cuDF DataFrame, dense matrix, shape (n_samples, n_features)

New data.

Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)

Reduced version of X. This will always be a dense DataFrame.

UMAP

class cuml.UMAP

Uniform Manifold Approximation and Projection. Finds a low dimensional embedding of the data that approximates an underlying manifold.

Adapted from https://github.com/lmcinnes/umap/blob/master/umap/umap.py

Parameters
n_neighbors: float (optional, default 15)

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components: int (optional, default 2)

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

n_epochs: int (optional, default None)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

learning_rate: float (optional, default 1.0)

The initial learning rate for the embedding optimization.

init: string (optional, default ‘spectral’)
How to initialize the low dimensional embedding. Options are:
  • ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton

  • ‘random’: assign initial embedding positions at random.

min_dist: float (optional, default 0.1)

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

spread: float (optional, default 1.0)

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

set_op_mix_ratio: float (optional, default 1.0)

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity: int (optional, default 1)

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

repulsion_strength: float (optional, default 1.0)

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate: int (optional, default 5)

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

transform_queue_size: float (optional, default 4.0)

For transform operations (embedding new points using a trained model) this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.

a: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

b: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

verbose: bool (optional, default False)

Controls verbosity of logging.

Notes

This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:

  • Specifying the random seed

  • Using a non-euclidean distance metric (support for a fixed set of non-euclidean metrics is planned for an upcoming release).

  • Using a pre-computed pairwise distance matrix (under consideration for future releases)

  • Manual initialization of initial embedding positions

In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP. In particular, the reference UMAP uses an approximate kNN algorithm for large data sizes while cuml.umap always uses exact kNN.
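
Examples

A minimal hypothetical sketch of fitting and transforming with UMAP, assuming a cuDF DataFrame input like the other estimators in this reference; the data and parameter values are made up for illustration.

# Hypothetical sketch: embed a tiny cuDF DataFrame into 2 dimensions with UMAP.
import numpy as np
import cudf
from cuml import UMAP

gdf = cudf.DataFrame()
gdf['0'] = np.array([1.0, 1.1, 5.0, 5.2], dtype=np.float32)
gdf['1'] = np.array([2.0, 2.1, 7.0, 7.3], dtype=np.float32)
gdf['2'] = np.array([3.0, 3.2, 9.0, 9.1], dtype=np.float32)

umap = UMAP(n_neighbors=3, n_components=2, min_dist=0.1)
embedding = umap.fit_transform(gdf)   # shape (n_samples, n_components)
print(embedding)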

Methods

fit()

Fit X into an embedded space.

fit_transform()

Fit X into an embedded space and return that transformed output.

transform()

Transform X into the existing embedded space and return that transformed output.

fit()

Fit X into an embedded space.

Parameters
X : array, shape (n_samples, n_features)

X contains a sample per row.

y : array, shape (n_samples)

y contains a label per row.

fit_transform()

Fit X into an embedded space and return that transformed output.

Parameters
X : array, shape (n_samples, n_features) or (n_samples, n_samples)

X contains a sample per row.

Returns
X_new : array, shape (n_samples, n_components)

Embedding of the training data in low-dimensional space.

transform()

Transform X into the existing embedded space and return that transformed output.

Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit() transform().

Specifically, the transform() function is stochastic: https://github.com/lmcinnes/umap/issues/158

Parameters
X : array, shape (n_samples, n_features)

New data to be transformed.

Returns
X_new : array, shape (n_samples, n_components)

Embedding of the new data in low-dimensional space.