cuML API Reference¶
Linear Regression¶

class cuml.LinearRegression¶
LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
cuML's LinearRegression expects either a cuDF DataFrame or a NumPy matrix, and provides two algorithms, SVD and Eig, to fit a linear model. SVD is more stable, but Eig (the default) is much faster.
Parameters
algorithm : 'eig' or 'svd' (default = 'eig')
Eig uses an eigendecomposition of the covariance matrix and is much faster. SVD is slower, but is guaranteed to be stable.
fit_intercept : boolean (default = True)
If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalize : boolean (default = False)
If True, each predictor in X will be normalized by dividing by its L2 norm. If False, no scaling will be done.
Notes
LinearRegression suffers from multicollinearity (when columns are correlated with each other) and from variance explosions caused by outliers. Consider using Ridge regression to fix the multicollinearity problem, and consider running DBSCAN first to remove the outliers, or using leverage statistics to filter out possible outliers.
Applications of LinearRegression
LinearRegression is used in regression tasks where one wants to predict, say, sales or house prices. It is also used in extrapolation and time series tasks, dynamic systems modelling, and many other machine learning tasks. This model should be tried first if the machine learning problem is a regression task (predicting a continuous variable).
For additional docs, see scikit-learn's OLS.
Examples
import numpy as np
import cudf

# Both import methods supported
from cuml import LinearRegression
from cuml.linear_model import LinearRegression

lr = LinearRegression(fit_intercept=True, normalize=False, algorithm="eig")

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)

y = cudf.Series(np.array([6.0, 8.0, 9.0, 11.0], dtype=np.float32))

reg = lr.fit(X, y)
print("Coefficients:")
print(reg.coef_)
print("Intercept:")
print(reg.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3, 2], dtype=np.float32)
X_new['col2'] = np.array([5, 5], dtype=np.float32)

preds = lr.predict(X_new)
print(preds)
Output:
Coefficients:
0    1.0000001
1    1.9999998
Intercept:
3.0
Preds:
0    15.999999
1    14.999999
Attributes
coef_ : array, shape (n_features)
The estimated coefficients for the linear regression model.
intercept_ : array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y)
Fit the model with X and y.
get_params(self[, deep])
Return the parameter state (scikit-learn style).
predict(self, X)
Predicts the y for X.
set_params(self, **params)
Set the parameter state from a dictionary of params (scikit-learn style).

fit(self, X, y)¶
Fit the model with X and y.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

get_params(self, deep=True)¶
Return the parameter state (scikit-learn style).
Parameters
deep : boolean (default = True)

predict(self, X)¶
Predicts the y for X.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
Returns
y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)¶
Set the parameter state from a dictionary of params (scikit-learn style).
Parameters
params : dict of new params
Ridge Regression¶

class cuml.Ridge¶
Ridge extends LinearRegression by providing L2 regularization on the coefficients when predicting response y with a linear combination of the predictors in X. It can reduce the variance of the predictors, and improves the conditioning of the problem.
cuML's Ridge expects a cuDF DataFrame and provides three algorithms, SVD, Eig and CD, to fit a linear model. SVD is more stable, but Eig (the default) is much faster. CD uses coordinate descent and can be faster when the data is large.
Parameters
alpha : float or double
Regularization strength; must be a positive float. Larger values specify stronger regularization. Array input will be supported later.
solver : 'eig', 'svd' or 'cd' (default = 'eig')
Eig uses an eigendecomposition of the covariance matrix and is much faster. SVD is slower, but is guaranteed to be stable. CD (coordinate descent) is very fast and is suitable for large problems.
fit_intercept : boolean (default = True)
If True, Ridge tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalize : boolean (default = False)
If True, each predictor in X will be normalized by dividing by its L2 norm. If False, no scaling will be done.
Notes
Ridge provides L2 regularization. This means that the coefficients can shrink to become very small, but not zero. This can cause issues of interpretability of the coefficients. Consider using Lasso, or thresholding small coefficients to zero, as sketched below.
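For instance, a minimal post-processing sketch; the data, the cutoff value, and the thresholding itself are illustrative and not part of the Ridge API:

import numpy as np
import cudf
from cuml import Ridge

# Illustrative fit (same data as the example below)
X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([6.0, 8.0, 9.0, 11.0], dtype=np.float32))
ridge = Ridge(alpha=np.array([1.0]), fit_intercept=True).fit(X, y)

# Hypothetical post-processing: zero out coefficients whose magnitude falls
# below an arbitrary cutoff, to aid interpretability.
coef = ridge.coef_.to_array()   # assumes coef_ is a cuDF Series, as in the example below
cutoff = 1e-3                   # illustrative threshold, not a recommendation
coef[np.abs(coef) < cutoff] = 0.0
print(coef)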
Applications of Ridge
Ridge Regression is used in the same way as LinearRegression, but is used more frequently as it does not suffer from multicollinearity issues. Ridge is used in insurance premium prediction, stock market analysis and much more.
For additional docs, see scikit-learn's Ridge.
Examples
import numpy as np
import cudf

# Both import methods supported
from cuml import Ridge
from cuml.linear_model import Ridge

alpha = np.array([1.0])
ridge = Ridge(alpha=alpha, fit_intercept=True, normalize=False, solver="eig")

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)

y = cudf.Series(np.array([6.0, 8.0, 9.0, 11.0], dtype=np.float32))

result_ridge = ridge.fit(X, y)
print("Coefficients:")
print(result_ridge.coef_)
print("Intercept:")
print(result_ridge.intercept_)

X_new = cudf.DataFrame()
X_new['col1'] = np.array([3, 2], dtype=np.float32)
X_new['col2'] = np.array([5, 5], dtype=np.float32)

preds = result_ridge.predict(X_new)
print(preds)
Output:
Coefficients:
0    1.0000001
1    1.9999998
Intercept:
3.0
Preds:
0    15.999999
1    14.999999
Attributes
coef_ : array, shape (n_features)
The estimated coefficients for the linear regression model.
intercept_ : array
The independent term. If fit_intercept is False, will be 0.
Methods
fit(self, X, y)
Fit the model with X and y.
get_params(self[, deep])
Return the parameter state (scikit-learn style).
predict(self, X)
Predicts the y for X.
set_params(self, **params)
Set the parameter state from a dictionary of params (scikit-learn style).

fit(self, X, y)¶
Fit the model with X and y.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

get_params(self, deep=True)¶
Return the parameter state (scikit-learn style).
Parameters
deep : boolean (default = True)

predict(self, X)¶
Predicts the y for X.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
Returns
y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

set_params(self, **params)¶
Set the parameter state from a dictionary of params (scikit-learn style).
Parameters
params : dict of new params
Stochastic Gradient Descent¶

class cuml.SGD¶
Stochastic Gradient Descent (SGD) is a very common machine learning algorithm in which one optimizes some cost function via gradient steps. This makes SGD very attractive for large problems, where the exact solution is hard or even impossible to find.
cuML's SGD algorithm accepts a NumPy matrix or a cuDF DataFrame as the input dataset. The SGD algorithm currently works with linear regression, ridge regression and SVM models.
Parameters
loss : 'hinge', 'log' or 'squared_loss' (default = 'squared_loss')
'hinge' uses linear SVM; 'log' uses logistic regression; 'squared_loss' uses linear regression.
penalty : 'none', 'l1', 'l2' or 'elasticnet' (default = 'none')
'none' does not perform any regularization; 'l1' applies an L1 norm penalty (Lasso), which minimizes the sum of the absolute values of the coefficients; 'l2' applies an L2 norm penalty (Ridge), which minimizes the sum of the squares of the coefficients; 'elasticnet' applies Elastic Net regularization, a weighted average of the L1 and L2 norms.
alpha : float (default = 0.0001)
The constant value which decides the degree of regularization.
fit_intercept : boolean (default = True)
If True, the model tries to correct for the global mean of y. If False, the model expects that you have centered the data.
epochs : int (default = 1000)
The number of times the model should iterate through the entire dataset during training.
tol : float (default = 1e-3)
The training process will stop if current_loss > previous_loss - tol.
shuffle : boolean (default = True)
If True, shuffles the training data after each epoch; if False, does not.
eta0 : float (default = 0.0)
Initial learning rate.
power_t : float (default = 0.5)
The exponent used for calculating the invscaling learning rate.
learning_rate : 'optimal', 'constant', 'invscaling' or 'adaptive' (default = 'constant')
'optimal' will be supported in the next version. 'constant' keeps the learning rate constant. 'adaptive' changes the learning rate if the training loss or the validation accuracy does not improve for n_iter_no_change epochs; the old learning rate is generally divided by 5.
n_iter_no_change : int (default = 5)
The number of epochs to train without any improvement in the model.
Notes
For additional docs, see scikit-learn's SGDClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
Examples
import numpy as np
import cudf
from cuml.solvers import SGD as cumlSGD

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([1, 1, 2, 2], dtype=np.float32))

pred_data = cudf.DataFrame()
pred_data['col1'] = np.asarray([3, 2], dtype=np.float32)
pred_data['col2'] = np.asarray([5, 5], dtype=np.float32)

cu_sgd = cumlSGD(learning_rate='constant', eta0=0.005, epochs=2000,
                 fit_intercept=True, batch_size=2, tol=0.0,
                 penalty='none', loss='squared_loss')
cu_sgd.fit(X, y)
cu_pred = cu_sgd.predict(pred_data).to_array()
print(" cuML intercept : ", cu_sgd.intercept_)
print(" cuML coef : ", cu_sgd.coef_)
print("cuML predictions : ", cu_pred)
Output:
cuML intercept :  0.004561662673950195
cuML coef :  0      0.9834546
1    0.010128272
dtype: float32
cuML predictions :  [3.0055666 2.0221121]
Methods
fit(self, X, y)
Fit the model with X and y.
predict(self, X)
Predicts the y for X.
predictClass(self, X)
Predicts the class labels for X.

fit(self, X, y)¶
Fit the model with X and y.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

predict(self, X)¶
Predicts the y for X.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
Returns
y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)

predictClass(self, X)¶
Predicts the class labels for X.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
Returns
y : cuDF DataFrame
Dense vector (floats or doubles) of shape (n_samples, 1)
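To illustrate the classification path described above, here is a minimal sketch; the data and hyperparameter values are illustrative, and 'log' loss is chosen so that predictClass returns class labels:

import numpy as np
import cudf
from cuml.solvers import SGD as cumlSGD

X = cudf.DataFrame()
X['col1'] = np.array([1, 1, 2, 2], dtype=np.float32)
X['col2'] = np.array([1, 2, 2, 3], dtype=np.float32)
y = cudf.Series(np.array([0, 0, 1, 1], dtype=np.float32))  # binary labels

# 'log' selects logistic regression (see the loss parameter above)
cu_clf = cumlSGD(loss='log', eta0=0.005, epochs=2000, fit_intercept=True, tol=0.0)
cu_clf.fit(X, y)
print(cu_clf.predictClass(X).to_array())   # predicted class labels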
Nearest Neighbors¶

class cuml.NearestNeighbors¶
NearestNeighbors is an unsupervised algorithm: to find the "closest" datapoint(s) to new unseen data, one computes a suitable "distance" between every pair of points and returns the top K datapoints with the smallest distance to the query.
cuML's KNN expects a cuDF DataFrame or a NumPy array (automatic chunking into a NumPy array will be done in a future release), and first fits a special data structure to approximate the distance calculations, allowing querying times of O(p log n) instead of the brute-force O(np), where p is the number of features.
Parameters
n_neighbors : int (default = 5)
The top K closest datapoints you want the algorithm to return. If this number is large, expect the algorithm to run slower.
should_downcast : bool (default = False)
Currently only single precision is supported in the underlying index. Setting this to True will allow double-precision input arrays to be automatically downcast to single precision.
Notes
NearestNeighbors is an instance-based model. This means the data X has to be stored in order for inference to occur.
Applications of NearestNeighbors
Applications of NearestNeighbors include recommendation systems where content-based or collaborative filtering is used. Since NearestNeighbors is a relatively simple instance-based model, it is also used in data visualization and regression / classification tasks.
For an additional example, see the NearestNeighbors notebook.
For additional docs, see scikit-learn's NearestNeighbors.
Examples
import cudf
from cuml.neighbors import NearestNeighbors
import numpy as np

np_float = np.array([
    [1, 2, 3],  # Point 1
    [1, 2, 4],  # Point 2
    [2, 2, 4]   # Point 3
]).astype('float32')

gdf_float = cudf.DataFrame()
gdf_float['dim_0'] = np.ascontiguousarray(np_float[:, 0])
gdf_float['dim_1'] = np.ascontiguousarray(np_float[:, 1])
gdf_float['dim_2'] = np.ascontiguousarray(np_float[:, 2])

print('n_samples = 3, n_dims = 3')
print(gdf_float)

nn_float = NearestNeighbors()
nn_float.fit(gdf_float)
distances, indices = nn_float.kneighbors(gdf_float, k=3)  # get 3 nearest neighbors
print(indices)
print(distances)
Output:
n_samples = 3, n_dims = 3
   dim_0  dim_1  dim_2
0    1.0    2.0    3.0
1    1.0    2.0    4.0
2    2.0    2.0    4.0

# indices:
   index_neighbor_0  index_neighbor_1  index_neighbor_2
0                 0                 1                 2
1                 1                 0                 2
2                 2                 1                 0

# distances:
   distance_neighbor_0  distance_neighbor_1  distance_neighbor_2
0                  0.0                  1.0                  2.0
1                  0.0                  1.0                  1.0
2                  0.0                  1.0                  2.0
Methods
fit(self, X)
Fit GPU index for performing nearest neighbor queries.
kneighbors(self, X[, k])
Query the GPU index for the k nearest neighbors of the row vectors in X.

fit(self, X)¶
Fit GPU index for performing nearest neighbor queries.
Parameters
X : cuDF DataFrame or numpy ndarray
Dense matrix (floats or doubles) of shape (n_samples, n_features)

kneighbors(self, X, k=None)¶
Query the GPU index for the k nearest neighbors of the row vectors in X.
Parameters
X : cuDF DataFrame or numpy ndarray
Dense matrix (floats or doubles) of shape (n_samples, n_features)
k : int
The number of neighbors
Returns
distances : cuDF DataFrame or numpy ndarray
The distances of the k nearest neighbors for each row vector in X
indices : cuDF DataFrame or numpy ndarray
The indices of the k nearest neighbors for each row vector in X
KMeans Clustering¶

class cuml.KMeans¶
KMeans is a basic but powerful clustering method which is optimized via Expectation Maximization. It randomly selects K data points in X and computes which samples are close to these points. For every cluster of points, a mean is computed (hence the name), and this becomes the new centroid.
cuML's KMeans expects a cuDF DataFrame and supports the fast, scalable KMeans++ initialization method, which is more stable than randomly selecting K points.
Parameters
handle : cuml.Handle
If it is None, a new one is created just for this class.
n_clusters : int (default = 8)
The number of centroids or clusters you want.
max_iter : int (default = 300)
The more iterations of EM, the more accurate, but the slower.
tol : float (default = 1e-4)
Stopping criterion when centroid means do not change much.
verbose : boolean (default = 0)
If True, prints diagnostic information.
random_state : int (default = 1)
If you want results to be the same when you restart Python, select a state.
precompute_distances : boolean (default = 'auto')
Not supported yet.
init : {'scalable-k-means++', 'k-means||', 'random' or an ndarray} (default = 'scalable-k-means++')
'scalable-k-means++' or 'k-means||': uses the fast and stable scalable k-means++ initialization. 'random': choose n_clusters observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
n_init : int (default = 1)
Number of times initialization is run. More is slower, but can be better.
algorithm : "auto"
Currently uses full EM, but will support others later.
n_gpu : int (default = 1)
Number of GPUs to use. Currently uses a single GPU, but will support multiple GPUs later.
Notes
KMeans requires n_clusters to be specified. This means one needs to approximately guess or know how many clusters the dataset has. If one is not sure, one can start with a small number of clusters, visualize the resulting clusters with PCA, UMAP or TSNE, and verify that they look appropriate; a sketch of this workflow follows below.
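A minimal sketch of that workflow; the data values and the cluster count are illustrative only:

import numpy as np
import cudf
from cuml import KMeans, PCA

# Two well-separated blobs in 3 dimensions (illustrative data)
X = cudf.DataFrame()
X['f0'] = np.array([1.0, 1.1, 1.2, 8.0, 8.1, 8.2], dtype=np.float32)
X['f1'] = np.array([1.0, 0.9, 1.1, 8.0, 7.9, 8.1], dtype=np.float32)
X['f2'] = np.array([1.0, 1.0, 1.0, 8.0, 8.0, 8.0], dtype=np.float32)

labels = KMeans(n_clusters=2).fit_predict(X)        # start with a small guess for K
embedding = PCA(n_components=2).fit_transform(X)    # project to 2D to inspect the clusters
print(labels)
print(embedding)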
Applications of KMeans
The biggest advantage of KMeans is its speed and simplicity. That is why KMeans is many practitioners' first choice of clustering algorithm. KMeans has been used extensively when the number of clusters is approximately known, such as in big data clustering tasks, image segmentation and medical clustering.
For additional docs, see scikit-learn's KMeans.
Examples
# Both import methods supported
from cuml import KMeans
from cuml.cluster import KMeans
import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d' % i: df[:, i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c, column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=1)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)
Output:
input:
     0    1
0  1.0  1.0
1  1.0  2.0
2  3.0  2.0
3  4.0  3.0
Calling fit
labels:
0    0
1    0
2    1
3    1
cluster_centers:
     0    1
0  1.0  1.5
1  3.5  2.5
Attributes
cluster_centers_ : array
The coordinates of the final clusters. This represents the "mean" of each data cluster.
labels_ : array
Which cluster each datapoint belongs to.
Methods
fit(self, X)
Compute k-means clustering with X.
fit_predict(self, X)
Compute cluster centers and predict the cluster index for each sample.
fit_transform(self, input_gdf)
Compute clustering and transform input_gdf to cluster-distance space.
get_params(self[, deep])
Return the parameter state (scikit-learn style).
predict(self, X)
Predict the closest cluster each sample in X belongs to.
set_params(self, **params)
Set the parameter state from a dictionary of params (scikit-learn style).
transform(self, X)
Transform X to a cluster-distance space.

fit(self, X)¶
Compute k-means clustering with X.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)

fit_predict(self, X)¶
Compute cluster centers and predict the cluster index for each sample.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)

fit_transform(self, input_gdf)¶
Compute clustering and transform input_gdf to cluster-distance space.
Parameters
input_gdf : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)

get_params(self, deep=True)¶
Return the parameter state (scikit-learn style).
Parameters
deep : boolean (default = True)

predict(self, X)¶
Predict the closest cluster each sample in X belongs to.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)

set_params(self, **params)¶
Set the parameter state from a dictionary of params (scikit-learn style).
Parameters
params : dict of new params

transform(self, X)¶
Transform X to a cluster-distance space.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
DBSCAN¶

class cuml.DBSCAN¶
DBSCAN is a very powerful yet fast clustering technique that finds clusters where data is concentrated. This allows DBSCAN to generalize to many problems as long as the datapoints tend to congregate in larger groups.
cuML's DBSCAN expects a cuDF DataFrame, and constructs an adjacency graph to compute the distances between close neighbours.
Parameters
eps : float (default = 0.5)
The maximum distance between 2 points such that they reside in the same neighborhood.
handle : cuml.Handle
If it is None, a new one is created just for this class.
min_samples : int (default = 5)
The number of samples in a neighborhood such that this group can be considered an important core point (including the point itself).
verbose : bool
Whether to print debug output.
max_bytes_per_batch : (optional) int64
Calculate batch size using no more than this number of bytes for the pairwise distance computation. This enables a trade-off between runtime and memory usage, making the N^2 pairwise distance computations more tractable for large numbers of samples. If you are experiencing out-of-memory errors when running DBSCAN, you can set this value based on the memory size of your device. Note: this option does not cap the maximum total memory used in the DBSCAN computation, so it cannot be set to the total memory available on the device. (See the sketch after the example below.)
Notes
DBSCAN is very sensitive to the distance metric it is used with, and a large assumption is that datapoints need to be concentrated in groups for clusters to be constructed.
Applications of DBSCAN
DBSCAN’s main benefit is that the number of clusters is not a hyperparameter, and that it can find nonlinearly shaped clusters. This also allows DBSCAN to be robust to noise. DBSCAN has been applied to analyzing particle collisons in the Large Hadron Collider, customer segmentation in marketing analyses, and much more.
For an additional example, see the DBSCAN notebook. For additional docs, see scikit-learn's DBSCAN.
Examples
# Both import methods supported
from cuml import DBSCAN
from cuml.cluster import DBSCAN
import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0, 2.0, 5.0], dtype=np.float32)
gdf_float['1'] = np.asarray([4.0, 2.0, 1.0], dtype=np.float32)
gdf_float['2'] = np.asarray([4.0, 2.0, 1.0], dtype=np.float32)

dbscan_float = DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(gdf_float)
print(dbscan_float.labels_)
Output:
0    0
1    1
2    2
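If DBSCAN runs out of device memory on larger inputs, the max_bytes_per_batch parameter described above caps the pairwise-distance workspace. A minimal sketch; the byte budget shown is arbitrary and for illustration only:

from cuml import DBSCAN
import cudf
import numpy as np

gdf = cudf.DataFrame()
gdf['0'] = np.asarray([1.0, 2.0, 5.0], dtype=np.float32)
gdf['1'] = np.asarray([4.0, 2.0, 1.0], dtype=np.float32)

# Illustrative budget: cap each pairwise-distance batch at roughly 1 GB.
# This trades runtime for lower peak memory; it does not bound total usage.
dbscan_batched = DBSCAN(eps=1.0, min_samples=1, max_bytes_per_batch=1000000000)
dbscan_batched.fit(gdf)
print(dbscan_batched.labels_)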
Attributes
labels_ : array
Which cluster each datapoint belongs to. Noisy samples are labeled as -1.
Methods
fit(self, X)
Perform DBSCAN clustering from features.
fit_predict(self, X)
Perform clustering on X and return the cluster labels.
get_param_names(self)
fit(self, X)¶
Perform DBSCAN clustering from features.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)

fit_predict(self, X)¶
Performs clustering on X and returns the cluster labels.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
Returns
y : cuDF Series, shape (n_samples)
Cluster labels

get_param_names(self)¶
Kalman Filter¶

class cuml.KalmanFilter¶
Implements a Kalman filter. You are responsible for setting the various state variables to reasonable values; the defaults will not give you a functional filter. After construction, the filter will have default matrices created for you, but you must specify the values for each.
Parameters
dim_x : int
Number of state variables for the Kalman filter. This is used to set the default size of P, Q, and u.
dim_z : int
Number of measurement inputs.
Examples
import numpy as np
from cuml import KalmanFilter

f = KalmanFilter(dim_x=2, dim_z=1)
f.x = np.array([[2.],    # position
                [0.]])   # velocity
f.F = np.array([[1., 1.],
                [0., 1.]])
f.H = np.array([[1., 0.]])
f.P = np.array([[1000., 0.],
                [0., 1000.]])
f.R = 5
Now just perform the standard predict/update loop:
import numba.cuda

while some_condition_is_true:   # placeholder loop condition
    z = numba.cuda.to_device(np.array([i]))   # i: placeholder for the current measurement
    f.predict()
    f.update(z)
Attributes
x : numba device array, numpy array or cuDF series (dim_x, 1)
Current state estimate. Any call to update() or predict() updates this variable.
P : numba device array, numpy array or cuDF dataframe (dim_x, dim_x)
Current state covariance matrix. Any call to update() or predict() updates this variable.
x_prior : numba device array, numpy array or cuDF series (dim_x, 1)
Prior (predicted) state estimate. The *_prior and *_post attributes are for convenience; they store the prior and posterior of the current epoch. Read only.
P_prior : numba device array, numpy array or cuDF dataframe (dim_x, dim_x)
Prior (predicted) state covariance matrix. Read only.
x_post : numba device array, numpy array or cuDF series (dim_x, 1)
Posterior (updated) state estimate. Read only.
P_post : numba device array, numpy array or cuDF dataframe (dim_x, dim_x)
Posterior (updated) state covariance matrix. Read only.
z : numba device array or cuDF series (dim_z, 1)
Last measurement used in update(). Read only.
R : numba device array (dim_z, dim_z)
Measurement noise matrix.
Q : numba device array (dim_x, dim_x)
Process noise matrix.
F : numba device array (dim_x, dim_x)
State transition matrix.
H : numba device array (dim_z, dim_x)
Measurement function.
y : numba device array
Residual of the update step. Read only.
K : numba device array (dim_x, dim_z)
Kalman gain of the update step. Read only.
precision : 'single' or 'double'
Whether the Kalman Filter uses single or double precision.
Methods
predict(self[, B, F, Q])
Predict next state (prior) using the Kalman filter state propagation equations.
update(self, z[, R, H])
Add a new measurement (z) to the Kalman filter.

predict(self, B=None, F=None, Q=None)¶
Predict next state (prior) using the Kalman filter state propagation equations.
Parameters
B : np.array (dim_x, dim_z), or None
Optional control transition matrix; a value of None will cause the filter to use self.B.
F : np.array (dim_x, dim_x), or None
Optional state transition matrix; a value of None will cause the filter to use self.F.
Q : np.array (dim_x, dim_x), scalar, or None
Optional process noise matrix; a value of None will cause the filter to use self.Q.

update(self, z, R=None, H=None)¶
Add a new measurement (z) to the Kalman filter. If z is None, nothing is computed. However, x_post and P_post are updated with the prior (x_prior, P_prior), and self.z is set to None.
Parameters
z : (dim_z, 1) array_like
Measurement for this update. z can be a scalar if dim_z is 1, otherwise it must be convertible to a column vector.
R : np.array, scalar, or None
Optionally provide R to override the measurement noise for this one call, otherwise self.R will be used.
H : np.array, or None
Optionally provide H to override the measurement function for this one call, otherwise self.H will be used.
Principal Component Analysis¶

class cuml.PCA¶
PCA (Principal Component Analysis) is a fundamental dimensionality reduction technique used to combine features in X into linear combinations such that each new component captures the most information, or variance, of the data. n_components is usually small, say 3, in which case it can be used for data visualization, data compression and exploratory analysis.
cuML's PCA expects a cuDF DataFrame and provides two algorithms, Full and Jacobi. Full (the default) uses a full eigendecomposition and then selects the top K eigenvectors. The Jacobi algorithm is much faster, as it iteratively tries to correct the top K eigenvectors, but might be less accurate.
Parameters
n_components : int (default = 1)
The number of top K singular vectors / values you want. Must be <= the number of columns.
svd_solver : 'full', 'jacobi' or 'auto' (default = 'full')
'full' uses an eigendecomposition of the covariance matrix and then discards components. 'jacobi' is much faster, as it iteratively corrects, but is less accurate.
iterated_power : int (default = 15)
Used in the Jacobi solver. The more iterations, the more accurate, but the slower.
tol : float (default = 1e-7)
Used if algorithm = "jacobi". The smaller the tolerance, the more accurate, but the slower the algorithm will be to converge.
random_state : int / None (default = None)
If you want results to be the same when you restart Python, select a state.
copy : boolean (default = True)
If True, the data is copied before the mean is removed. If False, the data may be overwritten with its mean-centered version.
whiten : boolean (default = False)
If True, de-correlates the components. This is done by dividing them by the corresponding singular values and then multiplying by sqrt(n_samples). Whitening allows each component to have unit variance and removes multicollinearity. It might be beneficial for downstream tasks like LinearRegression, where correlated features cause problems.
Notes
PCA considers linear combinations of features, specifically those that maximise global variance structure. This means PCA is fantastic for global structure analyses, but weak for local relationships. Consider UMAP or TSNE for a locally important embedding.
Applications of PCA
PCA is used extensively in practice for data visualization and data compression. It has been used to visualize extremely large word embeddings like Word2Vec and GloVe in 2 or 3 dimensions, to visualize large datasets of everyday objects and images, and to distinguish cancerous cells from healthy cells.
For an additional example, see the PCA notebook. For additional docs, see scikit-learn's PCA.
Examples
# Both import methods supported
from cuml import PCA
from cuml.decomposition import PCA
import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0, 2.0, 5.0], dtype=np.float32)
gdf_float['1'] = np.asarray([4.0, 2.0, 1.0], dtype=np.float32)
gdf_float['2'] = np.asarray([4.0, 2.0, 1.0], dtype=np.float32)

pca_float = PCA(n_components=2)
pca_float.fit(gdf_float)

print(f'components: {pca_float.components_}')
print(f'explained variance: {pca_float.explained_variance_}')
print(f'explained variance ratio: {pca_float.explained_variance_ratio_}')
print(f'singular values: {pca_float.singular_values_}')
print(f'mean: {pca_float.mean_}')
print(f'noise variance: {pca_float.noise_variance_}')

trans_gdf_float = pca_float.transform(gdf_float)
print(f'transformed matrix: {trans_gdf_float}')

input_gdf_float = pca_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')
Output:
components:
            0           1           2
0  0.69225764   0.5102837  0.51028395
1  0.72165036  0.48949987   0.4895003
explained variance:
0    8.510402
1    0.48959687
explained variance ratio:
0    0.9456003
1    0.054399658
singular values:
0    4.1256275
1    0.9895422
mean:
0    2.6666667
1    2.3333333
2    2.3333333
noise variance:
0    0.0
transformed matrix:
             0           1
0    2.8547091  0.42891636
1  0.121316016  0.80743366
2    2.9760244  0.37851727
Input matrix:
           0          1          2
0  1.0000001  3.9999993        4.0
1        2.0  2.0000002  1.9999999
2  4.9999995  1.0000006        1.0
Attributes
components_ : array
The top K components (VT.T[:, :n_components]) in U, S, VT = svd(X)
explained_variance_ : array
How much each component explains the variance of the data, given by S**2
explained_variance_ratio_ : array
How much of the variance is explained, as a fraction, given by S**2 / sum(S**2)
singular_values_ : array
The top K singular values. Remember all singular values are >= 0
mean_ : array
The column-wise mean of X. Used to mean-center the data first.
noise_variance_ : float
From Bishop's 1999 textbook. Used in later tasks like computing the estimated covariance of X.
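These definitions can be checked on the host with NumPy alone. The following sketch reuses the example data above; the (n_samples - 1) divisor for explained_variance_ is inferred from the example output, not stated in the attribute docs:

import numpy as np

X = np.array([[1.0, 4.0, 4.0],
              [2.0, 2.0, 2.0],
              [5.0, 1.0, 1.0]], dtype=np.float32)   # same data as the example above

Xc = X - X.mean(axis=0)                 # PCA mean-centers the data first (mean_)
U, S, VT = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
components = VT[:n_components]                                   # components_
explained_variance = S[:n_components]**2 / (X.shape[0] - 1)      # 8.510402, 0.48959687
explained_variance_ratio = S[:n_components]**2 / np.sum(S**2)    # 0.9456003, 0.054399658
print(components)   # signs may be flipped relative to cuML (SVD sign ambiguity)
print(explained_variance)
print(explained_variance_ratio)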
Methods
fit(self, X[, _transform])
Fit the model with X.
fit_transform(self, X[, y])
Fit the model with X and apply the dimensionality reduction on X.
get_params(self[, deep])
Return the parameter state (scikit-learn style).
inverse_transform(self, X)
Transform data back to its original space.
set_params(self, **parameter)
transform(self, X)
Apply dimensionality reduction to X.

fit(self, X, _transform=False)¶
Fit the model with X.
Parameters
X : cuDF DataFrame
Dense matrix (floats or doubles) of shape (n_samples, n_features)
Returns
The fitted model.

fit_transform(self, X, y=None)¶
Fit the model with X and apply the dimensionality reduction on X.
Parameters
X : cuDF DataFrame, shape (n_samples, n_features)
Training data (floats or doubles), where n_samples is the number of samples and n_features is the number of features.
y : ignored
Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)

get_params(self, deep=True)¶
Return the parameter state (scikit-learn style).
Parameters
deep : boolean (default = True)

inverse_transform(self, X)¶
Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
Parameters
X : cuDF DataFrame, shape (n_samples, n_components)
New data (floats or doubles), where n_samples is the number of samples and n_components is the number of components.
Returns
X_original : cuDF DataFrame, shape (n_samples, n_features)

set_params(self, **parameter)¶

transform(self, X)¶
Apply dimensionality reduction to X.
X is projected onto the first principal components previously extracted from a training set.
Parameters
X : cuDF DataFrame, shape (n_samples, n_features)
New data (floats or doubles), where n_samples is the number of samples and n_features is the number of features.
Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)
Truncated SVD¶

class cuml.TruncatedSVD¶
TruncatedSVD is used to compute the top K singular values and vectors of a large matrix X. It is much faster when n_components is small, such as when using PCA with 3 components for 3D visualization.
cuML's TruncatedSVD expects a cuDF DataFrame and provides two algorithms, Full and Jacobi. Full (the default) uses a full eigendecomposition and then selects the top K singular vectors. The Jacobi algorithm is much faster, as it iteratively tries to correct the top K singular vectors, but might be less accurate.
Parameters
n_components : int (default = 1)
The number of top K singular vectors / values you want. Must be <= the number of columns.
algorithm : 'full', 'jacobi' or 'auto' (default = 'full')
'full' uses an eigendecomposition of the covariance matrix and then discards components. 'jacobi' is much faster, as it iteratively corrects, but is less accurate.
n_iter : int (default = 15)
Used in the Jacobi solver. The more iterations, the more accurate, but the slower.
tol : float (default = 1e-7)
Used if algorithm = "jacobi". The smaller the tolerance, the more accurate, but the slower the algorithm will be to converge.
random_state : int / None (default = None)
If you want results to be the same when you restart Python, select a state.
Notes
TruncatedSVD (in its randomized version, Jacobi) is fantastic when the number of components you want is much smaller than the number of features. The approximation to the largest singular values and vectors is very robust; however, this method loses a lot of accuracy when you want many components.
Applications of TruncatedSVD
TruncatedSVD is also known as Latent Semantic Indexing (LSI), which tries to find the topics of a word count matrix. If X was previously centered by removing its mean, TruncatedSVD gives the same result as PCA. TruncatedSVD is also used in information retrieval tasks, recommendation systems and data compression.
For additional examples, see the Truncated SVD notebook. For additional documentation, see scikit-learn's TruncatedSVD docs.
Examples
# Both import methods supported
from cuml import TruncatedSVD
from cuml.decomposition import TruncatedSVD
import cudf
import numpy as np

gdf_float = cudf.DataFrame()
gdf_float['0'] = np.asarray([1.0, 2.0, 5.0], dtype=np.float32)
gdf_float['1'] = np.asarray([4.0, 2.0, 1.0], dtype=np.float32)
gdf_float['2'] = np.asarray([4.0, 2.0, 1.0], dtype=np.float32)

tsvd_float = TruncatedSVD(n_components=2, algorithm="jacobi", n_iter=20,
                          tol=1e-9)
tsvd_float.fit(gdf_float)

print(f'components: {tsvd_float.components_}')
print(f'explained variance: {tsvd_float.explained_variance_}')
print(f'explained variance ratio: {tsvd_float.explained_variance_ratio_}')
print(f'singular values: {tsvd_float.singular_values_}')

trans_gdf_float = tsvd_float.transform(gdf_float)
print(f'Transformed matrix: {trans_gdf_float}')

input_gdf_float = tsvd_float.inverse_transform(trans_gdf_float)
print(f'Input matrix: {input_gdf_float}')
Output:
components:
            0           1           2
0  0.58725953  0.57233137   0.5723314
1  0.80939883  0.41525528   0.4152552
explained variance:
0    55.33908
1    16.660923
explained variance ratio:
0    0.7685983
1    0.23140171
singular values:
0    7.439024
1    4.0817795
Transformed matrix:
           0            1
0  5.1659107     2.512643
1  3.4638448  0.042223275
2  4.0809603    3.2164836
Input matrix:
           0          1          2
0        1.0   4.000001   4.000001
1  2.0000005  2.0000005  2.0000007
2   5.000001  0.9999999  1.0000004
Attributes
components_ : array
The top K components (VT.T[:, :n_components]) in U, S, VT = svd(X)
explained_variance_ : array
How much each component explains the variance of the data, given by S**2
explained_variance_ratio_ : array
How much of the variance is explained, as a fraction, given by S**2 / sum(S**2)
singular_values_ : array
The top K singular values. Remember all singular values are >= 0
Methods
fit(self, X[, _transform])
Fit the LSI model on training cuDF DataFrame X.
fit_transform(self, X)
Fit the LSI model to X and perform dimensionality reduction on X.
get_params(self[, deep])
Return the parameter state (scikit-learn style).
inverse_transform(self, X)
Transform X back to its original space.
set_params(self, **params)
Set the parameter state from a dictionary of params (scikit-learn style).
transform(self, X)
Perform dimensionality reduction on X.

fit(self, X, _transform=True)¶
Fit the LSI model on training cuDF DataFrame X.
Parameters
X : cuDF DataFrame, dense matrix, shape (n_samples, n_features)
Training data (floats or doubles)

fit_transform(self, X)¶
Fit the LSI model to X and perform dimensionality reduction on X.
Parameters
X : cuDF DataFrame, dense matrix, shape (n_samples, n_features)
Training data (floats or doubles)
Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)
Reduced version of X. This will always be a dense cuDF DataFrame.

get_params(self, deep=True)¶
Return the parameter state (scikit-learn style).
Parameters
deep : boolean (default = True)

inverse_transform(self, X)¶
Transform X back to its original space.
Returns a cuDF DataFrame X_original whose transform would be X.
Parameters
X : cuDF DataFrame, shape (n_samples, n_components)
New data.
Returns
X_original : cuDF DataFrame, shape (n_samples, n_features)
Note that this is always a dense cuDF DataFrame.

set_params(self, **params)¶
Set the parameter state from a dictionary of params (scikit-learn style).
Parameters
params : dict of new params

transform(self, X)¶
Perform dimensionality reduction on X.
Parameters
X : cuDF DataFrame, dense matrix, shape (n_samples, n_features)
New data.
Returns
X_new : cuDF DataFrame, shape (n_samples, n_components)
Reduced version of X. This will always be a dense DataFrame.
UMAP¶

class cuml.UMAP¶
Uniform Manifold Approximation and Projection (UMAP) finds a low-dimensional embedding of the data that approximates an underlying manifold.
Adapted from https://github.com/lmcinnes/umap/blob/master/umap/umap.py
Parameters
n_neighbors : float (optional, default 15)
The size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.
n_components : int (optional, default 2)
The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value.
n_epochs : int (optional, default None)
The number of training epochs to be used in optimizing the low-dimensional embedding. Larger values result in more accurate embeddings. If None is specified, a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
learning_rate : float (optional, default 1.0)
The initial learning rate for the embedding optimization.
init : string (optional, default 'spectral')
How to initialize the low-dimensional embedding. Options are:
'spectral': use a spectral embedding of the fuzzy 1-skeleton
'random': assign initial embedding positions at random.
min_dist : float (optional, default 0.1)
The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding, where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
spread : float (optional, default 1.0)
The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
set_op_mix_ratio : float (optional, default 1.0)
Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
local_connectivity : int (optional, default 1)
The local connectivity required, i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value, the more connected the manifold becomes locally. In practice this should not be more than the local intrinsic dimension of the manifold.
repulsion_strength : float (optional, default 1.0)
Weighting applied to negative samples in low-dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
negative_sample_rate : int (optional, default 5)
The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
transform_queue_size : float (optional, default 4.0)
For transform operations (embedding new points using a trained model), this controls how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.
a : float (optional, default None)
More specific parameter controlling the embedding. If None, this value is set automatically as determined by min_dist and spread.
b : float (optional, default None)
More specific parameter controlling the embedding. If None, this value is set automatically as determined by min_dist and spread.
verbose : bool (optional, default False)
Controls verbosity of logging.
Notes
This module is heavily based on Leland McInnes’ reference UMAP package. However, there are a number of differences and features that are not yet implemented in cuml.umap:
Specifying the random seed
Using a non-Euclidean distance metric (support for a fixed set of non-Euclidean metrics is planned for an upcoming release).
Using a precomputed pairwise distance matrix (under consideration for future releases)
Manual initialization of initial embedding positions
In addition to these missing features, you should expect to see the final embeddings differing between cuml.umap and the reference UMAP. In particular, the reference UMAP uses an approximate kNN algorithm for large data sizes while cuml.umap always uses exact kNN.
References
Leland McInnes, John Healy, James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://arxiv.org/abs/1802.03426
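Examples
The UMAP section above does not include a usage example, so the following minimal sketch mirrors the other estimators' examples; the data values and parameter choices are illustrative only:

import numpy as np
import cudf
from cuml import UMAP

# Tiny illustrative 3-D dataset, embedded into 2-D
gdf_float = cudf.DataFrame()
gdf_float['0'] = np.array([1.0, 2.0, 5.0, 1.1, 2.1, 5.1], dtype=np.float32)
gdf_float['1'] = np.array([4.0, 2.0, 1.0, 4.1, 2.1, 1.1], dtype=np.float32)
gdf_float['2'] = np.array([4.0, 2.0, 1.0, 3.9, 1.9, 0.9], dtype=np.float32)

umap_model = UMAP(n_neighbors=3, n_components=2, min_dist=0.1)
embedding = umap_model.fit_transform(gdf_float)   # shape (n_samples, n_components)
print(embedding)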
Methods
fit(self, X, y)
Fit X into an embedded space.
fit_transform(self, X)
Fit X into an embedded space and return that transformed output.
transform(self, X)
Transform X into the existing embedded space and return that transformed output.

fit(self, X, y)¶
Fit X into an embedded space.
Parameters
X : array, shape (n_samples, n_features)
X contains a sample per row.
y : array, shape (n_samples)
y contains a label per row.

fit_transform(self, X)¶
Fit X into an embedded space and return that transformed output.
Parameters
X : array, shape (n_samples, n_features) or (n_samples, n_samples)
X contains a sample per row.
Returns
X_new : array, shape (n_samples, n_components)
Embedding of the training data in low-dimensional space.

transform(self, X)¶
Transform X into the existing embedded space and return that transformed output.
Please refer to the reference UMAP implementation for information on the differences between fit_transform() and running fit() followed by transform().
Specifically, the transform() function is stochastic: https://github.com/lmcinnes/umap/issues/158
Parameters
X : array, shape (n_samples, n_features)
New data to be transformed.
Returns
X_new : array, shape (n_samples, n_components)
Embedding of the new data in low-dimensional space.