Categorical Encoders

class legateboost.encoder.TargetEncoder(target_type: str, smooth: str | float = 'auto', cv: int = 5, shuffle: bool = True, random_state: RandomState | None = None)

TargetEncoder is a transformer that encodes categorical features using the mean of the target variable. When fit_transform is called, a cross-validation procedure is used: encodings are computed from the training folds and then applied to the held-out fold. fit().transform() differs from fit_transform() in that the former fits the encoder on all the data and applies the resulting encodings to each feature.

This encoder is modelled on the sklearn TargetEncoder, with only minor differences in how the CV folds are generated. Because it is difficult to rearrange and gather data from each fold in a distributed environment, training rows are kept in place and each row is assigned a CV fold by generating a random integer in the range [0, n_folds). As in sklearn, when smooth=“auto”, an empirical Bayes estimate per [1] is used to avoid overfitting.
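The fold assignment described above can be illustrated with a small standard-library sketch. It is not the legateboost implementation: the helper names mean_by_category and cross_fit_encode are hypothetical, smoothing is left out, and the fallback for categories missing from the training folds is an assumption.

```python
import random
from collections import defaultdict


def mean_by_category(cats, ys):
    """Map each category to the mean target of the rows that carry it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, ys):
        sums[c] += y
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def cross_fit_encode(cats, ys, n_folds=5, seed=0):
    """Encode every row from the statistics of the folds it does not belong to."""
    if not cats:
        return [], []
    rng = random.Random(seed)
    # Rows stay in place; each one draws a fold id in [0, n_folds).
    fold_ids = [rng.randrange(n_folds) for _ in cats]
    encoded = [0.0] * len(cats)
    for k in range(n_folds):
        train_c = [c for c, f in zip(cats, fold_ids) if f != k]
        train_y = [y for y, f in zip(ys, fold_ids) if f != k]
        means = mean_by_category(train_c, train_y)
        # Assumption: categories unseen in the training folds get the mean
        # target of those folds (or of all rows if the training side is empty).
        source = train_y if train_y else ys
        fallback = sum(source) / len(source)
        for i, f in enumerate(fold_ids):
            if f == k:
                encoded[i] = means.get(cats[i], fallback)
    return encoded, fold_ids
```

In this sketch, calling mean_by_category on all rows roughly corresponds to the fit().transform() path, while cross_fit_encode corresponds to fit_transform(). Because fold ids are drawn at random, the folds are not guaranteed to be equally sized.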

Parameters:
  • target_type (str) – The type of target variable. Must be one of {“continuous”, “binary”, “multiclass”}.

  • smooth (str or float, default=“auto”) – Smoothing parameter to avoid overfitting. If “auto”, the smoothing parameter is determined automatically.

  • cv (int, default=5) – Number of cross-validation folds. If 0, no cross-validation is performed.

  • shuffle (bool, default=True) – Whether to shuffle the data before splitting into folds.

  • random_state (int or None, default=None) – Seed for the random number generator.
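As a rough illustration of the smooth parameter, the sketch below follows the shrinkage formula given in the scikit-learn TargetEncoder documentation: the category mean is blended with the global mean using the weight n_i / (n_i + m), and with smooth=“auto” m is the within-category variance divided by the global variance. That legateboost matches this term for term is an assumption, and the helper name shrunk_mean is hypothetical.

```python
from statistics import fmean, pvariance


def shrunk_mean(y_cat, y_all, smooth="auto"):
    """Blend a category mean with the global mean.

    y_cat: non-empty list of targets for one category.
    y_all: non-empty list of all targets.
    """
    n_i = len(y_cat)
    cat_mean, global_mean = fmean(y_cat), fmean(y_all)
    if smooth == "auto":
        global_var = pvariance(y_all)
        if global_var == 0.0:
            # Constant target: every category mean equals the global mean.
            return global_mean
        # Empirical Bayes estimate: noisy categories are shrunk more.
        m = pvariance(y_cat) / global_var
    else:
        m = float(smooth)
    lam = n_i / (n_i + m)
    return lam * cat_mean + (1.0 - lam) * global_mean
```

With smooth=0 the category mean is returned unchanged; larger values pull small categories towards the global mean.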

n_features_in_

Number of features seen during fit.

Type:

int

categories_

List of unique categories for each feature.

Type:

list of arrays

categories_sparse_matrix_

Concatenated array of unique categories for all features.

Type:

array

categories_row_pointers_

Array of row pointers for the concatenated categories array.

Type:

array
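One plausible reading of categories_sparse_matrix_ and categories_row_pointers_ is a CSR-style layout, where the row pointers hold the start offset of each feature's categories plus a final end offset. This is an assumption about the layout rather than something the reference states, and the helpers pack_categories and categories_of are hypothetical.

```python
def pack_categories(categories):
    """Flatten per-feature category lists into one list plus offsets."""
    flat, row_pointers = [], [0]
    for cats in categories:
        flat.extend(cats)
        # Each pointer marks where the next feature's categories begin.
        row_pointers.append(len(flat))
    return flat, row_pointers


def categories_of(flat, row_pointers, feature):
    """Recover the categories of one feature from the packed form."""
    return flat[row_pointers[feature]:row_pointers[feature + 1]]
```

For example, two features with categories [0, 1, 2] and [5, 7] pack into the flat list [0, 1, 2, 5, 7] with row pointers [0, 3, 5].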

encodings_

List of encoding arrays for each feature.

Type:

list of arrays

target_mean_

Mean of the target variable.

Type:

array

fit(X: array, y: array) TargetEncoder

Fit the encoder to the data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input samples.

  • y (array-like of shape (n_samples,)) – The target values. Cannot be None.

Returns:

self – Fitted encoder.

Return type:

TargetEncoder

Raises:

ValueError – If the target y is None.

fit_transform(X: array, y: array) array

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – Target values. Required, because the encodings are computed from the target.

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: array) array

Transforms the input data X using the target encoding.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input data to transform.

Returns:

X_out – The transformed data with target encoding applied.

Return type:

ndarray of shape (n_samples, n_features * encoding_dim)

Raises:

ValueError – If the number of features in X does not match the number of features the encoder was fitted with.
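The lookup performed by transform can be sketched in plain Python for the single-output case, using lists shaped like the categories_, encodings_ and target_mean_ attributes. The helper apply_encodings is hypothetical, and falling back to the target mean for unseen categories is an assumption borrowed from the sklearn TargetEncoder, not something this reference confirms.

```python
def apply_encodings(X, categories, encodings, target_mean):
    """Replace each category with its fitted encoding (single-output case)."""
    n_features = len(categories)
    out = []
    for row in X:
        if len(row) != n_features:
            raise ValueError(
                f"X has {len(row)} features, but the encoder was fitted with {n_features}"
            )
        encoded_row = []
        for j, value in enumerate(row):
            if value in categories[j]:
                encoded_row.append(encodings[j][categories[j].index(value)])
            else:
                # Assumption: unseen categories fall back to the target mean.
                encoded_row.append(target_mean)
        out.append(encoded_row)
    return out
```

A row with the wrong number of features raises ValueError, mirroring the check documented above.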