Feature Extraction#

Text Feature Extraction#

The following two vectorizers are designed to be used together in a pipeline, for example:

from sklearn.pipeline import Pipeline

from legate_raft.sklearn.feature_extraction.text import (
    HashingVectorizer, TfidfTransformer)
from legate_raft.sklearn.naive_bayes import MultinomialNB

bayesTfIDF = Pipeline(
    [
        ("hv", HashingVectorizer(n_features=2**17)),
        ("tf-idf", TfidfTransformer()),
        ("mnb", MultinomialNB()),
    ]
)

The two classes are documented below:

class legate_raft.sklearn.feature_extraction.text.HashingVectorizer(n_features: int, *, seed: int = 42)#

Convert a collection of texts to a matrix of token occurrences. This mirrors sklearn.feature_extraction.text.HashingVectorizer.

Meant to be used together with the TfidfTransformer and MultinomialNB.

Parameters:
  • n_features (int) – Number of features in the output.

  • seed (int, default=42) – Seed used for hashing.

fit(X=None) HashingVectorizer#
fit_transform(column: LogicalColumn) COOStore#
transform(column: LogicalColumn) COOStore#
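
The hashing trick that transform applies can be sketched in plain Python. The whitespace tokenizer and Python's built-in hash below are illustrative stand-ins for the library's actual tokenization and seeded hash function, and the dict-of-counts output only approximates the sparse COOStore result:

```python
def hashing_vectorize(docs, n_features=2**4, seed=42):
    """Map each document to a sparse {column: count} dict via the hashing trick."""
    rows = []
    for doc in docs:
        counts = {}
        for token in doc.lower().split():
            # hash() stands in for the library's seeded hash; each token is
            # mapped to one of n_features columns, so no vocabulary is stored
            col = hash((seed, token)) % n_features
            counts[col] = counts.get(col, 0) + 1
        rows.append(counts)
    return rows

rows = hashing_vectorize(["to be or not to be"], n_features=16)
```

Because the mapping is a fixed hash rather than a learned vocabulary, fit has nothing to learn, which is why it accepts X=None and simply returns the vectorizer.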
class legate_raft.sklearn.feature_extraction.text.TfidfTransformer(*, norm=None, use_idf=True, smooth_idf=True, output_type=None)#

Transform a count matrix to a normalized tf-idf representation. This mirrors sklearn.feature_extraction.text.TfidfTransformer.

Meant to be used together with the HashingVectorizer and MultinomialNB.

Parameters:
  • norm (None) – Included to mirror the scikit-learn API.

  • use_idf (True) – Included to mirror the scikit-learn API.

  • smooth_idf (bool, default=True) – Prevents division by zero (see the scikit-learn documentation for details).

  • output_type (legate type, default=float64) – The numerical output type.

fit(X, y=None) TfidfTransformer#
fit_transform(X, y=None)#
transform(X)#
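
Assuming the transformer follows scikit-learn's convention, fit computes an inverse-document-frequency weight per term, idf(t) = ln((1 + n) / (1 + df(t))) + 1 when smooth_idf=True, and transform multiplies each count by that weight. A minimal sketch over dense count rows (illustrative only; the library operates on sparse COOStore data, and normalization is skipped to match norm=None):

```python
import math

def tfidf_transform(counts, smooth_idf=True):
    """counts: list of rows of term counts. Returns tf-idf weighted rows."""
    n = len(counts)                     # number of documents
    n_terms = len(counts[0])
    # document frequency: in how many rows each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    if smooth_idf:
        # smoothing acts as one extra "virtual" document containing every
        # term, so df is never zero and the division is always defined
        idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    else:
        idf = [math.log(n / d) + 1 for d in df]
    # transform: scale each count by its term's idf weight
    return [[c * w for c, w in zip(row, idf)] for row in counts]

weighted = tfidf_transform([[3, 0, 1], [2, 0, 0]])
```

A term that appears in every document (column 0 above) gets weight 1 and its counts pass through unchanged, while rarer terms are up-weighted.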