Feature Extraction#

Text Feature Extraction#

The following two vectorizers are designed to be used together in a pipeline, for example:

from sklearn.pipeline import Pipeline

from legate_raft.sklearn.feature_extraction.text import (
    HashingVectorizer, TfidfTransformer)
from legate_raft.sklearn.naive_bayes import MultinomialNB

bayesTfIDF = Pipeline(
    [
        ("hv", HashingVectorizer(n_features=2**17)),
        ("tf-idf", TfidfTransformer()),
        ("mnb", MultinomialNB()),
    ]
)

The two classes are documented below:

class legate_raft.sklearn.feature_extraction.text.HashingVectorizer(n_features: int, *, seed: int = 42)#

Convert a collection of texts to a matrix of token occurrences. This mirrors sklearn.feature_extraction.text.HashingVectorizer.

Meant to be used together with the TfidfTransformer and MultinomialNB.

Parameters:
  • n_features (int) – Number of features in the output.

  • seed (int, default=42) – Seed used for hashing.

fit(X=None) HashingVectorizer#
fit_transform(column: LogicalColumn) COOStore#
transform(column: LogicalColumn) COOStore#
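
The hashing trick that transform applies can be sketched in plain Python. The whitespace tokenizer and Python's built-in hash below are illustrative stand-ins for the library's actual tokenization and seeded hash function, and the dict-of-counts output only approximates the sparse COOStore result:

```python
def hashing_vectorize(docs, n_features=2**4, seed=42):
    """Map each document to a sparse {column: count} dict via the hashing trick."""
    rows = []
    for doc in docs:
        counts = {}
        for token in doc.lower().split():
            # hash() stands in for the library's seeded hash; each token is
            # mapped to one of n_features columns, so no vocabulary is stored
            col = hash((seed, token)) % n_features
            counts[col] = counts.get(col, 0) + 1
        rows.append(counts)
    return rows

rows = hashing_vectorize(["to be or not to be"], n_features=16)
```

Because the mapping is a fixed hash rather than a learned vocabulary, fit has nothing to learn, which is why it accepts X=None and simply returns the vectorizer.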
class legate_raft.sklearn.feature_extraction.text.TfidfTransformer(*, norm=None, use_idf=True, smooth_idf=True, output_type=None)#

Transform a count matrix to a normalized tf-idf representation. This mirrors sklearn.feature_extraction.text.TfidfTransformer.

Meant to be used together with the HashingVectorizer and MultinomialNB.

Parameters:
  • norm (None) – Included to mirror the scikit-learn API.

  • use_idf (True) – Included to mirror the scikit-learn API.

  • smooth_idf (bool, default=True) – Prevents division by zero (see the scikit-learn documentation for details).

  • output_type (legate type, default=float64) – The numerical output type.

fit(X, y=None) TfidfTransformer#
fit_transform(X, y=None)#
transform(X)#
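
Assuming the transformer follows scikit-learn's convention, fit computes an inverse-document-frequency weight per term, idf(t) = ln((1 + n) / (1 + df(t))) + 1 when smooth_idf=True, and transform multiplies each count by that weight. A minimal sketch over dense count rows (illustrative only; the library operates on sparse COOStore data, and normalization is skipped to match norm=None):

```python
import math

def tfidf_transform(counts, smooth_idf=True):
    """counts: list of rows of term counts. Returns tf-idf weighted rows."""
    n = len(counts)                     # number of documents
    n_terms = len(counts[0])
    # document frequency: in how many rows each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    if smooth_idf:
        # smoothing acts as one extra "virtual" document containing every
        # term, so df is never zero and the division is always defined
        idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    else:
        idf = [math.log(n / d) + 1 for d in df]
    # transform: scale each count by its term's idf weight
    return [[c * w for c, w in zip(row, idf)] for row in counts]

weighted = tfidf_transform([[3, 0, 1], [2, 0, 0]])
```

A term that appears in every document (column 0 above) gets weight 1 and its counts pass through unchanged, while rarer terms are up-weighted.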