Feature Extraction#
Text Feature Extraction#
The following two vectorizers are designed to be used together in a pipeline, for example:
from sklearn.pipeline import Pipeline
from legate_raft.sklearn.feature_extraction.text import (
    HashingVectorizer, TfidfTransformer)
from legate_raft.sklearn.naive_bayes import MultinomialNB

bayesTfIDF = Pipeline(
    [
        ("hv", HashingVectorizer(n_features=2**17)),
        ("tf-idf", TfidfTransformer()),
        ("mnb", MultinomialNB()),
    ]
)
The two vectorizer classes are documented below:
- class legate_raft.sklearn.feature_extraction.text.HashingVectorizer(n_features: int, *, seed: int = 42)#
Convert a collection of texts to a matrix of token occurrences; this mirrors
sklearn.feature_extraction.text.HashingVectorizer.
Meant to be used together with the TfidfTransformer and MultinomialNB.
- Parameters:
n_features (int) – Number of features in the output.
seed (int, default=42) – Seed for the hash function.
- fit(X=None) HashingVectorizer #
- fit_transform(column: LogicalColumn) COOStore #
- transform(column: LogicalColumn) COOStore #
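As a rough illustration of the hashing trick this class is based on (a sketch, not legate_raft's actual implementation), each token is hashed into one of n_features buckets and that bucket's count is incremented. The `hash_vectorize` helper below is hypothetical and uses Python's built-in `hash()` for brevity; the real vectorizer uses a seeded, deterministic hash and returns a sparse COOStore rather than a dense array.

```python
import numpy as np

def hash_vectorize(docs, n_features=16):
    # One row per document, one column per hash bucket.
    counts = np.zeros((len(docs), n_features), dtype=np.int64)
    for row, doc in enumerate(docs):
        for token in doc.lower().split():
            # The hashing trick: map the token to a fixed-size index space.
            counts[row, hash(token) % n_features] += 1
    return counts

X = hash_vectorize(["the cat sat", "the cat sat on the mat"])
print(X.sum(axis=1))  # total token counts per document: [3 6]
```

Because the output dimensionality is fixed up front, no vocabulary has to be built or stored, which is what makes this approach attractive for distributed execution.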
- class legate_raft.sklearn.feature_extraction.text.TfidfTransformer(*, norm=None, use_idf=True, smooth_idf=True, output_type=None)#
Transform a count matrix to a normalized tf-idf representation; this mirrors
sklearn.feature_extraction.text.TfidfTransformer.
Meant to be used together with the HashingVectorizer and MultinomialNB.
- Parameters:
norm (None) – Included to mirror scikit-learn
use_idf (True) – Included to mirror scikit-learn
smooth_idf (bool, default=True) – Prevent division by zero (see scikit-learn documentation for details).
output_type (legate type, default=float64) – The numerical output type. Defaults to float64.
- fit(X, y=None) TfidfTransformer #
- fit_transform(X, y=None)#
- transform(X)#
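For intuition, the smoothed idf weighting that scikit-learn documents for TfidfTransformer (and which this class mirrors) is idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, applied multiplicatively to each raw count. The sketch below assumes smooth_idf=True and norm=None (this class's defaults); `tfidf` is an illustrative helper, not part of the legate_raft API.

```python
import numpy as np

def tfidf(counts):
    # counts: dense (n_docs, n_terms) matrix of raw term counts.
    counts = np.asarray(counts, dtype=np.float64)
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)       # document frequency per term
    idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf: never divides by zero
    return counts * idf                         # no normalization (norm=None)

X = np.array([[2, 0],
              [1, 1]])
print(tifdf := tfidf(X))
```

A term that appears in every document gets idf = ln(1) + 1 = 1 (its counts pass through unchanged), while rarer terms are weighted up; the +1 terms inside the logarithm are the "smoothing" that smooth_idf=True refers to.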