Pipeline Components
PreProcessors

Since our preprocessors must also work with streaming data, each preprocessor is defined in combination with a base model. Under the hood, the preprocessor's transformation is applied to each input, and the transformed input is passed on to the base model. This concept is similar to Pipelines in Scikit-Learn.
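
Conceptually, the combined object behaves like a two-stage pipeline: the preprocessor transforms each incoming sample before handing it to the base model. The sketch below is plain Python for intuition only; the method names learn_one/predict_one are illustrative, not TurboML's API.

# Conceptual sketch only -- not TurboML's implementation.
class PreprocessedModel:
    def __init__(self, preprocessor, base_model):
        self.preprocessor = preprocessor
        self.base_model = base_model

    def learn_one(self, x, y):
        # Update the preprocessor's running statistics, transform the sample, then train.
        self.preprocessor.update(x)
        self.base_model.learn_one(self.preprocessor.transform(x), y)

    def predict_one(self, x):
        return self.base_model.predict_one(self.preprocessor.transform(x))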

MinMaxPreProcessor

Works on the numerical fields of the input. Scales them to the range [0, 1] by maintaining a running minimum and maximum for each numerical feature.

Parameters

  • base_model(Model) → The model to call after transforming the input.

Example Usage

We can create an instance of the MinMaxPreProcessor model like this.

import turboml as tb
model = tb.MinMaxPreProcessor(base_model=tb.HoeffdingTreeClassifier(n_classes=2))
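
For intuition, the transformation amounts to keeping a per-feature running minimum and maximum and mapping each value to (x - min) / (max - min). A minimal sketch, illustrative only and not the TurboML implementation:

# Illustrative sketch of running min-max scaling (not the TurboML implementation).
class RunningMinMax:
    def __init__(self):
        self.mins, self.maxs = {}, {}

    def update(self, x):
        for name, value in x.items():
            self.mins[name] = min(self.mins.get(name, value), value)
            self.maxs[name] = max(self.maxs.get(name, value), value)

    def transform(self, x):
        scaled = {}
        for name, value in x.items():
            span = self.maxs[name] - self.mins[name]
            scaled[name] = (value - self.mins[name]) / span if span else 0.0
        return scaled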

NormalPreProcessor

Works on the numerical fields of the input. Scales the data to zero mean and unit variance by maintaining a running mean and variance for each numerical feature.

Parameters

  • base_model(Model) → The model to call after transforming the input.

Example Usage

We can create an instance of the NormalPreProcessor model like this.

import turboml as tb
model = tb.NormalPreProcessor(base_model=tb.HoeffdingTreeClassifier(n_classes=2))
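
A running mean and variance can be maintained online, for example with Welford's algorithm. A minimal single-feature sketch, illustrative only and not the TurboML implementation:

# Illustrative sketch of running standardization for one feature (not the TurboML implementation).
class RunningStandardizer:
    def __init__(self):
        self.count, self.mean, self.sum_sq_diff = 0, 0.0, 0.0

    def update(self, value):
        # Welford's online update of the running mean and variance.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.sum_sq_diff += delta * (value - self.mean)

    def transform(self, value):
        variance = self.sum_sq_diff / self.count if self.count else 1.0
        std = variance ** 0.5
        return (value - self.mean) / std if std else 0.0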

RobustPreProcessor

Works on the numerical fields of the input. Scales the data using statistics that are robust to outliers, by subtracting the running median and dividing by the running interquartile range.

Parameters

  • base_model(Model) → The model to call after transforming the input.

Example Usage

We can create an instance of the RobustPreProcessor model like this.

import turboml as tb
model = tb.RobustPreProcessor(base_model=tb.HoeffdingTreeClassifier(n_classes=2))
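
For intuition, robust scaling subtracts the median and divides by the interquartile range (IQR). The sketch below stores all seen values for simplicity; a true streaming implementation would use an online quantile estimator. Illustrative only, not the TurboML implementation:

# Illustrative sketch of robust scaling for one feature (not the TurboML implementation).
import statistics

class RobustScalerSketch:
    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def transform(self, value):
        q1, median, q3 = statistics.quantiles(self.values, n=4)  # quartile cut points
        iqr = q3 - q1
        return (value - median) / iqr if iqr else 0.0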

LabelPreProcessor

Works on textual fields of the input. For each textual feature, we need to know in advance the cardinality of that feature. Converts the strings into ordinal integers. The resulting numbers are appended to the numerical features.

Parameters

  • base_model(Model) → The model to call after transforming the input.

  • text_categories(List[int]) → List of cardinalities for each textual feature.

Example Usage

We can create an instance of the LabelPreProcessor model like this.

import turboml as tb
model = tb.LabelPreProcessor(text_categories=[5, 10], base_model=tb.HoeffdingTreeClassifier(n_classes=2))
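
For intuition, ordinal (label) encoding simply assigns each distinct string an integer code. A minimal sketch, illustrative only and not the TurboML implementation:

# Illustrative sketch of ordinal (label) encoding (not the TurboML implementation).
class LabelEncoderSketch:
    def __init__(self):
        self.codes = {}

    def transform(self, category):
        # Assign the next unused integer the first time a category is seen.
        self.codes.setdefault(category, len(self.codes))
        return self.codes[category]

enc = LabelEncoderSketch()
print([enc.transform(c) for c in ["red", "blue", "red", "green"]])  # [0, 1, 0, 2]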

OneHotPreProcessor

Works on textual fields of the input. For each textual feature, we need to know in advance the cardinality of that feature. Converts the strings into one-hot encoding. The resulting numbers are appended to the numerical features.

Parameters

  • base_model(Model) → The model to call after transforming the input.

  • text_categories(List[int]) → List of cardinalities for each textual feature.

Example Usage

We can create an instance of the OneHotPreProcessor model like this.

import turboml as tb
model = tb.OneHotPreProcessor(text_categories=[5, 10], base_model=tb.HoeffdingTreeClassifier(n_classes=2))
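
The known cardinality fixes the length of the one-hot vector that gets appended to the numerical features. A minimal sketch of the encoding itself, illustrative only:

# Illustrative sketch of one-hot encoding (not the TurboML implementation).
def one_hot(code, cardinality):
    vector = [0] * cardinality
    vector[code] = 1
    return vector

print(one_hot(2, 5))  # [0, 0, 1, 0, 0]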

BinaryPreProcessor

Works on textual fields of the input. For each textual feature, we need to know in advance the cardinality of that feature. Converts the strings into binary encoding. The resulting numbers are appended to the numerical features.

Parameters

  • base_model(Model) → The model to call after transforming the input.

  • text_categories(List[int]) → List of cardinalities for each textual feature.

Example Usage

We can create an instance of the BinaryPreProcessor model like this.

import turboml as tb
model = tb.BinaryPreProcessor(text_categories=[5, 10], base_model=tb.HoeffdingTreeClassifier(n_classes=2))
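
Binary encoding writes each category's integer code in base 2, so a feature of cardinality C adds only about log2(C) columns instead of C. A minimal sketch, illustrative only and not the TurboML implementation:

# Illustrative sketch of binary encoding (not the TurboML implementation).
def binary_encode(code, cardinality):
    width = max(1, (cardinality - 1).bit_length())  # bits needed to represent any code
    return [(code >> bit) & 1 for bit in reversed(range(width))]

print(binary_encode(5, 10))  # [0, 1, 0, 1]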

FrequencyPreProcessor

Works on textual fields of the input. For each textual feature, we need to know in advance the cardinality of that feature. Converts each string into its frequency among the values seen so far. The resulting numbers are appended to the numerical features.

Parameters

  • base_model(Model) → The model to call after transforming the input.

  • text_categories(List[int]) → List of cardinalities for each textual feature.

Example Usage

We can create an instance of the FrequencyPreProcessor model like this.

import turboml as tb
model = tb.FrequencyPreProcessor(text_categories=[5, 10], base_model=tb.HoeffdingTreeClassifier(n_classes=2))
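
For intuition, each category is mapped to the fraction of samples in which it has appeared so far. A minimal sketch, illustrative only and not the TurboML implementation:

# Illustrative sketch of frequency encoding (not the TurboML implementation).
from collections import Counter

class FrequencyEncoderSketch:
    def __init__(self):
        self.counts, self.total = Counter(), 0

    def update(self, category):
        self.counts[category] += 1
        self.total += 1

    def transform(self, category):
        return self.counts[category] / self.total if self.total else 0.0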

TargetPreProcessor

Works on textual fields of the input. For each textual feature, we need to know in advance the cardinality of that feature. Converts each string into the average target value observed for it so far. The resulting numbers are appended to the numerical features.

Parameters

  • base_model(Model) → The model to call after transforming the input.

  • text_categories(List[int]) → List of cardinalities for each textual feature.

Example Usage

We can create an instance of the TargetPreProcessor model like this.

import turboml as tb
model = tb.TargetPreProcessor(text_categories=[5, 10], base_model=tb.HoeffdingTreeClassifier(n_classes=2))
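
For intuition, each category is mapped to the running mean of the target values observed alongside it. A minimal sketch, illustrative only and not the TurboML implementation:

# Illustrative sketch of target encoding (not the TurboML implementation).
from collections import defaultdict

class TargetEncoderSketch:
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, category, target):
        self.sums[category] += target
        self.counts[category] += 1

    def transform(self, category):
        seen = self.counts[category]
        return self.sums[category] / seen if seen else 0.0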

LlamaCppPreProcessor

Works on textual fields of the input. Converts the text features into their embeddings obtained from a pre-trained language model. The resulting embeddings are appended to the numerical features.

Parameters

  • base_model(Model) → The model to call after transforming the input.

  • gguf_model_id → A model id issued by tb.acquire_hf_model_as_gguf.

  • max_tokens_per_input(int) → The maximum number of tokens to consider in the input text. Tokens beyond this limit will be truncated. Default is 512.

Example Usage

We can create an instance of the LlamaCppPreProcessor model like this.

import turboml as tb
model = tb.LlamaCppPreProcessor(gguf_model_id=tb.acquire_hf_model_as_gguf("BAAI/bge-small-en-v1.5", "f16"), max_tokens_per_input=512, base_model=tb.HoeffdingTreeClassifier(n_classes=2))
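
Conceptually, each text field is truncated to the token limit, embedded, and the embedding vector is appended to the numerical features before the base model sees the sample. The sketch below uses a hypothetical embed() function as a stand-in for the acquired GGUF model; it is illustrative only, not the TurboML implementation.

# Conceptual sketch only; embed() is a hypothetical stand-in for the acquired GGUF embedding model.
def preprocess(numerical_features, textual_features, embed, max_tokens=512):
    combined = list(numerical_features)
    for text in textual_features:
        tokens = text.split()[:max_tokens]        # crude truncation to the token limit
        combined.extend(embed(" ".join(tokens)))  # append the embedding vector
    return combined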