Non-Numeric Inputs
String Encoding


Textual data needs to be converted into numerical data before it can be used by ML models. For larger pieces of text, like sentences and paragraphs, we saw in the llm_embedding notebook how embeddings from pre-trained language models can be used. But what about smaller strings, like a country name? How do we use such strings as features in our ML models? This notebook covers the different encoding methods that TurboML provides for textual features.

    import turboml as tb
    import pandas as pd
    transactions_df = pd.read_csv("data/transactions.csv").reset_index()
    labels_df = pd.read_csv("data/labels.csv").reset_index()
    transactions = tb.PandasDataset(
        dataset_name="transactions_str_encoding",
        key_field="index",
        dataframe=transactions_df,
        upload=True,
    )
    labels = tb.PandasDataset(
        dataset_name="labels_str_encoding",
        key_field="index",
        dataframe=labels_df,
        upload=True,
    )
    numerical_fields = [
        "transactionAmount",
    ]
    textual_fields = ["transactionCurrencyCode"]
    features = transactions.get_input_fields(
        numerical_fields=numerical_fields, textual_fields=textual_fields
    )
    label = labels.get_label_field(label_field="is_fraud")

Notice that we're now extracting a textual feature called transactionCurrencyCode from our dataset. To ensure that the model ultimately works with numerical data, we can define preprocessors that transform the textual data into numerical data via some encoding method. By default, TurboML uses the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing (opens in a new tab)) to automatically hash and convert string data to numeric data. However, TurboML also supports other popular string encoding methods, including:

  • LabelPreProcessor
  • OneHotPreProcessor
  • TargetPreProcessor
  • FrequencyPreProcessor
  • BinaryPreProcessor
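
The default hashing trick mentioned above can be illustrated with a small standalone sketch (independent of TurboML's actual implementation): each string is hashed to one of a fixed number of numeric buckets, so no vocabulary needs to be known in advance.

```python
import hashlib

def hash_feature(value: str, n_buckets: int = 32) -> int:
    """Map a string to a stable bucket index in [0, n_buckets).

    A stable hash (md5 here) is used instead of Python's built-in
    hash(), which is salted per process.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# The same string always lands in the same bucket; collisions between
# different strings are possible but tolerable for many models.
codes = ["USD", "EUR", "INR", "USD"]
buckets = [hash_feature(c) for c in codes]
print(buckets)
```

The trade-off is that hashing needs no upfront cardinality, at the cost of occasional collisions, whereas the preprocessors listed above require the number of categories to be known in advance.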

We'll try an example using FrequencyPreProcessor. For these pre-processors, we need to specify the cardinality of each textual feature in advance, which can be computed as follows.

    htc_model = tb.HoeffdingTreeClassifier(n_classes=2)
    demo_classifier = tb.FrequencyPreProcessor(
        text_categories=[len(pd.unique(transactions_df[col])) for col in textual_fields],
        base_model=htc_model,
    )
    deployed_model = demo_classifier.deploy(
        "demo_classifier_htc", input=features, labels=label
    )
    outputs = deployed_model.get_outputs()
    sample_output = outputs[-1]
    sample_output
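
Conceptually, frequency encoding replaces each category with how often it occurs. A minimal pandas sketch of the idea (independent of TurboML's streaming implementation, which updates counts incrementally):

```python
import pandas as pd

# Toy currency-code column standing in for transactionCurrencyCode.
codes = pd.Series(["USD", "EUR", "USD", "INR", "USD", "EUR"])

# Relative frequency of each category over the data seen so far.
freq = codes.value_counts(normalize=True)

# Replace every category with its frequency, yielding a numeric feature:
# USD -> 3/6, EUR -> 2/6, INR -> 1/6.
encoded = codes.map(freq)
print(encoded.tolist())
```

In a streaming setting the frequencies evolve as new records arrive, so the same category can map to slightly different values over time.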