Non-Numeric Inputs
String Encoding


Textual data needs to be converted into numerical data before it can be used by ML models. For larger pieces of text, like sentences and paragraphs, we saw in the llm_embedding notebook how embeddings from pre-trained language models can be used. But what about smaller strings, like a country name? How do we use such strings as features in our ML models? This notebook covers the different encoding methods that TurboML provides for textual features.

    import turboml as tb
    import pandas as pd
    transactions_df = pd.read_csv("data/transactions.csv").reset_index()
    labels_df = pd.read_csv("data/labels.csv").reset_index()
    transactions = tb.PandasDataset(
        dataset_name="transactions_str_encoding",
        key_field="index",
        dataframe=transactions_df,
        upload=True,
    )
    labels = tb.PandasDataset(
        dataset_name="labels_str_encoding",
        key_field="index",
        dataframe=labels_df,
        upload=True,
    )
    numerical_fields = [
        "transactionAmount",
    ]
    textual_fields = ["transactionCurrencyCode"]
    features = transactions.get_input_fields(
        numerical_fields=numerical_fields, textual_fields=textual_fields
    )
    label = labels.get_label_field(label_field="is_fraud")

Notice that we're now extracting a textual feature called transactionCurrencyCode from our dataset. To ensure that the model ultimately works with numerical data, we can define preprocessors that transform the textual data into numerical data via some encoding method. By default, TurboML uses the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing (opens in a new tab)) to automatically hash and convert string data to numeric data. However, TurboML also supports other popular string encoding methods, including:

  • LabelPreProcessor
  • OneHotPreProcessor
  • TargetPreProcessor
  • FrequencyPreProcessor
  • BinaryPreProcessor
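
The default hashing trick mentioned above can be illustrated with a small standalone sketch (independent of TurboML's actual implementation): each string is hashed to one of a fixed number of numeric buckets, so no vocabulary needs to be known in advance.

```python
import hashlib

def hash_feature(value: str, n_buckets: int = 32) -> int:
    """Map a string to a stable bucket index in [0, n_buckets).

    A stable hash (md5 here) is used instead of Python's built-in
    hash(), which is salted per process.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# The same string always lands in the same bucket; collisions between
# different strings are possible but tolerable for many models.
codes = ["USD", "EUR", "INR", "USD"]
buckets = [hash_feature(c) for c in codes]
print(buckets)
```

The trade-off is that hashing needs no upfront cardinality, at the cost of occasional collisions, whereas the preprocessors listed above require the number of categories to be known in advance.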

We'll try an example using FrequencyPreProcessor. For these pre-processors, we need to specify the cardinality of each textual feature in advance, which can be computed as follows.

    htc_model = tb.HoeffdingTreeClassifier(n_classes=2)
    demo_classifier = tb.FrequencyPreProcessor(
        text_categories=[len(pd.unique(transactions_df[col])) for col in textual_fields],
        base_model=htc_model,
    )
    deployed_model = demo_classifier.deploy(
        "demo_classifier_htc", input=features, labels=label
    )
    outputs = deployed_model.get_outputs()
    sample_output = outputs[-1]
    sample_output
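
Conceptually, frequency encoding replaces each category with how often it occurs. A minimal pandas sketch of the idea (independent of TurboML's streaming implementation, which updates counts incrementally):

```python
import pandas as pd

# Toy currency-code column standing in for transactionCurrencyCode.
codes = pd.Series(["USD", "EUR", "USD", "INR", "USD", "EUR"])

# Relative frequency of each category over the data seen so far.
freq = codes.value_counts(normalize=True)

# Replace every category with its frequency, yielding a numeric feature:
# USD -> 3/6, EUR -> 2/6, INR -> 1/6.
encoded = codes.map(freq)
print(encoded.tolist())
```

In a streaming setting the frequencies evolve as new records arrive, so the same category can map to slightly different values over time.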