Pre-Deployment ML
Performance Improvements

In this notebook, we'll cover some examples of how model performance can be improved. The techniques covered are:

  • Sampling for imbalanced learning
  • Bagging
  • Boosting
  • Continuous model selection using bandits

    import turboml as tb
    import pandas as pd
    import numpy as np
    from sklearn.metrics import roc_auc_score
    transactions_df = pd.read_csv("data/transactions.csv").reset_index()
    labels_df = pd.read_csv("data/labels.csv").reset_index()
    transactions = tb.PandasDataset(
        dataset_name="transactions_performance_improve",
        key_field="index",
        dataframe=transactions_df,
        upload=True,
    )
    labels = tb.PandasDataset(
        dataset_name="labels_performance_improve",
        key_field="index",
        dataframe=labels_df,
        upload=True,
    )
    numerical_fields = [
        "transactionAmount",
        "localHour",
    ]
    categorical_fields = [
        "digitalItemCount",
        "physicalItemCount",
        "isProxyIP",
    ]
    features = transactions.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    label = labels.get_label_field(label_field="is_fraud")

Now that we have our setup ready, let's first see the performance of a base HoeffdingTreeClassifier model.

    htc_model = tb.HoeffdingTreeClassifier(n_classes=2)
    deployed_model = htc_model.deploy("htc_classifier", input=features, labels=label)
    outputs = deployed_model.get_outputs()
    len(outputs)
    true_labels = labels_df["is_fraud"].values
    real_outputs = np.array([x["record"].predicted_class for x in outputs])
    roc_auc_score(true_labels, real_outputs)
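
Note that `roc_auc_score` is being fed hard 0/1 class predictions rather than scores, so the value it reports reduces to the balanced accuracy, (TPR + TNR) / 2, of those predictions. A standalone sketch with synthetic data (no TurboML required; the names below are illustrative) shows the effect:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
true_labels = rng.integers(0, 2, size=1000)

# A hypothetical classifier that flips the true label 20% of the time,
# i.e. roughly 80% accurate on both classes.
flip = rng.random(1000) < 0.2
predicted = np.where(flip, 1 - true_labels, true_labels)

# On hard 0/1 predictions, roc_auc_score collapses to (TPR + TNR) / 2,
# so the result lands near 0.8 rather than tracing a full ROC curve.
print(roc_auc_score(true_labels, predicted))
```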

Not bad. But can we improve it further? We haven't yet used the fact that the dataset is highly skewed.

Sampling for Imbalanced Learning

    sampler_model = tb.RandomSampler(
        n_classes=2, desired_dist=[0.5, 0.5], sampling_method="under", base_model=htc_model
    )
    deployed_model = sampler_model.deploy(
        "undersampler_model", input=features, labels=label
    )
    outputs = deployed_model.get_outputs()
    len(outputs)
    real_outputs = np.array([x["record"].predicted_class for x in outputs])
    roc_auc_score(true_labels, real_outputs)
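
The idea behind undersampling is to drop majority-class examples until the model trains on a roughly balanced distribution. Here is a minimal batch sketch in plain NumPy (`undersample` is an illustrative helper, not TurboML's streaming implementation, which resamples on the fly):

```python
import numpy as np

def undersample(X, y, rng=None):
    """Randomly undersample every class down to the size of the rarest one,
    giving an (approximately) 50/50 distribution in the binary case."""
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False) for c in classes]
    )
    return X[keep], y[keep]

# Skewed toy data: 90 negatives, 10 positives.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = undersample(X, y)
print(np.bincount(y_bal))  # [10 10]
```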

Bagging

    lbc_model = tb.LeveragingBaggingClassifier(n_classes=2, base_model=htc_model)
    deployed_model = lbc_model.deploy("lbc_classifier", input=features, labels=label)
    outputs = deployed_model.get_outputs()
    len(outputs)
    real_outputs = np.array([x["record"].predicted_class for x in outputs])
    roc_auc_score(true_labels, real_outputs)
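
In online bagging, each ensemble member trains on every incoming example a Poisson-distributed number of times; leveraging bagging (Bifet et al.) raises the Poisson rate from 1 to 6, so members see heavier and more diverse resampling of the stream. A sketch of just the weight draw (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 10

# Weight of one arriving example for each of the 10 ensemble members:
# Poisson(6) for leveraging bagging, vs. Poisson(1) in plain online bagging.
weights = rng.poisson(lam=6, size=n_models)
print(weights)  # member i would update on this example weights[i] times
```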

Boosting

    abc_model = tb.AdaBoostClassifier(n_classes=2, base_model=htc_model)
    deployed_model = abc_model.deploy("abc_classifier", input=features, labels=label)
    outputs = deployed_model.get_outputs()
    len(outputs)
    real_outputs = np.array([x["record"].predicted_class for x in outputs])
    roc_auc_score(true_labels, real_outputs)
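
For intuition, here is a batch analogue using scikit-learn (not TurboML): boosting depth-1 decision trees, the batch cousins of Hoeffding trees, typically lifts AUC well above a single tree on the same data. The dataset and parameters below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary dataset (roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One decision stump vs. 50 boosted stumps (AdaBoost's default base
# estimator is a depth-1 decision tree).
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
boosted = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

auc_stump = roc_auc_score(y_te, stump.predict_proba(X_te)[:, 1])
auc_boosted = roc_auc_score(y_te, boosted.predict_proba(X_te)[:, 1])
print(auc_stump, auc_boosted)
```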

Continuous Model Selection with Bandits

    bandit_model = tb.BanditModelSelection(base_models=[htc_model, lbc_model, abc_model])
    deployed_model = bandit_model.deploy(
        "demo_classifier_bandit", input=features, labels=label
    )
    outputs = deployed_model.get_outputs()
    len(outputs)
    real_outputs = np.array([x["record"].predicted_class for x in outputs])
    roc_auc_score(true_labels, real_outputs)
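
Conceptually, the bandit treats each candidate model as an arm and steers traffic toward whichever arm is performing best online. A toy epsilon-greedy sketch of that selection loop (pure Python; the arm names and accuracies are made up, and TurboML's actual policy may differ):

```python
import random

random.seed(0)

arms = ["htc", "lbc", "abc"]
true_acc = {"htc": 0.80, "lbc": 0.90, "abc": 0.85}  # hypothetical accuracies
counts = {a: 0 for a in arms}
value = {a: 0.0 for a in arms}  # running accuracy estimate per arm
eps = 0.1

for _ in range(5000):
    if random.random() < eps:
        arm = random.choice(arms)       # explore a random arm
    else:
        arm = max(arms, key=value.get)  # exploit the best arm so far
    reward = 1.0 if random.random() < true_acc[arm] else 0.0
    counts[arm] += 1
    value[arm] += (reward - value[arm]) / counts[arm]  # incremental mean

best = max(arms, key=value.get)
print(best, counts)
```

Over enough examples, the highest-accuracy arm accumulates most of the traffic while the epsilon fraction keeps probing the alternatives.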