Batch APIs

TurboML operates primarily in streaming mode, continuously updating its components as fresh data arrives. However, TurboML also supports the good ol' fashioned batch APIs. We've already seen examples of this for feature engineering in the quickstart notebook. In this notebook, we'll focus primarily on batch APIs for ML modelling.

To make this more interesting, we'll show how we can still have incremental training on batch data.

import turboml as tb
import pandas as pd
from sklearn import metrics

Dataset

We'll use our standard FraudDetection dataset again, but this time without pushing it to the platform. Interfaces like feature engineering and feature selection work in exactly the same way, just without being linked to a platform-managed dataset.

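# Load the features and labels locally; nothing is pushed to the platform.
transactions = tb.datasets.FraudDetectionDatasetFeatures()
labels = tb.datasets.FraudDetectionDatasetLabels()

# First 100K rows: used for the initial batch training run.
transactions_p1 = transactions[:100000]
labels_p1 = labels[:100000]

# Remaining rows: used later for incremental training and evaluation.
transactions_p2 = transactions[100000:]
labels_p2 = labels[100000:]
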
numerical_fields = [
    "transactionAmount",
    "localHour",
]
categorical_fields = [
    "digitalItemCount",
    "physicalItemCount",
    "isProxyIP",
]
features = transactions_p1.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_p1.get_model_labels(label_field="is_fraud")

Training

With the features and label defined, we can train a model in batch mode using the learn method.

model = tb.HoeffdingTreeClassifier(n_classes=2)
model_trained_100K = model.learn(features, label)

We've trained a model on the first 100K rows. Now, to update this model on the remaining data, we can create another batch dataset and call the learn method. Note that this time, learn is called on a trained model.

features = transactions_p2.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_p2.get_model_labels(label_field="is_fraud")
model_fully_trained = model_trained_100K.learn(features, label)
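
To make the pattern concrete, here is a minimal pure-Python sketch of incremental training over two batches. ToyCentroidModel is a hypothetical stand-in written for this illustration; it is not a TurboML class and says nothing about how HoeffdingTreeClassifier works internally. It only mirrors the calling convention used above, where learn returns a trained model and can be called again on that result. That the original model is left unchanged is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyCentroidModel:
    """Hypothetical incremental learner (illustration only, not a TurboML class).

    Keeps a running count and feature sum per class, so each call to
    learn() only needs the new batch, never the earlier ones.
    """

    counts: dict = field(default_factory=dict)
    sums: dict = field(default_factory=dict)

    def learn(self, rows, labels) -> "ToyCentroidModel":
        # Copy the state so the model we were called on stays untouched.
        counts = dict(self.counts)
        sums = {k: list(v) for k, v in self.sums.items()}
        for x, y in zip(rows, labels):
            counts[y] = counts.get(y, 0) + 1
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
        return ToyCentroidModel(counts, sums)

    def predict(self, rows):
        # Assign each row to the class whose running mean is closest.
        centroids = {
            y: [s / self.counts[y] for s in self.sums[y]] for y in self.counts
        }

        def sq_dist(x, c):
            return sum((a - b) ** 2 for a, b in zip(x, c))

        return [min(centroids, key=lambda y: sq_dist(x, centroids[y])) for x in rows]


batch_1 = ([[1.0, 1.0], [9.0, 9.0]], [0, 1])
batch_2 = ([[2.0, 0.0], [8.0, 10.0]], [0, 1])

untrained = ToyCentroidModel()
after_first = untrained.learn(*batch_1)    # trained on the first batch only
after_both = after_first.learn(*batch_2)   # updated with the second batch
one_shot = untrained.learn(batch_1[0] + batch_2[0], batch_1[1] + batch_2[1])
```

For this toy model, learning the two batches in sequence ends in the same state as learning them in one pass (after_both == one_shot). Real incremental learners do not necessarily give that guarantee.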

Inference

We've seen batch inference on deployed models in the quickstart notebook. We can also perform batch inference on these locally trained models using the predict method.

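# `features` still refers to the second partition (transactions_p2).
# The 100K model has never seen these rows, so this is a held-out score.
outputs = model_trained_100K.predict(features)
print(metrics.roc_auc_score(labels_p2.df["is_fraud"], outputs["score"]))
# The fully trained model has already learned from these rows, so this score is in-sample.
outputs = model_fully_trained.predict(features)
print(metrics.roc_auc_score(labels_p2.df["is_fraud"], outputs["score"]))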
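
As a reminder of what the printed numbers mean, the sketch below computes ROC AUC by hand for binary labels: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The function roc_auc and the toy data are written for this illustration only. For binary labels it should agree with metrics.roc_auc_score, but it is quadratic in the number of rows and not meant for real datasets.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Runs in O(n_pos * n_neg), for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to scores that carry no ranking information, and 1.0 to a perfect ranking of fraud above non-fraud.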

Deployment

So far, we've only trained a model. We haven't deployed it yet. Deploying a batch-trained model works exactly like any other model deployment, except that we set predict_only=True. This means the deployed model won't be updated automatically.

transactions = transactions.to_online(id="transactions10", load_if_exists=True)
labels = labels.to_online(id="transaction_labels", load_if_exists=True)
features = transactions.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels.get_model_labels(label_field="is_fraud")
deployed_model = model_fully_trained.deploy(
    name="predict_only_model", input=features, labels=label, predict_only=True
)
outputs = deployed_model.get_outputs()
outputs[-1]
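
The difference between a predict-only deployment and one that keeps learning can be sketched in plain Python. Everything below is hypothetical and written only for this illustration: RunningFraudRate is a toy model, and serve is a toy loop, not TurboML's serving logic. The assumption is simply that a normal deployment learns from each labelled record after scoring it, while a predict-only deployment only scores.

```python
class RunningFraudRate:
    """Hypothetical toy model: scores every record with the fraud rate seen so far."""

    def __init__(self, n=0, positives=0):
        self.n = n
        self.positives = positives

    def predict_one(self):
        return self.positives / self.n if self.n else 0.0

    def learn_one(self, label):
        self.n += 1
        self.positives += label


def serve(model, label_stream, predict_only):
    """Score each incoming record; update the model only if predict_only is False."""
    outputs = []
    for label in label_stream:
        outputs.append(model.predict_one())
        if not predict_only:
            model.learn_one(label)
    return outputs


stream = [1, 1]  # two fraudulent records arrive after deployment

# Both models start from the same batch-trained state: 1 fraud in 4 rows.
frozen_scores = serve(RunningFraudRate(n=4, positives=1), stream, predict_only=True)
updating_scores = serve(RunningFraudRate(n=4, positives=1), stream, predict_only=False)
```

Here frozen_scores stays at the batch-trained value for every record, while updating_scores drifts as labels arrive.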

Next Steps

In this notebook, we discussed how to train models in a batch paradigm and deploy them. In a separate notebook, we'll cover two different strategies to update models: (i) starting from a batch-trained model and using continual learning, and (ii) training models incrementally in a batch paradigm and updating the deployment with newer versions.