Batch APIs
TurboML primarily operates in streaming mode, continuously updating its components with fresh data. However, TurboML also supports good old-fashioned batch APIs. We've already seen examples of this for feature engineering in the quickstart notebook. In this notebook, we'll focus primarily on batch APIs for ML modelling.
To make this more interesting, we'll show how we can still have incremental training on batch data.
import turboml as tb
import pandas as pd
from sklearn import metrics
Dataset
We'll use our standard FraudDetection dataset again, but this time without pushing it to the platform. Interfaces like feature engineering and feature selection work in exactly the same way, just without being linked to a platform-managed dataset.
transactions = tb.datasets.FraudDetectionDatasetFeatures()
labels = tb.datasets.FraudDetectionDatasetLabels()
transactions_p1 = transactions[:100000]
labels_p1 = labels[:100000]
transactions_p2 = transactions[100000:]
labels_p2 = labels[100000:]
numerical_fields = [
    "transactionAmount",
    "localHour",
]
categorical_fields = [
    "digitalItemCount",
    "physicalItemCount",
    "isProxyIP",
]
features = transactions_p1.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_p1.get_model_labels(label_field="is_fraud")
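The two-way split above generalizes to any number of chunks. As a minimal sketch, the hypothetical helper `batch_slices` below (not part of TurboML) yields consecutive slice objects that cover a dataset of a given length. It is demonstrated on a plain Python list so that it runs without the platform.

```python
def batch_slices(n_rows, batch_size):
    """Yield slice objects covering range(n_rows) in consecutive chunks.

    Hypothetical helper for illustration; not part of the TurboML API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        # The last chunk may be shorter than batch_size.
        yield slice(start, min(start + batch_size, n_rows))


# Demonstration on a plain list standing in for a dataset.
rows = list(range(10))
chunks = [rows[s] for s in batch_slices(len(rows), 4)]
print(chunks)
```

Assuming TurboML's local datasets accept slice objects the same way they accept the `transactions[:100000]` syntax used above, each slice could be used to build one chunk of features and labels for a separate training step.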
Training
With the features and label defined, we can train a model in batch mode using the learn method.
model = tb.HoeffdingTreeClassifier(n_classes=2)
model_trained_100K = model.learn(features, label)
We've trained a model on the first 100K rows. Now, to update this model on the remaining data, we can create another batch dataset and call the learn method. Note that this time, learn is called on a trained model.
features = transactions_p2.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_p2.get_model_labels(label_field="is_fraud")
model_fully_trained = model_trained_100K.learn(features, label)
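Both model_trained_100K and model_fully_trained are kept and evaluated separately in the next section, which suggests that learn returns an updated model rather than modifying the one it was called on. The sketch below illustrates that pattern with a hypothetical toy model, `ToyCountModel`, which only tracks label frequencies. It is not a TurboML class and says nothing about how HoeffdingTreeClassifier works internally; it only shows that training in two increments can reach the same state as training once on all the data.

```python
from collections import Counter


class ToyCountModel:
    """Hypothetical toy model that tracks label frequencies (not a TurboML class)."""

    def __init__(self, counts=None):
        self.counts = Counter(counts or {})

    def learn(self, labels):
        # Return a new model; the model learn() was called on stays unchanged.
        updated = Counter(self.counts)
        updated.update(labels)
        return ToyCountModel(updated)

    def predict_score(self):
        # Estimated probability of the positive class (label 1).
        total = sum(self.counts.values())
        return self.counts[1] / total if total else 0.0


toy_labels_p1 = [0, 0, 1, 0]
toy_labels_p2 = [1, 1, 0, 0]

toy_model = ToyCountModel()
toy_trained_p1 = toy_model.learn(toy_labels_p1)         # first increment
toy_fully_trained = toy_trained_p1.learn(toy_labels_p2)  # second increment
toy_one_shot = ToyCountModel().learn(toy_labels_p1 + toy_labels_p2)

print(toy_trained_p1.predict_score(), toy_fully_trained.predict_score())
```

For this toy model the incremental and one-shot results are identical. For real incremental learners the two may differ, so the equality here should not be read as a guarantee.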
Inference
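We've seen batch inference on deployed models in the quickstart notebook. We can also perform batch inference on locally trained models using the predict method. Both models below are scored on the second partition of the data. Note that the fully trained model has already seen this partition during training, while the model trained on the first 100K rows has not.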
outputs = model_trained_100K.predict(features)
print(metrics.roc_auc_score(labels_p2.df["is_fraud"], outputs["score"]))
outputs = model_fully_trained.predict(features)
print(metrics.roc_auc_score(labels_p2.df["is_fraud"], outputs["score"]))
Deployment
So far, we've only trained a model. We haven't deployed it yet. Deploying a batch-trained model is exactly like any other model deployment, except that we set the predict_only option to True. This means the model won't be updated automatically.
transactions = transactions.to_online(id="transactions10", load_if_exists=True)
labels = labels.to_online(id="transaction_labels", load_if_exists=True)
features = transactions.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels.get_model_labels(label_field="is_fraud")
deployed_model = model_fully_trained.deploy(
    name="predict_only_model", input=features, labels=label, predict_only=True
)
outputs = deployed_model.get_outputs()
outputs[-1]
Next Steps
In this notebook, we discussed how to train models in a batch paradigm and deploy them. In a separate notebook, we'll cover two different strategies for updating models: (i) starting from a batch-trained model and using continual learning, and (ii) training models incrementally in a batch paradigm and updating the deployment with newer versions.