Batch APIs
TurboML primarily operates in streaming mode, continuously updating its components as fresh data arrives. However, TurboML also supports good ol' fashioned batch APIs. We've already seen examples of this for feature engineering in the quickstart notebook. In this notebook, we'll focus primarily on batch APIs for ML modelling.
To make this more interesting, we'll show how we can still have incremental training on batch data.
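As a rough illustration of what incremental training on batch data means, the sketch below trains a toy online logistic regression on one batch and then continues training on a second batch. Everything in it (the TinyOnlineLogReg class, the accuracy helper, the synthetic data) is hypothetical and written in plain Python; it is not a TurboML model. The toy updates itself in place, which may differ from how TurboML's learn handles model state.

```python
import math
import random


class TinyOnlineLogReg:
    """Toy online logistic regression; a stand-in, not a TurboML model."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.n_seen = 0

    def predict_one(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, X, y):
        # One gradient step per row. State persists across calls, so a
        # second call continues from where the first one stopped.
        for x, target in zip(X, y):
            err = self.predict_one(x) - target
            self.w = [w - self.lr * err * v for w, v in zip(self.w, x)]
            self.b -= self.lr * err
            self.n_seen += 1
        return self


def accuracy(model, X, y):
    hits = sum((model.predict_one(x) >= 0.5) == bool(t) for x, t in zip(X, y))
    return hits / len(y)


# Synthetic, linearly separable data (hypothetical, for illustration only).
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
y = [1 if a + b > 0 else 0 for a, b in X]

model = TinyOnlineLogReg(n_features=2)
model.learn(X[:200], y[:200])   # first batch
seen_after_first = model.n_seen
model.learn(X[200:], y[200:])   # second batch updates the same model
print(seen_after_first, model.n_seen, accuracy(model, X, y))
```

The point is only that the second learn call does not start from scratch: the weights learned from the first batch are the starting point for the second.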
import turboml as tb
import pandas as pd
from sklearn import metrics
transactions_df = pd.read_csv("data/transactions.csv").reset_index()
labels_df = pd.read_csv("data/labels.csv").reset_index()
Dataset
We can use the same PandasDataset class to create a batch dataset by setting the streaming argument to False. With this, operations such as feature engineering and extracting inputs and labels remain the same.
We're creating this dataset only using the first 100K rows.
transactions_100k = tb.PandasDataset(
    dataframe=transactions_df[:100000], key_field="index", streaming=False
)
labels_100k = tb.PandasDataset(
    dataframe=labels_df[:100000], key_field="index", streaming=False
)
numerical_fields = [
    "transactionAmount",
    "localHour",
]
categorical_fields = [
    "digitalItemCount",
    "physicalItemCount",
    "isProxyIP",
]
features = transactions_100k.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_100k.get_label_field(label_field="is_fraud")
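Both datasets above are keyed on a column named index, which appears to come from the reset_index() call made when the CSV files were loaded. The sketch below uses a small hypothetical DataFrame (not the real transactions data) to show what that call produces, and that positional slices such as df[:3] and df[3:] keep distinct key values, so a later batch does not reuse keys from an earlier one.

```python
import pandas as pd

# Hypothetical toy frame standing in for transactions.csv.
df = pd.DataFrame({"transactionAmount": [10.0, 25.5, 7.2, 99.0, 3.4]}).reset_index()

first = df[:3]  # positional slice: first three rows
rest = df[3:]   # remaining rows

print(list(df.columns))         # reset_index() added an "index" column
print(first["index"].tolist())  # key values in the first batch
print(rest["index"].tolist())   # key values in the second batch
```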
Training
With the features and label defined, we can train a model in batch mode using the learn method.
model = tb.HoeffdingTreeClassifier(n_classes=2)
model_trained_100K = model.learn(features, label)
We've trained a model on the first 100K rows. Now, to update this model on the remaining data, we can create another batch dataset and call the learn method again. Note that this time, learn is called on an already trained model.
transactions_full = tb.PandasDataset(
    dataframe=transactions_df[100000:], key_field="index", streaming=False
)
labels_full = tb.PandasDataset(
    dataframe=labels_df[100000:], key_field="index", streaming=False
)
features = transactions_full.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_full.get_label_field(label_field="is_fraud")
model_fully_trained = model_trained_100K.learn(features, label)
Inference
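We've seen batch inference on deployed models in the quickstart notebook. We can also perform batch inference on these models using the predict method. Here, features still refers to the rows from 100,000 onward, so the first model is scored on rows it has not seen, while the fully trained model is scored on rows it was just trained on.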
outputs = model_trained_100K.predict(features)
print(metrics.roc_auc_score(labels_df["is_fraud"][100000:], outputs["score"]))
outputs = model_fully_trained.predict(features)
print(metrics.roc_auc_score(labels_df["is_fraud"][100000:], outputs["score"]))
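To make the two printed numbers easier to interpret, here is a minimal pure-Python sketch of what ROC AUC measures for binary labels: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The roc_auc function and the toy labels and scores are illustrative only. The double loop is quadratic in the number of rows, so metrics.roc_auc_score remains the right tool for real data.

```python
def roc_auc(labels, scores):
    """Pairwise definition of ROC AUC for binary labels (0 or 1)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of fraud above non-fraud.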
Deployment
So far, we've only trained a model; we haven't deployed it yet. Deploying a batch-trained model is exactly like any other model deployment, except that we set the predict_only option to True. This means the model won't be updated automatically.
transactions = tb.PandasDataset(
    dataset_name="transactions_batch_api",
    key_field="index",
    dataframe=transactions_df,
    upload=True,
)
labels = tb.PandasDataset(
    dataset_name="labels_batch_api", key_field="index", dataframe=labels_df, upload=True
)
features = transactions.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels.get_label_field(label_field="is_fraud")
deployed_model = model_fully_trained.deploy(
    name="predict_only_model", input=features, labels=label, predict_only=True
)
outputs = deployed_model.get_outputs()
outputs[-1]
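The effect of predict_only can be pictured with a toy sketch. The classes below (RunningMeanModel, ToyDeployment) are hypothetical and say nothing about how TurboML implements deployments; they only contrast a frozen model, which keeps serving the predictions it learned in batch, with one that keeps learning from labels as they arrive.

```python
class RunningMeanModel:
    """Toy model that predicts the mean of the labels it has learned from."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def learn_one(self, label):
        self.total += label
        self.count += 1

    def predict(self):
        return self.total / self.count if self.count else 0.0


class ToyDeployment:
    """Hypothetical wrapper; not TurboML's deployment object."""

    def __init__(self, model, predict_only):
        self.model = model
        self.predict_only = predict_only

    def on_event(self, label):
        prediction = self.model.predict()
        if not self.predict_only:
            # Only a non-frozen deployment learns from the incoming label.
            self.model.learn_one(label)
        return prediction


def batch_trained():
    # "Batch training" on two labels, both equal to 1.
    model = RunningMeanModel()
    for label in [1, 1]:
        model.learn_one(label)
    return model


stream = [0, 0, 0, 0]
frozen = ToyDeployment(batch_trained(), predict_only=True)
live = ToyDeployment(batch_trained(), predict_only=False)
frozen_preds = [frozen.on_event(label) for label in stream]
live_preds = [live.on_event(label) for label in stream]
print(frozen_preds)  # stays at the batch-trained value
print(live_preds)    # drifts toward the new labels
```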
Next Steps
In this notebook, we discussed how to train models in a batch paradigm and deploy them. In a separate notebook, we'll cover two different strategies for updating models: (i) starting from a batch-trained model and using continual learning, and (ii) training models incrementally in a batch paradigm and updating the deployment with newer versions.