Batch APIs

The main mode of operation in TurboML is streaming, where components are continuously updated with fresh data. However, TurboML also supports good ol' fashioned batch APIs. We've already seen examples of this for feature engineering in the quickstart notebook. In this notebook, we'll focus primarily on batch APIs for ML modelling.

To make this more interesting, we'll show how we can still have incremental training on batch data.

    import turboml as tb
    import pandas as pd
    from sklearn import metrics
    transactions_df = pd.read_csv("data/transactions.csv").reset_index()
    labels_df = pd.read_csv("data/labels.csv").reset_index()

Dataset

We can use the same PandasDataset class to create a batch dataset by setting the streaming argument to False. With this, operations such as feature engineering and extracting inputs/labels remain the same.

We're creating this dataset only using the first 100K rows.

    transactions_100k = tb.PandasDataset(
        dataframe=transactions_df[:100000], key_field="index", streaming=False
    )
    labels_100k = tb.PandasDataset(
        dataframe=labels_df[:100000], key_field="index", streaming=False
    )
    numerical_fields = [
        "transactionAmount",
        "localHour",
    ]
    categorical_fields = [
        "digitalItemCount",
        "physicalItemCount",
        "isProxyIP",
    ]
    features = transactions_100k.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    label = labels_100k.get_label_field(label_field="is_fraud")

Training

With the features and label defined, we can train a model in batch mode using the learn method.

    model = tb.HoeffdingTreeClassifier(n_classes=2)
    model_trained_100K = model.learn(features, label)

We've trained a model on the first 100K rows. Now, to update this model on the remaining data, we can create another batch dataset and call the learn method. Note that this time, learn is called on a trained model.

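    # Despite the "_full" names, these datasets hold only the remaining rows
    # (100000 onward), i.e. the rows the model has not been trained on yet.
    transactions_full = tb.PandasDataset(
        dataframe=transactions_df[100000:], key_field="index", streaming=False
    )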
    labels_full = tb.PandasDataset(
        dataframe=labels_df[100000:], key_field="index", streaming=False
    )
    
    features = transactions_full.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    label = labels_full.get_label_field(label_field="is_fraud")
    model_fully_trained = model_trained_100K.learn(features, label)
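
The pattern above, where learn is called on an already trained model and hands back an updated one, can be illustrated with a toy model. The sketch below is a hypothetical stand-in, not TurboML's HoeffdingTreeClassifier: CountingClassifier and its fields are made-up names, and it only tracks label counts per category. It assumes, as the notebook's usage suggests, that learn returns a new model and leaves the original one usable.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CountingClassifier:
    """Toy incremental model (hypothetical): scores a category by the
    fraction of positive labels seen for it so far."""

    positives: Counter = field(default_factory=Counter)
    totals: Counter = field(default_factory=Counter)

    def learn(self, rows, labels):
        # Copy the counters so the model we were called on stays unchanged.
        positives = Counter(self.positives)
        totals = Counter(self.totals)
        for key, y in zip(rows, labels):
            totals[key] += 1
            positives[key] += y
        return CountingClassifier(positives, totals)

    def predict(self, rows):
        # Unseen categories get a neutral score of 0.5.
        return [
            self.positives[k] / self.totals[k] if self.totals[k] else 0.5
            for k in rows
        ]


# Two small batches of (category, label) data, made up for illustration.
batch_1 = (["proxy", "direct", "proxy"], [1, 0, 1])
batch_2 = (["direct", "proxy", "direct"], [0, 0, 1])

model = CountingClassifier()
model_after_1 = model.learn(*batch_1)          # trained on the first batch
model_after_2 = model_after_1.learn(*batch_2)  # updated with the second batch

# Reference: a fresh model trained on both batches in one call.
model_all = CountingClassifier().learn(
    batch_1[0] + batch_2[0], batch_1[1] + batch_2[1]
)

print(model_after_1.predict(["proxy", "direct"]))  # [1.0, 0.0]
print(model_after_2 == model_all)                  # True
```

For this toy model, learning in two batches yields exactly the same state as learning on all rows at once. That equivalence is a property of simple counting and should not be assumed for every incremental algorithm.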

Inference

We've seen batch inference on deployed models in the quickstart notebook. We can also perform batch inference on these models using the predict method.

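    # Model trained on the first 100K rows only, scored on the remaining rows
    outputs = model_trained_100K.predict(features)
    print(metrics.roc_auc_score(labels_df["is_fraud"][100000:], outputs["score"]))
    # Model that has also learned from the remaining rows, scored on those same rows
    outputs = model_fully_trained.predict(features)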
    print(metrics.roc_auc_score(labels_df["is_fraud"][100000:], outputs["score"]))
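
The two printed numbers are ROC AUC values. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. The function name roc_auc is ours, not part of sklearn or TurboML. It computes the fraction of positive/negative pairs in which the positive row receives the higher score, counting ties as half a win. For binary labels, this pairwise definition agrees with the area under the ROC curve.

```python
def roc_auc(y_true, scores):
    """Pairwise ROC AUC for binary labels (1 = positive, 0 = negative)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    # A pair counts as 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for small examples; for real data, metrics.roc_auc_score as used above is the practical choice.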

Deployment

So far, we've only trained a model; we haven't deployed it yet. Deploying a batch-trained model is exactly like any other model deployment, except that we set the predict_only option to True. This means the model won't be updated automatically.

    transactions = tb.PandasDataset(
        dataset_name="transactions_batch_api",
        key_field="index",
        dataframe=transactions_df,
        upload=True,
    )
    labels = tb.PandasDataset(
        dataset_name="labels_batch_api", key_field="index", dataframe=labels_df, upload=True
    )
    features = transactions.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    label = labels.get_label_field(label_field="is_fraud")
    deployed_model = model_fully_trained.deploy(
        name="predict_only_model", input=features, labels=label, predict_only=True
    )
    outputs = deployed_model.get_outputs()
    outputs[-1]
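
To make the effect of predict_only concrete, the following sketch contrasts a frozen model with one that keeps learning from incoming labels. Everything here is hypothetical: the ToyDeployment class, its on_new_record method, and the counting logic are illustrative stand-ins and say nothing about how TurboML implements deployments.

```python
class ToyDeployment:
    """Hypothetical deployed model that scores by the positive rate seen so far."""

    def __init__(self, positives, total, predict_only):
        self.positives = positives
        self.total = total
        self.predict_only = predict_only

    def score(self):
        return self.positives / self.total if self.total else 0.5

    def on_new_record(self, label):
        output = self.score()      # a prediction is always produced
        if not self.predict_only:  # only a non-frozen model learns from the label
            self.positives += label
            self.total += 1
        return output


# Both start from the same batch-trained state: 1 positive out of 4 rows.
frozen = ToyDeployment(positives=1, total=4, predict_only=True)
live = ToyDeployment(positives=1, total=4, predict_only=False)

for label in [1, 1, 1, 1]:  # four new labeled records arrive
    frozen.on_new_record(label)
    live.on_new_record(label)

print(frozen.score())  # 0.25
print(live.score())    # 0.625
```

The frozen deployment keeps serving the state it was deployed with, while the live one drifts toward the new data.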

Next Steps

In this notebook, we discussed how to train models in a batch paradigm and deploy them. In a separate notebook, we'll cover two different strategies to update models: (i) starting from a batch-trained model and using continual learning, and (ii) training models incrementally in a batch paradigm and updating the deployment with newer versions.