
Algorithm Tuning


Algorithm tuning lets us test several models on a given dataset and identify which one achieves the highest value of a user-defined performance metric on that dataset.
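The underlying idea can be sketched in plain Python: fit each candidate model, score it with the chosen metric, and rank the candidates best-first. This is only an illustrative batch sketch with toy stand-in models (`MajorityClass`, `ThresholdRule`, and the `tune` helper are hypothetical), not TurboML's streaming implementation.

```python
# Illustrative sketch of algorithm tuning: rank candidate models by a metric.
# The models and the tune() helper here are toy stand-ins, not TurboML APIs.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

class MajorityClass:
    """Baseline: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
    def predict(self, X):
        return [self.label] * len(X)

class ThresholdRule:
    """Predicts 1 when the single feature exceeds a fixed threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def fit(self, X, y):
        pass  # nothing to learn for a fixed rule
    def predict(self, X):
        return [1 if x[0] > self.threshold else 0 for x in X]

def tune(models, metric, X, y):
    """Fit each candidate, score it, and return (model, score) pairs, best first."""
    scored = []
    for model in models:
        model.fit(X, y)
        scored.append((model, metric(y, model.predict(X))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: one numeric feature, binary label.
X = [[10], [200], [15], [500], [30], [900], [40]]
y = [0, 1, 0, 1, 0, 1, 0]
ranked = tune([MajorityClass(), ThresholdRule(100)], accuracy, X, y)
best_model, best_score = ranked[0]
print(type(best_model).__name__, best_score)  # ThresholdRule 1.0
```

TurboML's `algorithm_tuning` follows the same selection principle, but trains and evaluates the candidates on streaming data.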

import turboml as tb
import pandas as pd
from sklearn import metrics

Dataset

For this example we use our standard FraudDetection dataset, exposed through the LocalDataset interface so it can be used for tuning, and we configure the dataset to indicate which column is the primary key.

For this example, we use the first 100k rows.

transactions = tb.datasets.FraudDetectionDatasetFeatures()
labels = tb.datasets.FraudDetectionDatasetLabels()
transactions_100k = transactions[:100000]
labels_100k = labels[:100000]
 
numerical_fields = [
    "transactionAmount",
]
categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
inputs = transactions_100k.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_100k.get_model_labels(label_field="is_fraud")

Training/Tuning

We will compare a Neural Network against a Hoeffding Tree Classifier, optimizing for accuracy.

Configuring the NN according to the dataset.

new_layer = tb.NNLayer(output_size=2)
 
nn = tb.NeuralNetwork()
nn.layers.append(new_layer)

The algorithm_tuning function takes the models to test as a list, along with the metric to optimize, and returns a list of (model, score) pairs sorted by that metric, so the best-performing model comes first.

model_score_list = tb.algorithm_tuning(
    models_to_test=[
        tb.HoeffdingTreeClassifier(n_classes=2),
        nn,
    ],
    metric_to_optimize="accuracy",
    input=inputs,
    labels=label,
)
best_model, best_score = model_score_list[0]
best_model

Testing

Having identified the best-performing model, we can use it as usual for inference on the remaining rows of the dataset and evaluate it against additional performance metrics.

transactions_test = transactions[100000:]
labels_test = labels[100000:]
features = transactions_test.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
 
outputs = best_model.predict(features)
print(
    "Accuracy: ",
    metrics.accuracy_score(labels_test.df["is_fraud"], outputs["predicted_class"]),
)
print("F1: ", metrics.f1_score(labels_test.df["is_fraud"], outputs["predicted_class"]))
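For intuition, both metrics reported above can be derived by hand from the confusion counts. A minimal self-contained sketch on toy labels (not the fraud data) shows how accuracy and F1 relate to true/false positives and negatives:

```python
# Toy labels and predictions to illustrate how accuracy and F1 are computed.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)          # fraction of correct predictions
precision = tp / (tp + fp)                  # correctness of positive calls
recall = tp / (tp + fn)                     # coverage of actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print("Accuracy:", accuracy, "F1:", f1)     # Accuracy: 0.75 F1: 0.75
```

On an imbalanced problem like fraud detection, F1 is often more informative than accuracy, since a model that predicts "not fraud" everywhere can still score high on accuracy alone.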