Algorithm Tuning
Algorithm tuning lets us test several models on a given dataset and identify which one achieves the highest value of a user-defined performance metric on that dataset.
import turboml as tb
import pandas as pd
from sklearn import metrics
Dataset
We use our standard FraudDetection dataset for this example, exposed through the LocalDataset interface that can be used for tuning, and also configure the dataset to indicate the column with the primary key.
For this example, we use the first 100k rows.
transactions = tb.datasets.FraudDetectionDatasetFeatures()
labels = tb.datasets.FraudDetectionDatasetLabels()
transactions_100k = transactions[:100000]
labels_100k = labels[:100000]
numerical_fields = [
    "transactionAmount",
]
categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
inputs = transactions_100k.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_100k.get_model_labels(label_field="is_fraud")
Training/Tuning
We will be comparing the Neural Network and Hoeffding Tree Classifier, and the metric we will be optimizing is accuracy.
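First, we configure the neural network for this dataset by appending a layer with an output size of 2, matching the two classes used for the Hoeffding Tree Classifier.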
new_layer = tb.NNLayer(output_size=2)
nn = tb.NeuralNetwork()
nn.layers.append(new_layer)
The algorithm_tuning function takes the models being tested as a list, along with the metric to test against. In the code that follows, its result is indexed as a list of (model, score) pairs, and the first entry is taken as the model with the highest score for the given metric.
model_score_list = tb.algorithm_tuning(
    models_to_test=[
        tb.HoeffdingTreeClassifier(n_classes=2),
        nn,
    ],
    metric_to_optimize="accuracy",
    input=inputs,
    labels=label,
)
best_model, best_score = model_score_list[0]
best_model
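Conceptually, this kind of tuning scores every candidate with the same metric and ranks the results. The sketch below illustrates only that ranking idea in plain Python. It is not TurboML's implementation, it skips training entirely, and `rank_models`, `MajorityClass`, and `ThresholdRule` are hypothetical names invented for the illustration, as is the toy data.

```python
from typing import Callable, List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


class MajorityClass:
    """Hypothetical toy model: always predicts class 0 (not fraud)."""

    def predict(self, rows: Sequence[float]) -> List[int]:
        return [0 for _ in rows]


class ThresholdRule:
    """Hypothetical toy model: flags amounts above a fixed threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, rows: Sequence[float]) -> List[int]:
        return [1 if amount > self.threshold else 0 for amount in rows]


def rank_models(
    models: Sequence[object],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    rows: Sequence[float],
    labels: Sequence[int],
) -> List[Tuple[object, float]]:
    """Score every model with the same metric and sort best-first."""
    scored = [(model, metric(labels, model.predict(rows))) for model in models]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up transaction amounts and fraud labels, for illustration only.
amounts = [12.0, 950.0, 30.0, 15.0, 1200.0, 40.0]
is_fraud = [0, 1, 0, 0, 1, 0]

ranking = rank_models(
    [MajorityClass(), ThresholdRule(500.0)], accuracy, amounts, is_fraud
)
best_model, best_score = ranking[0]
print(type(best_model).__name__, best_score)
```

Returning the full ranked list, rather than only the winner, also makes it easy to see how close the runner-up was.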
Testing
After finding the best-performing model, we can use it as usual for inference on the rest of the dataset (the rows after the first 100k) and evaluate it on additional performance metrics.
transactions_test = transactions[100000:]
labels_test = labels[100000:]
features = transactions_test.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
outputs = best_model.predict(features)
print(
    "Accuracy: ",
    metrics.accuracy_score(labels_test.df["is_fraud"], outputs["predicted_class"]),
)
print("F1: ", metrics.f1_score(labels_test.df["is_fraud"], outputs["predicted_class"]))
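Reporting F1 alongside accuracy is useful when the positive class is rare, as is typical for fraud labels. The self-contained sketch below uses made-up labels (not results from this dataset) to show how a model that never predicts fraud can still score high accuracy while its F1 is zero. The helper names are illustrative, and F1 is defined as 0 when there are no true positives, false positives, or false negatives.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    # Assumption: treat F1 as 0.0 when it is otherwise undefined.
    return 2 * tp / denominator if denominator else 0.0


# Made-up data: 95 legitimate transactions followed by 5 fraudulent ones.
y_true = [0] * 95 + [1] * 5

# Predicts "not fraud" for everything.
always_legit = [0] * 100

# Raises 2 false alarms and catches 3 of the 5 fraud cases.
catches_some = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2

for name, y_pred in [("always_legit", always_legit), ("catches_some", catches_some)]:
    print(name, "accuracy:", accuracy(y_true, y_pred), "F1:", f1(y_true, y_pred))
```

The two accuracy values differ by only one point, while the F1 values separate the models clearly, which is why both metrics are worth reporting here.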