Hyperparameter Tuning

Hyperparameter tuning uses grid search to scan a given hyperparameter space for a model and find the combination of hyperparameters that performs best on a given performance metric.

import turboml as tb

from sklearn import metrics

Dataset

We use our standard FraudDetection dataset for this example. It is exposed through the LocalDataset interface, which can be used for tuning, and the dataset is also configured to indicate which column holds the primary key.

For this example, we use the first 100k rows.

transactions = tb.datasets.FraudDetectionDatasetFeatures()
labels = tb.datasets.FraudDetectionDatasetLabels()
 
transactions_100k = transactions[:100000]
labels_100k = labels[:100000]

numerical_fields = ["transactionAmount", "localHour"]
categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
inputs = transactions_100k.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_100k.get_model_labels(label_field="is_fraud")

Training/Tuning

As an example, we tune an AdaBoost Classifier that uses a Hoeffding Tree Classifier as its base model.

model_to_tune = tb.AdaBoostClassifier(
    n_classes=2, base_model=tb.HoeffdingTreeClassifier(n_classes=2)
)

Since a model object can itself contain other base models and PreProcessors, the hyperparameter_tuning function accepts a list of hyperparameter spaces covering all the models that make up the model argument, and tests every possible combination across the different spaces.

In this example, the first dictionary in the list corresponds to the hyperparameters of AdaBoostClassifier while the second dictionary is the hyperparameter space for the HoeffdingTreeClassifier.

It is not necessary to include every possible hyperparameter in the space; any hyperparameter that is not specified keeps its default value.

model_score_list = tb.hyperparameter_tuning(
    metric_to_optimize="accuracy",
    model=model_to_tune,
    hyperparameter_space=[
        {"n_models": [2, 3]},
        {
            "delta": [1e-7, 1e-5, 1e-3],
            "tau": [0.05, 0.01, 0.1],
            "grace_period": [200, 100, 500],
            "n_classes": [2],
            "leaf_pred_method": ["mc"],
            "split_method": ["gini", "info_gain", "hellinger"],
        },
    ],
    input=inputs,
    labels=label,
)
best_model, best_score = model_score_list[0]
best_model
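To get a feel for how large this search is, the sketch below enumerates the combinations implied by the two hyperparameter spaces above using only the Python standard library. It illustrates the grid-search idea and is not TurboML's internal implementation; the helper names expand and expand_all are hypothetical.

```python
from itertools import product

# Hyperparameter spaces copied from the example: one dict per model.
adaboost_space = {"n_models": [2, 3]}
hoeffding_space = {
    "delta": [1e-7, 1e-5, 1e-3],
    "tau": [0.05, 0.01, 0.1],
    "grace_period": [200, 100, 500],
    "n_classes": [2],
    "leaf_pred_method": ["mc"],
    "split_method": ["gini", "info_gain", "hellinger"],
}


def expand(space):
    """Return every parameter assignment in one space as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


def expand_all(spaces):
    """Return every cross-space combination as a list of tuples of dicts."""
    return list(product(*(expand(space) for space in spaces)))


candidates = expand_all([adaboost_space, hoeffding_space])
print(len(candidates))  # 2 * (3 * 3 * 3 * 1 * 1 * 3) = 162
print(candidates[0])    # first (AdaBoost params, Hoeffding Tree params) pair
```

Each extra value multiplies the number of candidates, so listing only the hyperparameters you actually want to vary keeps the search manageable.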

# Evaluate the best model on the full dataset, not only the 100k rows used for tuning.
features = transactions.get_model_inputs(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
 
outputs = best_model.predict(features)

labels_df = labels.df
print(
    "Accuracy: ",
    metrics.accuracy_score(labels_df["is_fraud"], outputs["predicted_class"]),
)
print("F1: ", metrics.f1_score(labels_df["is_fraud"], outputs["predicted_class"]))
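The example reports both accuracy and F1 because the two can disagree when the positive class is rare, as is common in fraud data. The sketch below uses made-up labels (not the FraudDetection dataset) and small pure-Python helpers, accuracy and f1, which are hypothetical stand-ins for metrics.accuracy_score and metrics.f1_score in the binary case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    return 2 * tp / (2 * tp + fp + fn)


# Made-up labels: 95 legitimate rows followed by 5 fraudulent ones.
y_true = [0] * 95 + [1] * 5
always_legit = [0] * 100                   # never flags fraud
catches_some = [0] * 95 + [1, 1, 1, 0, 0]  # finds 3 of the 5 frauds

print(accuracy(y_true, always_legit), f1(y_true, always_legit))  # 0.95 0.0
print(accuracy(y_true, catches_some), f1(y_true, catches_some))  # 0.98 0.75
```

A model that never predicts fraud still reaches 95% accuracy on these labels, while its F1 is 0. For that reason it can be worth checking F1 alongside the accuracy metric that the tuning run optimizes.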
