Hyperparameter Tuning
Hyperparameter tuning uses grid search to scan a given hyperparameter space for a model and find the combination of hyperparameters that performs best on a given performance metric.
We start by importing the necessary modules and reading the dataset.
import turboml as tb
import pandas as pd
from sklearn import metrics
transactions_df = pd.read_csv("data/transactions.csv").reset_index()
labels_df = pd.read_csv("data/labels.csv").reset_index()
Dataset
We use the PandasDataset class to create a dataset to be used for tuning.
For this example, we use the first 100k rows.
transactions_100k = tb.PandasDataset(
    dataframe=transactions_df[:100000], key_field="index", streaming=False
)
labels_100k = tb.PandasDataset(
    dataframe=labels_df[:100000], key_field="index", streaming=False
)
numerical_fields = ["transactionAmount", "localHour"]
categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
inputs = transactions_100k.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_100k.get_label_field(label_field="is_fraud")
Training/Tuning
As an example, we will use the AdaBoost Classifier with the Hoeffding Tree Classifier as its base model.
model_to_tune = tb.AdaBoostClassifier(
    n_classes=2, base_model=tb.HoeffdingTreeClassifier(n_classes=2)
)
Since a particular model object can include other base models and PreProcessors as well, the hyperparameter_tuning function accepts a list of hyperparameter spaces, one for each such model contained in the model parameter, and tests all possible combinations across the different spaces.
In this example, the first dictionary in the list corresponds to the hyperparameters of AdaBoostClassifier, while the second dictionary is the hyperparameter space for the HoeffdingTreeClassifier.
It is not necessary to include all possible hyperparameters in the space; default values are used for those not specified.
model_score_list = tb.hyperparameter_tuning(
    metric_to_optimize="accuracy",
    model=model_to_tune,
    hyperparameter_space=[
        {"n_models": [2, 3]},
        {
            "delta": [1e-7, 1e-5, 1e-3],
            "tau": [0.05, 0.01, 0.1],
            "grace_period": [200, 100, 500],
            "n_classes": [2],
            "leaf_pred_method": ["mc"],
            "split_method": ["gini", "info_gain", "hellinger"],
        },
    ],
    input=inputs,
    labels=label,
)
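To make "all possible combinations" concrete, the sketch below enumerates the Cartesian product of the two spaces from the call above using only the standard library. It illustrates how quickly a grid grows and is not TurboML's internal implementation; the helper name expand_grid is hypothetical.

```python
from itertools import product


def expand_grid(space):
    """Return every combination of one hyperparameter space as a list of dicts."""
    names = list(space)
    value_lists = (space[name] for name in names)
    return [dict(zip(names, values)) for values in product(*value_lists)]


# The same two spaces that are passed to hyperparameter_tuning above.
hyperparameter_space = [
    {"n_models": [2, 3]},
    {
        "delta": [1e-7, 1e-5, 1e-3],
        "tau": [0.05, 0.01, 0.1],
        "grace_period": [200, 100, 500],
        "n_classes": [2],
        "leaf_pred_method": ["mc"],
        "split_method": ["gini", "info_gain", "hellinger"],
    },
]

# Expand each space on its own, then pair every AdaBoost setting
# with every Hoeffding tree setting.
per_model_grids = [expand_grid(space) for space in hyperparameter_space]
all_combinations = list(product(*per_model_grids))

print(len(per_model_grids[0]))  # 2 AdaBoost configurations
print(len(per_model_grids[1]))  # 3 * 3 * 3 * 1 * 1 * 3 = 81 tree configurations
print(len(all_combinations))    # 2 * 81 = 162 candidates in total
```

Every additional value in any list multiplies the number of candidates, so single-valued entries such as n_classes and leaf_pred_method pin a setting without enlarging the search.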
best_model, best_score = model_score_list[0]
best_model
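The unpacking above implies that the result is a list of (model, score) pairs with the best candidate first. Under that assumption, a few runner-up candidates can be inspected with a small helper. The helper top_candidates and the stand-in strings are hypothetical; the strings replace real model objects so that the sketch runs without a TurboML backend.

```python
def top_candidates(model_score_list, k=3):
    """Return the k best (model, score) pairs.

    Re-sorts by score in descending order, which assumes that a higher
    score is better (true for accuracy).
    """
    ranked = sorted(model_score_list, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Stand-in results: strings take the place of tuned model objects.
stand_in_results = [
    ("candidate_a", 0.91),
    ("candidate_b", 0.89),
    ("candidate_c", 0.93),
    ("candidate_d", 0.80),
]

for model, score in top_candidates(stand_in_results, k=2):
    print(f"{model}: {score:.2f}")
```

Comparing the top few scores shows whether the best configuration is clearly ahead or whether several settings perform about the same.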
transactions_full = tb.PandasDataset(
    dataframe=transactions_df, key_field="index", streaming=False
)
features = transactions_full.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
outputs = best_model.predict(features)
print(
    "Accuracy: ",
    metrics.accuracy_score(labels_df["is_fraud"], outputs["predicted_class"]),
)
print("F1: ", metrics.f1_score(labels_df["is_fraud"], outputs["predicted_class"]))
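Reporting F1 alongside accuracy is useful when the positive class is rare, which is common for fraud labels (the class balance of this dataset is an assumption here, not something stated above). The sketch below recomputes both metrics by hand on made-up labels to show how they can diverge; the data and function names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1 for the positive class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up labels: 95 legitimate transactions (0) and 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5

# A model that never flags fraud, and one that catches 3 of the 5 cases.
always_negative = [0] * 100
catches_some = [0] * 95 + [1, 1, 1, 0, 0]

print(accuracy(y_true, always_negative), f1(y_true, always_negative))  # 0.95 0.0
print(accuracy(y_true, catches_some), f1(y_true, catches_some))        # 0.98 0.75
```

The model that never flags fraud still reaches 95% accuracy, while its F1 of 0 shows that it finds no positive cases at all.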