Hyperparameter Tuning

Hyperparameter tuning uses grid search to scan a given hyperparameter space for a model and find the combination of hyperparameters that performs best on a given metric.
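
Conceptually, grid search enumerates the Cartesian product of the candidate values and scores each combination. A minimal stand-alone sketch of the idea, using only the standard library and scikit-learn (not TurboML's API, and with a placeholder model and space):

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Candidate values for each hyperparameter (illustrative only).
space = {"C": [0.1, 1.0, 10.0], "max_iter": [200, 400]}

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = []
# Enumerate every combination in the Cartesian product of the space.
for values in product(*space.values()):
    params = dict(zip(space.keys(), values))
    model = LogisticRegression(**params).fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    results.append((params, score))

# Sort by the metric, best first.
results.sort(key=lambda pair: pair[1], reverse=True)
best_params, best_score = results[0]
print(best_params, best_score)
```

The `hyperparameter_tuning` function below follows the same pattern, but over TurboML model objects and composite hyperparameter spaces.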

Importing the necessary modules and reading the dataset.

    import turboml as tb
    import pandas as pd
    from sklearn import metrics
    transactions_df = pd.read_csv("data/transactions.csv").reset_index()
    labels_df = pd.read_csv("data/labels.csv").reset_index()

Dataset

We use the PandasDataset class to create the dataset used for tuning.

For this example, we use the first 100k rows.

    transactions_100k = tb.PandasDataset(
        dataframe=transactions_df[:100000], key_field="index", streaming=False
    )
    labels_100k = tb.PandasDataset(
        dataframe=labels_df[:100000], key_field="index", streaming=False
    )
    numerical_fields = ["transactionAmount", "localHour"]
    categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
    inputs = transactions_100k.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    label = labels_100k.get_label_field(label_field="is_fraud")

Training/Tuning

As an example, we will tune an AdaBoost classifier with a Hoeffding tree classifier as its base model.

    model_to_tune = tb.AdaBoostClassifier(
        n_classes=2, base_model=tb.HoeffdingTreeClassifier(n_classes=2)
    )

Since a model object can itself contain base models and PreProcessors, the hyperparameter_tuning function accepts a list of hyperparameter spaces, one per model in the composition, passed alongside the model parameter, and tests all possible combinations across the different spaces.

In this example, the first dictionary in the list corresponds to the hyperparameters of AdaBoostClassifier while the second dictionary is the hyperparameter space for the HoeffdingTreeClassifier.

It is not necessary to include every hyperparameter in the space; default values are used for those not specified.
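
As a sanity check, the total number of configurations grid search will evaluate is the product of the candidate-value counts across all spaces; for the spaces used below that is 2 × 81 = 162. A quick stand-alone way to compute it:

```python
from math import prod

# The same candidate spaces passed to hyperparameter_tuning below.
hyperparameter_space = [
    {"n_models": [2, 3]},
    {
        "delta": [1e-7, 1e-5, 1e-3],
        "tau": [0.05, 0.01, 0.1],
        "grace_period": [200, 100, 500],
        "n_classes": [2],
        "leaf_pred_method": ["mc"],
        "split_method": ["gini", "info_gain", "hellinger"],
    },
]

# Product of len(values) over every hyperparameter in every space.
n_combinations = prod(
    len(values) for space in hyperparameter_space for values in space.values()
)
print(n_combinations)  # 162
```

Keeping this number in mind helps budget tuning time, since each combination trains a full model.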

    model_score_list = tb.hyperparameter_tuning(
        metric_to_optimize="accuracy",
        model=model_to_tune,
        hyperparameter_space=[
            {"n_models": [2, 3]},
            {
                "delta": [1e-7, 1e-5, 1e-3],
                "tau": [0.05, 0.01, 0.1],
                "grace_period": [200, 100, 500],
                "n_classes": [2],
                "leaf_pred_method": ["mc"],
                "split_method": ["gini", "info_gain", "hellinger"],
            },
        ],
        input=inputs,
        labels=label,
    )
    best_model, best_score = model_score_list[0]
    best_model
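
The function returns (model, score) pairs; assuming the list is sorted with the best score first (as the indexing above suggests), the top candidates can be inspected like this, sketched here with placeholder pairs rather than real TurboML model objects:

```python
# Placeholder (model, score) pairs standing in for hyperparameter_tuning's output.
model_score_list = [("model_a", 0.97), ("model_b", 0.95), ("model_c", 0.91)]

# The list is assumed sorted best-first, so the first entry is the winner.
best_model, best_score = model_score_list[0]

# Inspect the top candidates and their scores.
for rank, (model, score) in enumerate(model_score_list[:3], start=1):
    print(f"{rank}. score={score:.3f} model={model}")
```

The best model is then used below for inference on the full dataset.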
    transactions_full = tb.PandasDataset(
        dataframe=transactions_df, key_field="index", streaming=False
    )
    features = transactions_full.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    
    outputs = best_model.predict(features)
    print(
        "Accuracy: ",
        metrics.accuracy_score(labels_df["is_fraud"], outputs["predicted_class"]),
    )
    print("F1: ", metrics.f1_score(labels_df["is_fraud"], outputs["predicted_class"]))
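
Beyond accuracy and F1, a confusion matrix and per-class report give a fuller picture on imbalanced fraud data. A stand-alone sketch with placeholder labels (in the notebook, substitute `labels_df["is_fraud"]` and `outputs["predicted_class"]`):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder arrays; replace with the real labels and predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, and F1 - useful when fraud is rare.
print(classification_report(y_true, y_pred, digits=3))
```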