Algorithm Tuning

Algorithm Tuning lets us evaluate different models on a given dataset and identify which one achieves the highest value of a user-defined performance metric on that dataset.

Importing the necessary modules and reading the dataset.

    import turboml as tb
    import pandas as pd
    from sklearn import metrics
    transactions_df = pd.read_csv("data/transactions.csv").reset_index()
    labels_df = pd.read_csv("data/labels.csv").reset_index()
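
Optionally, a quick look at the raw data confirms that it loaded as expected. This check uses plain pandas and is not TurboML-specific.

    # Optional: inspect the raw columns and the label distribution
    print(transactions_df.head())
    print(labels_df["is_fraud"].value_counts())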

Dataset

We use the PandasDataset class to create the datasets used for tuning, and configure them to indicate which column is the primary key.

For this example, we use the first 100k rows.

    transactions_100k = tb.PandasDataset(
        dataframe=transactions_df[:100000], key_field="index", streaming=False
    )
    labels_100k = tb.PandasDataset(
        dataframe=labels_df[:100000], key_field="index", streaming=False
    )
    numerical_fields = [
        "transactionAmount",
    ]
    categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
    inputs = transactions_100k.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    label = labels_100k.get_label_field(label_field="is_fraud")
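
Before tuning, it can help to sanity-check that the selected feature columns exist and have sensible types. The snippet below uses plain pandas and is independent of TurboML.

    # Optional sanity check on the chosen feature columns
    feature_cols = numerical_fields + categorical_fields
    print(transactions_df[feature_cols].dtypes)
    print(transactions_df[feature_cols].isna().sum())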

Training/Tuning

We will compare the Neural Network and the Hoeffding Tree Classifier, optimizing for accuracy.

Configuring the NN according to the dataset: since this is a binary classification task, the output layer has two units.

    new_layer = tb.NNLayer(output_size=2)
    
    nn = tb.NeuralNetwork()
    nn.layers.append(new_layer)
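
If you want to experiment with a deeper network, you could append additional layers ahead of the output layer. The sketch below is only an illustration: it assumes nn.layers behaves like an ordinary Python list and that NNLayer needs only output_size, as in the example above; other layer parameters are not shown.

    # Sketch only: a deeper network with a hypothetical 64-unit hidden layer
    # followed by the 2-unit output layer (layer semantics are assumed).
    deeper_nn = tb.NeuralNetwork()
    deeper_nn.layers.append(tb.NNLayer(output_size=64))
    deeper_nn.layers.append(tb.NNLayer(output_size=2))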

The algorithm_tuning function takes the models to test as a list, along with the metric to evaluate them against, and returns (model, score) pairs sorted by that metric, so the first entry corresponds to the best-performing model.

    model_score_list = tb.algorithm_tuning(
        models_to_test=[
            tb.HoeffdingTreeClassifier(n_classes=2),
            nn,
        ],
        metric_to_optimize="accuracy",
        input=inputs,
        labels=label,
    )
    best_model, best_score = model_score_list[0]
    best_model
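
Since the list is ordered best-first, it is easy to see how every candidate fared, not just the winner. A minimal sketch, assuming each entry is a (model, score) pair as the unpacking above implies:

    # Print each candidate model alongside its score on this dataset
    for model, score in model_score_list:
        print(type(model).__name__, score)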

Testing

After identifying the best-performing model, we can use it as usual for inference on the entire dataset and evaluate it against additional performance metrics.

    transactions_full = tb.PandasDataset(
        dataframe=transactions_df, key_field="index", streaming=False
    )
    features = transactions_full.get_input_fields(
        numerical_fields=numerical_fields, categorical_fields=categorical_fields
    )
    
    outputs = best_model.predict(features)
    print(
        "Accuracy: ",
        metrics.accuracy_score(labels_df["is_fraud"], outputs["predicted_class"]),
    )
    print("F1: ", metrics.f1_score(labels_df["is_fraud"], outputs["predicted_class"]))