Algorithm Tuning
Algorithm tuning lets us test several models on a given dataset and identify which one achieves the highest value of a user-defined performance metric on that dataset.
We start by importing the necessary modules and reading the dataset.
import turboml as tb
import pandas as pd
from sklearn import metrics
transactions_df = pd.read_csv("data/transactions.csv").reset_index()
labels_df = pd.read_csv("data/labels.csv").reset_index()
Dataset
We use the PandasDataset class to create the datasets used for tuning, and configure each dataset to indicate the column that holds the primary key.
For this example, we use the first 100k rows.
transactions_100k = tb.PandasDataset(
    dataframe=transactions_df[:100000], key_field="index", streaming=False
)
labels_100k = tb.PandasDataset(
    dataframe=labels_df[:100000], key_field="index", streaming=False
)
numerical_fields = [
    "transactionAmount",
]
categorical_fields = ["digitalItemCount", "physicalItemCount", "isProxyIP"]
inputs = transactions_100k.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
label = labels_100k.get_label_field(label_field="is_fraud")
Training/Tuning
We will be comparing the Neural Network and Hoeffding Tree Classifier, and the metric we will be optimizing is accuracy.
We configure the neural network to match the dataset: the layer we add has an output size of 2, one output per class.
new_layer = tb.NNLayer(output_size=2)
nn = tb.NeuralNetwork()
nn.layers.append(new_layer)
The algorithm_tuning function takes the models to test as a list, along with the metric to optimize. As the indexing in the code below suggests, its result can be read as a list of (model, score) pairs whose first entry is the model with the highest score for the given metric.
model_score_list = tb.algorithm_tuning(
    models_to_test=[
        tb.HoeffdingTreeClassifier(n_classes=2),
        nn,
    ],
    metric_to_optimize="accuracy",
    input=inputs,
    labels=label,
)
best_model, best_score = model_score_list[0]
best_model
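To make the selection step concrete, here is a minimal pure-Python sketch of the general idea: score every candidate on one metric and sort best first. It is not TurboML's implementation. The names rank_models, accuracy, and the two toy predictors are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in for a trained model: maps one feature value to a class.
Predictor = Callable[[float], int]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rank_models(
    models: Dict[str, Predictor],
    xs: Sequence[float],
    ys: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float],
) -> List[Tuple[str, float]]:
    """Score every candidate and return (name, score) pairs, best first."""
    scored = []
    for name, predict in models.items():
        preds = [predict(x) for x in xs]
        scored.append((name, metric(ys, preds)))
    # Higher is better for accuracy, so sort in descending order of score.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): transaction amounts and fraud labels.
xs = [5.0, 20.0, 150.0, 300.0, 12.0, 500.0]
ys = [0, 0, 1, 1, 0, 1]

candidates: Dict[str, Predictor] = {
    "always_negative": lambda x: 0,
    "threshold_100": lambda x: int(x > 100.0),
}

ranking = rank_models(candidates, xs, ys, accuracy)
best_name, best_score = ranking[0]
print(best_name, best_score)
```

Taking the first element of the sorted list mirrors the model_score_list[0] unpacking above, under the assumption that the list is ordered from best to worst.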
Testing
Once we have found the best-performing model, we can use it as usual for inference on the entire dataset and evaluate it on additional performance metrics.
transactions_full = tb.PandasDataset(
    dataframe=transactions_df, key_field="index", streaming=False
)
features = transactions_full.get_input_fields(
    numerical_fields=numerical_fields, categorical_fields=categorical_fields
)
outputs = best_model.predict(features)
print(
    "Accuracy: ",
    metrics.accuracy_score(labels_df["is_fraud"], outputs["predicted_class"]),
)
print("F1: ", metrics.f1_score(labels_df["is_fraud"], outputs["predicted_class"]))
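Reporting F1 alongside accuracy matters when one class is rare, which is often the case for fraud labels. The following sketch computes both metrics from scratch on made-up labels to show that two predictors can share the same accuracy yet differ sharply in F1. The helper names and data are hypothetical, and the formulas are the standard binary definitions rather than anything specific to this dataset.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    # With no true or predicted positives, report 0.0 instead of dividing by zero.
    return 2 * tp / denom if denom else 0.0


# Made-up imbalanced labels: 8 legitimate transactions, 2 fraudulent.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_negative = [0] * 10                       # never flags fraud
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]      # one false alarm, one catch, one miss

print("all_negative:", accuracy(y_true, all_negative), f1(y_true, all_negative))
print("one_hit:     ", accuracy(y_true, one_hit), f1(y_true, one_hit))
```

Both predictors score 0.8 on accuracy here, but only the second one ever detects a fraudulent transaction, and F1 reflects that difference.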