LLM Embeddings

One of the most important ways to model NLP tasks is to use pre-trained language model embeddings. This notebook covers how to download pre-trained models, use them to generate text embeddings, and build ML models on top of these embeddings using TurboML. We'll demonstrate this on an SMS spam classification use-case.

Getting the dataset

    from river import datasets
    import pandas as pd
    import turboml as tb

    # Stream the SMS Spam dataset from river.
    dataset = datasets.SMSSpam()
    dataset

    # Collect the streamed samples into feature and label records.
    dict_list_x = []
    dict_list_y = []
    for x, y in dataset:
        dict_list_x.append(x)
        dict_list_y.append({"label": float(y)})

    # Build DataFrames; reset_index() adds an "index" column that we use as the key field.
    df_features = pd.DataFrame.from_dict(dict_list_x).reset_index()
    df_labels = pd.DataFrame.from_dict(dict_list_y).reset_index()
    df_features
    df_labels

    # Upload features and labels to TurboML as datasets keyed on "index".
    features = tb.PandasDataset(
        dataset_name="sms_spam_features",
        key_field="index",
        dataframe=df_features,
        upload=True,
    )
    labels = tb.PandasDataset(
        dataset_name="sms_spam_labels", key_field="index", dataframe=df_labels, upload=True
    )

    # Select the textual input field and the label field used for training.
    model_features = features.get_input_fields(textual_fields=["body"])
    model_label = labels.get_label_field(label_field="label")
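
If you want to sanity-check the raw schema the snippet above relies on, each sample from river's SMSSpam stream is a (features, label) pair in which the features dict holds a single textual "body" field and the label is a boolean spam flag. A minimal peek, just for illustration:

    # Each sample is ({"body": "<message text>"}, bool), where True marks a spam message.
    x, y = next(iter(datasets.SMSSpam()))
    print(x["body"][:80], "| spam:", y)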

Downloading pre-trained models

Huggingface Hub (https://huggingface.co/models) is one of the largest collections of pre-trained language models. It also has native integration with the GGUF format (https://huggingface.co/docs/hub/en/gguf). This format is quickly becoming the standard for saving and loading models, and popular open-source projects like llama.cpp and GPT4All use it. TurboML also uses the GGUF format to load pre-trained models. Here's how you can specify a model from Huggingface Hub, and TurboML will download and convert it into the right format.

We also support quantization of the model during conversion. The supported options are "f32", "f16", "bf16", "q8_0", and "auto", where "f32" is float32, "f16" is float16, "bf16" is bfloat16, "q8_0" is Q8_0 (8-bit quantization), and "auto" selects the highest-fidelity 16-bit float type based on the type of the first loaded tensor. "auto" is the default option.
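
For example, if disk footprint matters more than fidelity, you could request 8-bit quantization instead. This is just a sketch; only the quantization argument differs from the call we actually use below:

    # Hypothetical alternative: same model, converted with Q8_0 quantization.
    gguf_model_q8 = tb.acquire_hf_model_as_gguf("BAAI/bge-small-en-v1.5", "q8_0")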

For this notebook, we'll use the https://huggingface.co/BAAI/bge-small-en-v1.5 model, with "f16" quantization.

    # Download the model from Huggingface Hub and convert it to GGUF with "f16" quantization.
    gguf_model = tb.acquire_hf_model_as_gguf("BAAI/bge-small-en-v1.5", "f16")
    gguf_model

Once we have converted the pre-trained model, we can use it to generate embeddings. Here's how:

    # Deploy an embedding model backed by the converted GGUF model.
    embedding_model = tb.LLAMAEmbedding(gguf_model_id=gguf_model)
    deployed_model = embedding_model.deploy(
        "bert_embedding", input=model_features, labels=model_label
    )

    # Fetch the streaming outputs and inspect the embedding of the first record.
    outputs = deployed_model.get_outputs()
    embedding = outputs[0].get("record").embeddings
    print(
        "Length of the embedding vector is:",
        len(embedding),
        ". The first 5 values are:",
        embedding[:5],
    )
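
As a quick illustration of what these vectors can be used for downstream, here is a small sketch, plain numpy and nothing TurboML-specific, that assumes outputs contains at least two records and compares two messages by cosine similarity:

    import numpy as np

    # Pull the embedding vectors of the first two processed messages.
    vec_a = np.array(outputs[0].get("record").embeddings)
    vec_b = np.array(outputs[1].get("record").embeddings)

    # A cosine similarity close to 1.0 means the two messages point in a similar
    # direction in embedding space, i.e. they are semantically similar.
    similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    print("Cosine similarity between the first two messages:", similarity)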

But embeddings alone don't solve our use-case! We ultimately need a classification model for spam detection. We can build a pre-processor that converts all our text data into numerical embeddings, and then pass these numerical values to a classifier model.

    # Chain a llama.cpp-based embedding pre-processor with an SGT classifier.
    model = tb.LlamaCppPreProcessor(base_model=tb.SGTClassifier(), gguf_model_id=gguf_model)
    deployed_model = model.deploy(
        "bert_sgt_classifier", input=model_features, labels=model_label
    )

    # Inspect the first classification output.
    outputs = deployed_model.get_outputs()
    outputs[0]