LLM Embeddings
One of the most powerful ways to model NLP tasks is to use pre-trained language model embeddings. This notebook covers how to download pre-trained models, use them to compute text embeddings, and build ML models on top of these embeddings using TurboML. We'll demonstrate this on an SMS spam classification use-case.
Getting the dataset
from river import datasets
import pandas as pd
import turboml as tb
dataset = datasets.SMSSpam()
dataset
# Convert the river dataset into lists of feature dicts and label dicts
dict_list_x = []
dict_list_y = []
for x, y in dataset:
    dict_list_x.append(x)
    dict_list_y.append({"label": float(y)})
# reset_index() adds an integer "index" column, which we use as the key field
df_features = pd.DataFrame.from_dict(dict_list_x).reset_index()
df_labels = pd.DataFrame.from_dict(dict_list_y).reset_index()
df_features
df_labels
features = tb.PandasDataset(
    dataset_name="sms_spam_features",
    key_field="index",
    dataframe=df_features,
    upload=True,
)
labels = tb.PandasDataset(
    dataset_name="sms_spam_labels", key_field="index", dataframe=df_labels, upload=True
)
model_features = features.get_input_fields(textual_fields=["body"])
model_label = labels.get_label_field(label_field="label")
Downloading pre-trained models
Huggingface Hub (https://huggingface.co/models) is one of the largest collections of pre-trained language models. It also has native integration with the GGUF format (https://huggingface.co/docs/hub/en/gguf). This format is quickly becoming the standard for saving and loading models, and popular open-source projects like llama.cpp and GPT4All use it. TurboML also uses the GGUF format to load pre-trained models. Here's how you can specify a model from Huggingface Hub, and TurboML will download and convert it into the right format.
We also support quantizing the model during conversion. The supported options are "f32", "f16", "bf16", "q8_0", and "auto", where "f32" is float32, "f16" is float16, "bf16" is bfloat16, "q8_0" is Q8_0, and "auto" picks the highest-fidelity 16-bit float type based on the first loaded tensor type. "auto" is the default.
For this notebook, we'll use the https://huggingface.co/BAAI/bge-small-en-v1.5 model, with "f16" quantization.
gguf_model = tb.acquire_hf_model_as_gguf("BAAI/bge-small-en-v1.5", "f16")
gguf_model
Once we have converted the pre-trained model, we can use it to generate embeddings. Here's how:
embedding_model = tb.LLAMAEmbedding(gguf_model_id=gguf_model)
deployed_model = embedding_model.deploy(
    "bert_embedding", input=model_features, labels=model_label
)
outputs = deployed_model.get_outputs()
# Each output record carries the embedding vector computed for one input text
embedding = outputs[0].get("record").embeddings
print(
    "Length of the embedding vector is:",
    len(embedding),
    ". The first 5 values are:",
    embedding[:5],
)
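You can also compute embeddings for ad-hoc data instead of reading the streaming outputs. The sketch below assumes the deployed model supports TurboML's batch get_inference() call on a dataframe with the same schema as the registered features; the example messages are made up.
# Hypothetical batch query (assumes get_inference() is available on deployed models,
# as in the TurboML quickstart, and accepts the same schema as the registered features)
query_df = pd.DataFrame(
    {"index": [0, 1], "body": ["Free entry in a weekly competition!", "See you at lunch?"]}
)
query_outputs = deployed_model.get_inference(query_df)
query_outputs[:2]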
But embeddings alone don't solve our use-case! We ultimately need a classification model for spam detection. We can build a pre-processor that converts all our text data into numerical embeddings, and then pass these numerical values to a classifier model.
model = tb.LlamaCppPreProcessor(base_model=tb.SGTClassifier(), gguf_model_id=gguf_model)
deployed_model = model.deploy(
    "bert_sgt_classifier", input=model_features, labels=model_label
)
outputs = deployed_model.get_outputs()
outputs[0]
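To keep an eye on classification quality as messages stream in, you can attach a windowed evaluation metric to the deployed model. This is a minimal sketch assuming the standard deployed-model evaluation API (add_metric / get_evaluation) shown in the TurboML quickstart; the metric name is illustrative.
# Register a streaming metric and read back its most recent value
# (assumes add_metric()/get_evaluation() exist on deployed models, as in the quickstart)
deployed_model.add_metric("WindowedAUC")
model_auc_scores = deployed_model.get_evaluation("WindowedAUC")
model_auc_scores[-1]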