TurboML LLM Tutorial
TurboML can spin up LLM servers with an OpenAI-compatible API. We currently support models in the GGUF format, as well as non-GGUF models that can be converted to GGUF; in the latter case, you choose the quantization type to use.
import pandas as pd
import turboml as tb
LlamaServerRequest = tb.llm.LlamaServerRequest
HuggingFaceSpec = LlamaServerRequest.HuggingFaceSpec
ServerParams = LlamaServerRequest.ServerParams
Choose a model
Let's use a Llama 3.2 quant already in the GGUF format.
hf_spec = HuggingFaceSpec(
    hf_repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
    select_gguf_file="llama-3.2-3b-instruct-q8_0.gguf",
)
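If you are unsure which file name to pass as select_gguf_file, you can inspect the repository first. The sketch below is optional and uses the huggingface_hub package (not part of TurboML) to list the .gguf files available in the repo above.
from huggingface_hub import list_repo_files

# List the repository's files and keep only the GGUF quantizations.
gguf_files = [
    f
    for f in list_repo_files("hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF")
    if f.endswith(".gguf")
]
print(gguf_files)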
Spawn a server
Spawning a server gives you a server_id to reference it later, as well as a server_relative_url you can use to reach it. This method is synchronous, so it can take a while to return while we retrieve (and convert) your model.
response = tb.llm.spawn_llm_server(
    LlamaServerRequest(
        source_type=LlamaServerRequest.SourceType.HUGGINGFACE,
        hf_spec=hf_spec,
        server_params=ServerParams(
            threads=-1,  # -1 lets the backend pick the thread count
            seed=-1,  # -1 uses a random seed
            context_size=0,  # 0 uses the model's default context size
            flash_attention=False,
        ),
    )
)
response
server_id = response.server_id
Interacting with the LLM
Our LLM is exposed with an OpenAI-compatible API, so we can use the OpenAI SDK, or any other compatible tool, to interact with it.
%pip install openai
from openai import OpenAI
base_url = tb.common.env.CONFIG.TURBOML_BACKEND_SERVER_ADDRESS
server_url = f"{base_url}/{response.server_relative_url}"
client = OpenAI(base_url=server_url, api_key="-")
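As a quick connectivity check before sending a chat request, you can ask the server which models it reports. This is a sketch and assumes the server exposes the standard OpenAI models endpoint.
# Sanity check: the server should report at least one model.
for model in client.models.list():
    print(model.id)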
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Hello there how are you doing today?",
        }
    ],
    model="-",
)
print(response)
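Since the API is OpenAI-compatible, streaming should also work with the same client, assuming the underlying server supports streamed responses. A minimal sketch:
# Stream the completion token by token instead of waiting for the full reply.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Write a haiku about streams."}],
    model="-",
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()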
Stop the server
tb.llm.stop_llm_server(server_id)
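In scripts, you may want to guarantee the server is released even if an intermediate step fails. A minimal sketch of that pattern, using the same stop_llm_server call from this tutorial:
try:
    # ... interact with the server (e.g. the chat completion above) ...
    pass
finally:
    # Always release the server, even if the interaction above raises.
    tb.llm.stop_llm_server(server_id)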