
TurboML LLM Tutorial


TurboML can spin up LLM servers with an OpenAI-compatible API. We currently support models in the GGUF format, as well as non-GGUF models that can be converted to GGUF; in the latter case, you choose the quantization type to use.

import turboml as tb
LlamaServerRequest = tb.llm.LlamaServerRequest
HuggingFaceSpec = LlamaServerRequest.HuggingFaceSpec
ServerParams = LlamaServerRequest.ServerParams

Choose a model

Let's use a Llama 3.2 quant already in the GGUF format.

hf_spec = HuggingFaceSpec(
    hf_repo_id="hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
    select_gguf_file="llama-3.2-1b-instruct-q4_k_m.gguf",
)
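If your model isn't already in GGUF format, you would instead point the spec at the original repo and pick the quantization type used for conversion. Below is a minimal sketch of what that could look like; the gguf_quantization_type field name and the repo shown here are assumptions for illustration, so check the HuggingFaceSpec definition for the exact field names.

# Hypothetical: a non-GGUF repo that TurboML would convert to GGUF on our behalf.
# NOTE: the quantization field name below is an assumption, not confirmed API.
non_gguf_spec = HuggingFaceSpec(
    hf_repo_id="meta-llama/Llama-3.2-1B-Instruct",  # original (non-GGUF) weights
    gguf_quantization_type="Q4_K_M",  # assumed field: quantization applied during conversion
)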

Spawn a server

Spawning a server returns a server_id you can use to reference it later, as well as a server_relative_url you can use to reach it. This method is synchronous, so it can take a while to return while we retrieve (and, if necessary, convert) your model.

response = tb.llm.spawn_llm_server(
    LlamaServerRequest(
        source_type=LlamaServerRequest.SourceType.HUGGINGFACE,
        hf_spec=hf_spec,
        server_params=ServerParams(
            threads=-1,  # -1: let the server choose the thread count
            seed=-1,  # -1: use a random seed
            context_size=0,  # 0: use the model's default context size
            flash_attention=False,
        ),
    )
)
response
server_id = response.server_id

Interacting with the LLM

Our LLM is exposed through an OpenAI-compatible API, so we can interact with it using the OpenAI SDK or any other compatible tool.

%pip install openai
from openai import OpenAI
 
base_url = tb.common.env.CONFIG.TURBOML_BACKEND_SERVER_ADDRESS
server_url = f"{base_url}/{response.server_relative_url}"
 
client = OpenAI(base_url=server_url, api_key="-")  # dummy key; the OpenAI SDK requires a non-empty api_key
 
 
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Hello there how are you doing today?",
        }
    ],
    model="-",
)
 
print(response)
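
# Streaming also works through the standard OpenAI SDK pattern
# (assuming the underlying server supports it, as llama.cpp-based servers typically do).
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    model="-",
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()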
embeddings = (
    client.embeddings.create(input=["Hello there how are you doing today?"], model="-")
    .data[0]
    .embedding
)
len(embeddings), embeddings[:5]
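
Embeddings are most useful when compared against each other. As a quick sanity check, here is a minimal sketch (assuming NumPy is installed) that embeds two sentences with the same client and computes their cosine similarity:

import numpy as np

texts = ["Hello there how are you doing today?", "Hi, how is your day going?"]
vecs = [d.embedding for d in client.embeddings.create(input=texts, model="-").data]

a, b = np.array(vecs[0]), np.array(vecs[1])
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")  # closer to 1.0 means more similar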

Stop the server

When you're done, stop the server to free up its resources.

tb.llm.stop_llm_server(server_id)