
TurboML LLM Tutorial

TurboML can spin up LLM servers with an OpenAI-compatible API. We currently support models in the GGUF format, as well as non-GGUF models that can be converted to GGUF; in the latter case, you choose the quantization type to use for the conversion.

    import turboml as tb
    
    # Aliases for the request types used throughout this tutorial
    LlamaServerRequest = tb.llm.LlamaServerRequest
    HuggingFaceSpec = LlamaServerRequest.HuggingFaceSpec
    ServerParams = LlamaServerRequest.ServerParams

Choose a model

Let's use a Llama 3.2 quant already in the GGUF format.

    hf_spec = HuggingFaceSpec(
        hf_repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
        select_gguf_file="llama-3.2-3b-instruct-q8_0.gguf",
    )
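
If you start from a non-GGUF model instead, point the spec at the original repository and TurboML will convert it to GGUF for you. The snippet below is only a sketch: the repository is illustrative, and the field for selecting the quantization type is an assumption (the actual parameter name may differ), so check the HuggingFaceSpec definition before relying on it.

    # Hypothetical sketch of the non-GGUF case: this repository hosts
    # safetensors weights, so TurboML would convert them to GGUF.
    non_gguf_spec = HuggingFaceSpec(
        hf_repo_id="meta-llama/Llama-3.2-3B-Instruct",  # illustrative repo
        # Assumption: a field along these lines would pick the quantization
        # type used during conversion; the real parameter name may differ.
        # quantization_type="Q8_0",
    )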

Spawn a server

Spawning a server gives you a server_id to reference it by later, as well as a server_relative_url you can use to reach it. This method is synchronous, so it can take a while to return while we retrieve (and, if necessary, convert) your model.

    response = tb.llm.spawn_llm_server(
        LlamaServerRequest(
            source_type=LlamaServerRequest.SourceType.HUGGINGFACE,
            hf_spec=hf_spec,
            server_params=ServerParams(
                threads=-1,  # -1: use all available threads
                seed=-1,  # -1: random seed
                context_size=0,  # 0: fall back to the model's default context size
                flash_attention=False,
            ),
        )
    )
    response
    server_id = response.server_id

Interacting with the LLM

Our LLM is exposed through an OpenAI-compatible API, so we can interact with it using the OpenAI SDK or any other compatible tool (see the plain-HTTP sketch after the SDK example below).

    %pip install openai
    from openai import OpenAI
    
    base_url = tb.common.env.CONFIG.EXTERNAL_ENDPOINT_TURBOML_BACKEND
    server_url = f"http://{base_url}/{response.server_relative_url}"
    
    # The SDK requires an API key and a model name, but this server does not
    # validate either, so placeholder values work.
    client = OpenAI(base_url=server_url, api_key="-")
    
    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": "Hello there how are you doing today?",
            }
        ],
        model="-",
    )
    
    print(response)
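
Because the endpoint is OpenAI-compatible, the SDK isn't required. Below is a minimal sketch using plain HTTP with the requests library, assuming the server exposes the same chat/completions route the OpenAI client posts to under the hood.

    import requests
    
    # Post directly to the chat completions route; the model field is a
    # placeholder, matching the SDK example above.
    resp = requests.post(
        f"{server_url}/chat/completions",
        json={
            "model": "-",
            "messages": [{"role": "user", "content": "Tell me a one-line joke."}],
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])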

Stop the server

    tb.llm.stop_llm_server(server_id)