TurboML LLM Tutorial
TurboML can spin up LLM servers with an OpenAI-compatible API. We currently support models in the GGUF format, as well as non-GGUF models that can be converted to GGUF; in the latter case, you choose the quantization type used during conversion.
import turboml as tb

# Convenience aliases for the request types used throughout this tutorial
LlamaServerRequest = tb.llm.LlamaServerRequest
HuggingFaceSpec = LlamaServerRequest.HuggingFaceSpec
ServerParams = LlamaServerRequest.ServerParams
Choose a model
Let's use a quantized Llama 3.2 model that is already in the GGUF format.
hf_spec = HuggingFaceSpec(
    hf_repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
    select_gguf_file="llama-3.2-3b-instruct-q8_0.gguf",
)
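If you instead point at a repository that only has non-GGUF weights, TurboML converts the model to GGUF for you and you pick the quantization type, as noted in the introduction. The sketch below illustrates the idea; the repository is just an example, and the quantization field name is a hypothetical placeholder, so check the HuggingFaceSpec fields in your TurboML version before using it.

# Hypothetical sketch of a non-GGUF source: no select_gguf_file, since the
# repository holds raw weights that TurboML converts to GGUF for you.
# The quantization parameter name below is assumed, not confirmed here.
non_gguf_spec = HuggingFaceSpec(
    hf_repo_id="Qwen/Qwen2.5-0.5B-Instruct",  # example non-GGUF repository
    gguf_quantization_type="Q4_K_M",  # hypothetical field; verify in your SDK
)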
Spawn a server
On spawning a server, you get a server_id to reference it later, as well as a server_relative_url you can use to reach it. This method is synchronous, so it can take a while to return while we retrieve (and convert) your model.
response = tb.llm.spawn_llm_server(
    LlamaServerRequest(
        source_type=LlamaServerRequest.SourceType.HUGGINGFACE,
        hf_spec=hf_spec,
        server_params=ServerParams(
            threads=-1,  # -1: use the default thread count
            seed=-1,  # -1: pick a random seed
            context_size=0,  # 0: use the model's default context size
            flash_attention=False,
        ),
    )
)
response
server_id = response.server_id
Interacting with the LLM
Our LLM is exposed through an OpenAI-compatible API, so we can use the OpenAI SDK, or any other compatible tool, to interact with it.
%pip install openai
from openai import OpenAI
base_url = tb.common.env.CONFIG.EXTERNAL_ENDPOINT_TURBOML_BACKEND
server_url = f"http://{base_url}/{response.server_relative_url}"
client = OpenAI(base_url=server_url, api_key="-")  # api_key is a placeholder; requests go to the TurboML-hosted server
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Hello there how are you doing today?",
        }
    ],
    model="-",
)
print(response)
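The full response object includes metadata and usage information; the assistant's reply text itself lives on the first choice. A minimal way to pull it out with the OpenAI SDK:

# Print only the assistant's reply text
print(response.choices[0].message.content)

If the underlying server supports streamed responses (llama.cpp's OpenAI-compatible endpoint typically does; treat this as an assumption for your deployment), you can also stream tokens as they are generated:

# Stream the reply; each chunk carries an incremental delta of the message
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    model="-",
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()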
Stop the server
tb.llm.stop_llm_server(server_id)