---
jupytext:
  cell_metadata_filter: all
  formats: md:myst
  main_language: python
  notebook_metadata_filter: all
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

+++ {"lines_to_next_cell": 0}

(serve_llm)=

# Serve LLMs with Ollama

In this guide, you'll learn how to locally serve Gemma2 and fine-tuned Llama3 models using Ollama within a Flyte task.

Start by importing Ollama from the `flytekitplugins.inference` package and specifying the desired model name.

Below is a straightforward example of serving a Gemma2 model:

```{code-cell}
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import Model, Ollama
from openai import OpenAI

image = ImageSpec(
    name="ollama_serve",
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-inference"],
    builder="default",
)

ollama_instance = Ollama(model=Model(name="gemma2"), gpu="1")


@task(
    container_image=image,
    pod_template=ollama_instance.pod_template,
    accelerator=A10G,
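    # The task container itself requests no GPU; the Ollama sidecar is given one through gpu="1" above.
    requests=Resources(gpu="0"),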
)
def model_serving(user_prompt: str) -> str:
    client = OpenAI(base_url=f"{ollama_instance.base_url}/v1", api_key="ollama")  # api key required but ignored

    completion = client.chat.completions.create(
        model="gemma2",
        messages=[
            {
                "role": "user",
                "content": user_prompt,
            }
        ],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
    )

    return completion.choices[0].message.content
```

+++ {"lines_to_next_cell": 0}

:::{important}
Replace `ghcr.io/flyteorg` with a container registry to which you can publish.
To upload the image to the local registry in the demo cluster, indicate the registry as `localhost:30000`.
:::

The `model_serving` task starts a sidecar service that serves the model and makes it reachable on localhost through the `base_url` property.
You can call either the chat or the chat completions endpoint.
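
For reference, the `client.chat.completions.create` call in the task above corresponds to a plain HTTP POST against the OpenAI-compatible chat completions route. The sketch below builds such a request with the standard library only. The helper name `build_chat_request` is hypothetical, and the URL `http://localhost:11434` is an assumption based on Ollama's default port; inside a task, use `ollama_instance.base_url` instead.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completions request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "temperature": 0.5,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # The API key is required by OpenAI-style clients but ignored by Ollama.
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
        method="POST",
    )


# Assumed base URL for illustration only.
request = build_chat_request("http://localhost:11434", "gemma2", "Why is the sky blue?")
print(request.full_url)
```

Nothing is sent here. Inside the task, passing `request` to `urllib.request.urlopen` would deliver it to the sidecar.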

By default, Ollama initializes the server with `cpu`, `gpu`, and `mem` set to `1`, `1`, and `15Gi`, respectively.
You can adjust these settings to meet your requirements.
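
As a minimal sketch of overriding those defaults, the snippet below merges hypothetical overrides into the default values quoted above. The override values are illustrative assumptions. The final `Ollama(...)` call is left as a comment because it needs the `flytekitplugins-inference` package, and it assumes that `cpu`, `gpu`, and `mem` are accepted as keyword arguments, as the `gpu="1"` argument in the earlier example suggests.

```python
# Defaults as stated in this guide.
default_resources = {"cpu": 1, "gpu": 1, "mem": "15Gi"}

# Hypothetical overrides for a larger model; adjust to your cluster.
overrides = {"cpu": 2, "mem": "24Gi"}

# Later keys win, so overrides replace the matching defaults.
ollama_resources = {**default_resources, **overrides}
print(ollama_resources)

# With the plugin installed, the settings would be passed like this:
# ollama_instance = Ollama(model=Model(name="gemma2"), **ollama_resources)
```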

To serve a fine-tuned model, provide the model configuration as `modelfile` within the `Model` dataclass.

Below is an example of specifying a fine-tuned LoRA adapter for a Llama3 Mario model:

```{code-cell}
:lines_to_next_cell: 2

from flytekit.types.file import FlyteFile

finetuned_ollama_instance = Ollama(
    model=Model(
        name="llama3-mario",
        modelfile="FROM llama3\nADAPTER {inputs.ggml}\nPARAMETER temperature 1\nPARAMETER num_ctx 4096\nSYSTEM {inputs.system_prompt}",
    ),
    gpu="1",
)


@task(
    container_image=image,
    pod_template=finetuned_ollama_instance.pod_template,
    accelerator=A10G,
    requests=Resources(gpu="0"),
)
def finetuned_model_serving(ggml: FlyteFile, system_prompt: str):
    ...
```

The `{inputs.ggml}` and `{inputs.system_prompt}` placeholders are resolved at run time from the task's inputs, so `ggml` and `system_prompt` must be declared as inputs to the task.
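
To make the placeholder mechanics concrete, the sketch below assembles the same modelfile line by line and fills the placeholders with Python's `str.format`. This only illustrates the substitution idea; how the plugin resolves the placeholders internally is not shown here, and the adapter path and system prompt are made-up sample values.

```python
from types import SimpleNamespace

# Same directives as the modelfile above, one per line for readability.
modelfile = "\n".join(
    [
        "FROM llama3",
        "ADAPTER {inputs.ggml}",
        "PARAMETER temperature 1",
        "PARAMETER num_ctx 4096",
        "SYSTEM {inputs.system_prompt}",
    ]
)

# Hypothetical values standing in for the task inputs.
sample_inputs = SimpleNamespace(
    ggml="/tmp/mario-adapter.bin",
    system_prompt="You are Mario from Super Mario Bros.",
)

# str.format supports attribute lookups such as {inputs.ggml}.
rendered = modelfile.format(inputs=sample_inputs)
print(rendered)
```

Building the modelfile from a list of directives keeps each line easy to review, and the joined string is identical to the single-line version passed to `Model` above.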

Ollama models can be integrated into different stages of your AI workflow, including data pre-processing,
model inference, and post-processing. Flyte also lets you serve multiple Ollama models at the same time
on separate instances.

This integration lets you self-host and serve AI models on your own infrastructure,
giving you control over costs and data security.

For details on the models that Ollama supports natively, see the [Ollama models library](https://ollama.com/library).