---
jupytext:
  cell_metadata_filter: all
  formats: md:myst
  main_language: python
  notebook_metadata_filter: all
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

+++ {"lines_to_next_cell": 0}

(serve_nim_container)=

# Serve Generative AI Models with NIM

This guide demonstrates how to serve a Llama 3 8B model locally with NIM within a Flyte task.

First, instantiate NIM by importing it from the `flytekitplugins.inference` package and specifying the image name along with the necessary secrets.
The `ngc_image_secret` is required to pull the image from NGC, the `ngc_secret_key` is used to pull models
from NGC after the container is up and running, and `secrets_prefix` is the environment variable prefix to access {ref}`secrets `.

Below is a simple task that serves a Llama NIM container:

```{code-cell}
from flytekit import ImageSpec, Resources, Secret, task
from flytekit.extras.accelerators import A10G
from flytekitplugins.inference import NIM, NIMSecrets
from openai import OpenAI

# Image for the task container; it needs the inference plugin installed.
image = ImageSpec(
    name="nim",
    registry="ghcr.io/flyteorg",
    packages=["flytekitplugins-inference"],
)

# NIM sidecar definition: the model image plus the secrets needed to pull it.
nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
    ),
)


@task(
    container_image=image,
    pod_template=nim_instance.pod_template,
    accelerator=A10G,
    secret_requests=[
        Secret(
            group="ngc", key="ngc-api-key", mount_requirement=Secret.MountType.ENV_VAR
        )  # must be mounted as an env var
    ],
    requests=Resources(gpu="0"),
)
def model_serving() -> str:
    client = OpenAI(
        base_url=f"{nim_instance.base_url}/v1", api_key="nim"
    )  # api key required but ignored

    completion = client.chat.completions.create(
        model="meta/llama3-8b-instruct",
        messages=[
            {
                "role": "user",
                "content": "Write a limerick about the wonders of GPU computing.",
            }
        ],
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
    )

    return completion.choices[0].message.content
```

+++ {"lines_to_next_cell": 0}

:::{important}
Replace `ghcr.io/flyteorg` with a container registry to which you can publish.
To upload the image to the local registry in the demo cluster, indicate the registry as `localhost:30000`.
:::

The `model_serving` task starts a sidecar service that serves the model and makes it accessible on localhost via the `base_url` property.
Both the completions and chat completions endpoints can be used.

You need to mount the secret as an environment variable, because the NIM container reads it from the `NGC_API_KEY` environment variable.

By default, the NIM instantiation sets `cpu`, `gpu`, and `mem` to `1`, `1`, and `20Gi`, respectively. You can modify these settings as needed.

To serve a fine-tuned Llama model, specify the Hugging Face repo ID in `hf_repo_ids` as `[]` and the LoRA adapter memory as `lora_adapter_mem`.
Set the `NIM_PEFT_SOURCE` environment variable by including `env={"NIM_PEFT_SOURCE": "..."}`, which the example below passes to the `NIM` constructor.

Here is an example initialization for a fine-tuned Llama model:

```{code-cell}
nim_instance = NIM(
    image="nvcr.io/nim/meta/llama3-8b-instruct:1.0.0",
    secrets=NIMSecrets(
        ngc_image_secret="nvcrio-cred",
        ngc_secret_key="ngc-api-key",
        ngc_secret_group="ngc",
        secrets_prefix="_FSEC_",
        hf_token_key="hf-key",
        hf_token_group="hf",
    ),
    hf_repo_ids=[""],
    lora_adapter_mem="500Mi",
    env={"NIM_PEFT_SOURCE": "/home/nvs/loras"},
)
```

:::{note}
Native directory and NGC support for LoRA adapters is coming soon.
:::

NIM containers can be integrated into different stages of your AI workflow, including data pre-processing,
model inference, and post-processing. Flyte also allows serving multiple NIM containers simultaneously,
each with different configurations on various instances.

This integration enables you to self-host and serve optimized AI models on your own infrastructure,
ensuring full control over costs and data security.
By removing the dependence on third-party APIs for model access, you gain more control and potentially lower costs than with hosted API services.

For more detailed information, refer to the [NIM documentation by NVIDIA](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).
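
When debugging the `model_serving` task shown earlier, it can help to see what the OpenAI client sends to the sidecar. The following is a minimal standard-library sketch of the equivalent HTTP request, assuming the NIM container exposes the OpenAI-compatible `/v1/chat/completions` route under `base_url`. The helper name `build_chat_request` and the `http://localhost:8000` address are hypothetical and used only for illustration; inside a task, use `nim_instance.base_url` instead.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for an OpenAI-compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Same sampling settings as the task in this guide.
        "temperature": 0.5,
        "top_p": 1,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The key is required by the client but ignored by the server.
            "Authorization": "Bearer nim",
        },
        method="POST",
    )


# Hypothetical address for illustration only.
request = build_chat_request(
    "http://localhost:8000",
    "meta/llama3-8b-instruct",
    "Write a limerick about the wonders of GPU computing.",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` from inside the task would send it to the sidecar. With an OpenAI-compatible response, the generated text should sit under `choices[0].message.content` in the returned JSON, which is the same field the task reads through the client.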