Getting started

LocalAI is available as a container image and binary. You can check out all the available images with corresponding tags here.

For a step-by-step guide to setting up LocalAI, see our How to page.

The easiest way to run LocalAI is by using docker compose or with Docker (to build locally, see the build section). The following example uses docker compose:


git clone https://github.com/go-skynet/LocalAI

cd LocalAI

# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>

# copy your models to models/
cp your-model.bin models/

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker compose
docker compose up -d --pull always
# or you can build the images with:
# docker compose up -d --build

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.bin","object":"model"}]}

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "your-model.bin",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
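The completion call above can also be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_completion_request` and `send` are hypothetical and not part of LocalAI, and the base URL assumes the docker compose setup shown above.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, temperature=0.7):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send a prepared request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Same request as the curl example; nothing is sent until send(request) is called.
request = build_completion_request(
    "http://localhost:8080",
    "your-model.bin",
    "A long time ago in a galaxy far, far away",
)
print(request.get_method(), request.full_url)
```

With the server running, `send(request)` should return the parsed JSON answer. Building and sending are kept separate so the payload can be inspected without a running instance.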

Example: Use luna-ai-llama2 model with `docker compose`

# Clone LocalAI
git clone https://github.com/go-skynet/LocalAI

cd LocalAI

# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>

# Download luna-ai-llama2 to models/
# Note: use the "resolve" URL; the "blob" URL returns the HTML page instead of the model file
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2

# Use a template from the examples
cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker compose
docker compose up -d --pull always
# or you can build the images with:
# docker compose up -d --build

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "luna-ai-llama2",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'

# {"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I'm doing well, thanks. How about you?"}}]}
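To use the reply in a program rather than read it in the terminal, the response body can be parsed as JSON. This is a minimal sketch based only on the sample response shown above; `first_reply` is a hypothetical helper name, and real responses may carry additional fields.

```python
import json


def first_reply(response_body):
    """Return the message content of the first choice in a chat completion body."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Sample body copied from the example output above.
sample = '{"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I\'m doing well, thanks. How about you?"}}]}'
print(first_reply(sample))
```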

Note

If you are running on Apple Silicon (ARM), running in Docker is not recommended because of emulation. Follow the build instructions to use Metal acceleration for full GPU support.
If you are running on Apple x86_64, you can use Docker; there is no additional gain from building from source.
If you are on Windows, run docker-compose rather than docker compose, and make sure the project is in the Linux filesystem; otherwise loading models might be slow. For more information, see the Microsoft Docs.

From binaries

LocalAI binary releases are available on GitHub.

You can control LocalAI with command-line arguments, for example to specify the bind address or the number of threads.

Usage:

local-ai --models-path <model_path> [--address <address>] [--threads <num_threads>]

Parameter	Environment Variable	Default Value	Description
--f16	$F16	false	Enable f16 mode
--debug	$DEBUG	false	Enable debug mode
--cors	$CORS	false	Enable CORS support
--cors-allow-origins value	$CORS_ALLOW_ORIGINS		Specify origins allowed for CORS
--threads value	$THREADS	4	Number of threads to use for parallel computation
--models-path value	$MODELS_PATH	./models	Path to the directory containing models used for inferencing
--preload-models value	$PRELOAD_MODELS		List of models to preload in JSON format at startup
--preload-models-config value	$PRELOAD_MODELS_CONFIG		A config with a list of models to apply at startup. Specify the path to a YAML config file
--config-file value	$CONFIG_FILE		Path to the config file
--address value	$ADDRESS	:8080	Specify the bind address for the API server
--image-path value	$IMAGE_PATH		Path to the directory used to store generated images
--context-size value	$CONTEXT_SIZE	512	Default context size of the model
--upload-limit value	$UPLOAD_LIMIT	15	Default upload limit in megabytes (audio file upload)
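Every row in the table pairs a flag with an environment variable whose name is the flag in upper case with dashes turned into underscores. The sketch below applies that observed pattern to turn a set of flag values into environment variables, for instance to pass them with `docker run -e`. The function names are hypothetical, and the naming rule is inferred from the table rather than guaranteed by LocalAI.

```python
def flag_to_env(flag):
    """Derive the environment variable name the table pairs with a flag."""
    return flag.lstrip("-").replace("-", "_").upper()


def settings_to_env(settings):
    """Convert {"--threads": 4, ...} into {"THREADS": "4", ...}."""
    env = {}
    for flag, value in settings.items():
        # Booleans are written in lower case, matching the defaults in the table.
        if isinstance(value, bool):
            value = "true" if value else "false"
        env[flag_to_env(flag)] = str(value)
    return env


for name, value in settings_to_env(
    {"--threads": 4, "--context-size": 512, "--f16": True}
).items():
    print(f"-e {name}={value}")
```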

Docker

LocalAI has a set of images to support CUDA, ffmpeg and ‘vanilla’ (CPU-only). The image list is on quay:

Vanilla images tags: master, v1.18.0, …
FFmpeg images tags: master-ffmpeg, v1.18.0-ffmpeg, …
CUDA 11 tags: master-cublas-cuda11, v1.18.0-cublas-cuda11, …
CUDA 12 tags: master-cublas-cuda12, v1.18.0-cublas-cuda12, …
CUDA 11 + FFmpeg tags: master-cublas-cuda11-ffmpeg, v1.18.0-cublas-cuda11-ffmpeg, …
CUDA 12 + FFmpeg tags: master-cublas-cuda12-ffmpeg, v1.18.0-cublas-cuda12-ffmpeg, …

Example:

Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.19.2
FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-ffmpeg
CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda11-ffmpeg
CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda12-ffmpeg
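The tag lists above follow a regular pattern: a version, an optional `-cublas-cuda11` or `-cublas-cuda12` suffix, then an optional `-ffmpeg` suffix. As a rough sketch of that pattern, the hypothetical helper below composes an image reference; check the tag against the image list before relying on it, since not every combination is necessarily published for every version.

```python
def image_reference(version, cuda=None, ffmpeg=False,
                    repository="quay.io/go-skynet/local-ai"):
    """Compose an image reference following the tag pattern listed above."""
    if cuda not in (None, 11, 12):
        raise ValueError("cuda must be None, 11 or 12")
    tag = version
    if cuda is not None:
        tag += f"-cublas-cuda{cuda}"
    if ffmpeg:
        tag += "-ffmpeg"
    return f"{repository}:{tag}"


print(image_reference("v1.19.2", cuda=12, ffmpeg=True))
```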

Example of starting the API with docker:

docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4

You should see:

┌───────────────────────────────────────────────────┐
│                   Fiber v2.42.0                   │
│               http://127.0.0.1:8080               │
│       (bound on host 0.0.0.0 and port 8080)       │
│                                                   │
│ Handlers ............. 1  Processes ........... 1 │
│ Prefork ....... Disabled  PID ................. 1 │
└───────────────────────────────────────────────────┘
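When the container is started from a script, the banner is not convenient to watch for. One possible approach, sketched here with hypothetical helper names, is to poll the `/v1/models` endpoint used earlier until it answers. The attempt count and delay are arbitrary assumptions.

```python
import time
import urllib.request


def models_endpoint_ready(base_url, timeout=2):
    """Return True if GET /v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP errors (URLError is an OSError).
        return False


def wait_until_ready(probe, attempts=30, delay=1.0):
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: models_endpoint_ready("http://localhost:8080"))`. The probe is passed in as a function so the retry loop can be exercised without a running server.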

Note

The binary inside the image is pre-compiled and might not suit every CPU. It is rebuilt at the start of the container to enable CPU optimizations for the execution environment; set the environment variable REBUILD to false to prevent this behavior.

CUDA:

Requirement: nvidia-container-toolkit (installation instructions 1 2)

Run the image with --gpus all, for example:

docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16":true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' -e THREADS=1 -v $PWD/models:/models quay.io/go-skynet/local-ai:v1.23.2-cublas-cuda12

In the terminal where LocalAI was started, you should see:

5:13PM DBG Config overrides map[gpu_layers:10]
5:13PM DBG Checking "open-llama-7b-q4_0.bin" exists and matches SHA
5:13PM DBG Downloading "https://huggingface.co/SlyEcho/open_llama_7b_ggml/resolve/main/open-llama-7b-q4_0.bin"
5:13PM DBG Downloading open-llama-7b-q4_0.bin: 393.4 MiB/3.5 GiB (10.88%) ETA: 40.965550709s
5:13PM DBG Downloading open-llama-7b-q4_0.bin: 870.8 MiB/3.5 GiB (24.08%) ETA: 31.526866642s
5:13PM DBG Downloading open-llama-7b-q4_0.bin: 1.3 GiB/3.5 GiB (36.26%) ETA: 26.37351405s
5:13PM DBG Downloading open-llama-7b-q4_0.bin: 1.7 GiB/3.5 GiB (48.64%) ETA: 21.11682624s
5:13PM DBG Downloading open-llama-7b-q4_0.bin: 2.2 GiB/3.5 GiB (61.49%) ETA: 15.656029361s
5:14PM DBG Downloading open-llama-7b-q4_0.bin: 2.6 GiB/3.5 GiB (74.33%) ETA: 10.360950226s
5:14PM DBG Downloading open-llama-7b-q4_0.bin: 3.1 GiB/3.5 GiB (87.05%) ETA: 5.205663978s
5:14PM DBG Downloading open-llama-7b-q4_0.bin: 3.5 GiB/3.5 GiB (99.85%) ETA: 61.269714ms
5:14PM DBG File "open-llama-7b-q4_0.bin" downloaded and verified
5:14PM DBG Prompt template "openllama-completion" written
5:14PM DBG Prompt template "openllama-chat" written
5:14PM DBG Written config file /models/gpt-3.5-turbo.yaml

LocalAI will automatically download the OpenLLaMa model and run it on the GPU. Wait for the download to complete. To skip the automatic download, do not set the PRELOAD_MODELS variable. For models compatible with GPU support, see the model compatibility table.
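The PRELOAD_MODELS value is a JSON list, which is easy to get wrong when typed by hand inside shell quotes. A minimal sketch of generating it programmatically follows; `preload_models_value` is a hypothetical helper, and the entry fields are the ones used in the docker command above.

```python
import json
import shlex


def preload_models_value(url, name, overrides=None):
    """Serialize a single-entry PRELOAD_MODELS list as JSON."""
    entry = {"url": url, "name": name}
    if overrides:
        entry["overrides"] = overrides
    return json.dumps([entry])


value = preload_models_value(
    "github:go-skynet/model-gallery/openllama_7b.yaml",
    "gpt-3.5-turbo",
    {"f16": True, "gpu_layers": 35, "mmap": True, "batch": 512},
)
# shlex.quote wraps the value so it can be pasted after "docker run -e".
print("-e", shlex.quote(f"PRELOAD_MODELS={value}"))
```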

To test that the API is working, run the following in another terminal:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "What is an alpaca?"}],
     "temperature": 0.1
   }'

And if the GPU inferencing is working, you should be able to see something like:

5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size  =  512.00 MB
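The line to look for in this output is the one reporting how many layers were offloaded. As a rough sketch, assuming the log format shown above (it may differ between backend versions), the hypothetical helper below extracts that count from captured log text.

```python
import re

# Matches the "offloaded N/M layers to GPU" line from the sample log above.
OFFLOAD_PATTERN = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def offloaded_layers(log_text):
    """Return (offloaded, total) from the log text, or None if no such line exists."""
    match = OFFLOAD_PATTERN.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample_line = "llama_model_load_internal: offloaded 10/35 layers to GPU"
print(offloaded_layers(sample_line))
```

A result of `None` suggests the GPU path was not used, or that the log format has changed.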

Note

When enabling GPU inferencing, set the number of GPU layers to offload by adding gpu_layers: 1 to your YAML model config file, together with f16: true. You might also need to set low_vram: true if the device has low VRAM.

Run LocalAI in Kubernetes

LocalAI can be installed inside Kubernetes with helm.

Requirements:

SSD storage class, or disable mmap to load the whole model in memory

By default, the helm chart installs a LocalAI instance using the ggml-gpt4all-j model without persistent storage.

Add the helm repo:

helm repo add go-skynet https://go-skynet.github.io/helm-charts/

Install the helm chart:

helm repo update
helm install local-ai go-skynet/local-ai -f values.yaml

Note: For further configuration options, see the helm chart repository on GitHub.

Example values

Deploy a single LocalAI pod with 6GB of persistent storage serving up a ggml-gpt4all-j model with custom prompt.

### values.yaml

replicaCount: 1

deployment:
  image: quay.io/go-skynet/local-ai:latest
  env:
    threads: 4
    context_size: 512
  modelsPath: "/models"

resources:
  {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:

# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
  - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials

  # Persistent storage for models and prompt templates.
  # PVC and HostPath are mutually exclusive. If both are enabled,
  # PVC configuration takes precedence. If neither are enabled, ephemeral
  # storage is used.
  persistence:
    pvc:
      enabled: false
      size: 6Gi
      accessModes:
        - ReadWriteOnce

      annotations: {}

      # Optional
      storageClass: ~

    hostPath:
      enabled: false
      path: "/models"

service:
  type: ClusterIP
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"

ingress:
  enabled: false
  className: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nodeSelector: {}

tolerations: []

affinity: {}

Build from source

See the build section.

Other examples

For other examples of integrating with other projects, for instance for question answering or for use with chatbot-ui, see: examples.

Clients

OpenAI clients are already compatible with LocalAI by overriding the basePath, or the target URL.

Javascript

https://github.com/openai/openai-node/

import { Configuration, OpenAIApi } from 'openai';

const configuration = new Configuration({
  basePath: `http://localhost:8080/v1`
});
const openai = new OpenAIApi(configuration);

Python

https://github.com/openai/openai-python

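Set the OPENAI_API_BASE environment variable, or set the base URL in code:

import openai

# Point the client at the local LocalAI endpoint
openai.api_base = "http://localhost:8080/v1"
# Placeholder value (an assumption): the client library expects a key to be set, even for a local server
openai.api_key = "sk-local-placeholder"

# create a chat completion
chat_completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello world"}])

# print the completion
print(chat_completion.choices[0].message.content)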
