LocalAI is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that is compatible with the OpenAI API specifications for local inferencing. It allows you to run LLMs, generate images, audio (and more) locally or on-prem with consumer-grade hardware, supporting multiple model families that are compatible with the ggml format. It does not require a GPU. It is maintained by mudler.
In a nutshell:
Local, OpenAI drop-in alternative REST API. You own your data.
NO GPU required. NO Internet access is required either.
Optional GPU acceleration is available for llama.cpp-compatible LLMs. See also the build section.
Supports multiple models
Once loaded the first time, it keeps models loaded in memory for faster inference.
Doesn't shell out, but uses C++ bindings for faster inference and better performance.
LocalAI was created by Ettore Di Giacinto and is a community-driven project, focused on making AI accessible to anyone. Any contribution, feedback and PR is welcome!
Note that this started just as a fun weekend project in order to try to create the necessary pieces for a full AI assistant like ChatGPT: the community is growing fast and we are working hard to make it better and more stable. If you want to help, please consider contributing (see below)!
LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using both the CPU and, if desired, the GPU. Internally, LocalAI backends are just gRPC servers: you can specify and build your own gRPC server and extend LocalAI at runtime as well. It is possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.
Hacker news post - help us out by voting if you like this project.
If you have technological skills and want to contribute to development, have a look at the open issues. If you are new, you can look at the good-first-issue and help-wanted labels.
If you don't have technological skills you can still help by improving the documentation, adding examples, or sharing your user stories with our community; any help and contribution is welcome!
As with many typical open source projects, I, mudler, was fiddling around with llama.cpp over my long nights and wanted a way to call it from Go, as I am a Golang developer and use it extensively. So I created LocalAI (initially known as llama-cli) and added an API to it.
But guess what? The more I dived into this rabbit hole, the more I realized that I had stumbled upon something big. With all the fantastic C++ projects floating around the community, it dawned on me that I could piece them together to create a full-fledged OpenAI replacement. So, ta-da! LocalAI was born, and it quickly overshadowed its humble origins.
Now, why did I choose to go with C++ bindings, you ask? Well, I wanted to keep LocalAI snappy and lightweight, allowing it to run like a champ on any system, avoid any Golang GC penalties, and, most importantly, build on the shoulders of giants like llama.cpp. Go is good at backends and APIs and is easy to maintain. And hey, don't forget that I'm all about sharing the love. That's why I made LocalAI MIT licensed, so everyone can hop on board and benefit from it.
As if that wasn’t exciting enough, as the project gained traction, mkellerman and Aisuko jumped in to lend a hand. mkellerman helped set up some killer examples, while Aisuko is becoming our community maestro. The community now is growing even more with new contributors and users, and I couldn’t be happier about it!
Oh, and let's not forget the real MVP here: llama.cpp. Without this extraordinary piece of software, LocalAI wouldn't even exist. So, a big shoutout to the community for making this magic happen!
Contributors
This is a community project, a special thanks to our contributors!
Subsections of LocalAI
Getting started
LocalAI is available as a container image and binary. You can check out all the available images with corresponding tags here.
For a step-by-step guide to setting up LocalAI, please see our How-to page.
The easiest way to run LocalAI is by using docker compose or with Docker (to build locally, see the build section). The following example uses docker compose:
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>

# copy your models to models/
cp your-model.bin models/

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker compose
docker compose up -d --pull always

# or you can build the images with:
# docker compose up -d --build

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"your-model.bin","object":"model"}]}

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Example: Use luna-ai-llama2 model with docker compose
# Clone LocalAI
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
# (optional) Checkout a specific LocalAI tag
# git checkout -b build <TAG>

# Download luna-ai-llama2 to models/
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2

# Use a template from the examples
cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl

# (optional) Edit the .env file to set things like context size and threads
# vim .env

# start with docker compose
docker compose up -d --pull always

# or you can build the images with:
# docker compose up -d --build

# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
# {"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "luna-ai-llama2",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
# {"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I'm doing well, thanks. How about you?"}}]}
Note
If you are running on Apple Silicon (ARM), running in Docker is not recommended due to emulation. Follow the build instructions and use Metal acceleration for full GPU support.
If you are running on an Intel (x86_64) Mac, you can use Docker; there is no additional gain in building from source.
If you are on Windows, please run docker-compose (not docker compose) and make sure the project is in the Linux filesystem, otherwise loading models might be slow. For more info, see the Microsoft Docs.
Note: the binary inside the image is pre-compiled and might not suit all CPUs. The container can rebuild LocalAI at start to enable CPU optimizations for the execution environment; you can set the environment variable REBUILD to false to prevent this behavior.
LocalAI will automatically download the OpenLLaMA model and run with GPU acceleration. Wait for the download to complete. You can also avoid the automatic download of the model by not specifying a PRELOAD_MODELS variable. For models compatible with GPU support, see the model compatibility table.
To test that the API is working run in another terminal:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "What is an alpaca?"}],
"temperature": 0.1
}'
And if the GPU inferencing is working, you should be able to see something like:
5:22PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla T4
llama.cpp: loading model from /models/open-llama-7b-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 1024
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 4321.77 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 10 repeating layers to GPU
llama_model_load_internal: offloaded 10/35 layers to GPU
llama_model_load_internal: total VRAM used: 1598 MB
...................................................................................................
llama_init_from_file: kv self size = 512.00 MB
Note
When enabling GPU inferencing, set the number of GPU layers to offload with gpu_layers: 1 in your YAML model config file, and set f16: true. You might also need to set low_vram: true if the device has low VRAM.
Run LocalAI in Kubernetes
LocalAI can be installed inside Kubernetes with helm.
Requirements:
SSD storage class, or disable mmap to load the whole model in memory
By default, the helm chart will install a LocalAI instance using the ggml-gpt4all-j model without persistent storage.
Deploy a single LocalAI pod with 6GB of persistent storage serving up a ggml-gpt4all-j model with custom prompt.
### values.yaml
replicaCount: 1
deployment:
  image: quay.io/go-skynet/local-ai:latest
  env:
    threads: 4
    context_size: 512
  modelsPath: "/models"
resources:
  {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi
# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:
# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false
  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
    - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials
# Persistent storage for models and prompt templates.
# PVC and HostPath are mutually exclusive. If both are enabled,
# PVC configuration takes precedence. If neither are enabled, ephemeral
# storage is used.
persistence:
  pvc:
    enabled: false
    size: 6Gi
    accessModes:
      - ReadWriteOnce
    annotations: {}
    # Optional
    storageClass: ~
  hostPath:
    enabled: false
    path: "/models"
service:
  type: ClusterIP
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"
ingress:
  enabled: false
  className: ""
  annotations: {}
  # kubernetes.io/ingress.class: nginx
  # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  # - secretName: chart-example-tls
  #   hosts:
  #     - chart-example.local
nodeSelector: {}
tolerations: []
affinity: {}
Set the OPENAI_API_BASE environment variable, or set the base URL in code:
import openai

openai.api_base = "http://localhost:8080/v1"

# create a chat completion
chat_completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello world"}]
)

# print the completion
print(chat_completion.choices[0].message.content)
What's New
26-08-2023: v1.25.0
Hey everyone, Ettore here. I'm so happy to share this release: apparently not even a hot summer can stop LocalAI development :)
This release brings a lot of new features, bugfixes and updates! Also a big shout out to the community, this was a great release!
Attention
From this release the llama backend supports only gguf files (see #943). LocalAI however still supports ggml files. We ship a version of llama.cpp from before that change in a separate backend, named llama-stable, to still allow loading ggml files. If you were specifying the llama backend manually to load ggml files, from this release you should use llama-stable instead, or not specify a backend at all (LocalAI will handle this automatically).
Image generation enhancements
The Diffusers backend now has various enhancements, including support for generating images from images, longer prompts, and more kernel schedulers. See the Diffusers documentation for more information.
Lora adapters
Now it's possible to load lora adapters for llama.cpp. See #955 for more information.
Device management
It is now possible for single devices with one GPU to specify --single-active-backend to allow only one backend to be active at a time (#925).
Community spotlight
Resources management
Thanks to continuous community efforts (another cool contribution from dave-gray101), it's now possible to shut down a backend programmatically via the API.
There is an ongoing effort in the community to better handle resources. See also the Roadmap.
New how-to section
Thanks to community efforts we now have a new how-to section with various examples on how to use LocalAI. This is a great starting point for new users! We are currently working on improving it; a huge shout out to lunamidori5 from the community for the impressive efforts on this!
More examples!
Open source autopilot? See the new addition by gruberdev in our examples on how to use Continue with LocalAI!
feat: pre-configure LocalAI galleries by mudler in #886
Bark
Bark is a text-prompted generative audio model: it combines GPT-style techniques to generate audio from text. It is a great addition to LocalAI, and it's available in the container images by default.
It can also generate music, see the example: lion.webm
AutoGPTQ
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
It is targeted mainly at GPU usage. Check out the AutoGPTQ documentation for usage.
Exllama
Exllama is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". It is a faster alternative for running LLaMA models on GPU. Check out the Exllama documentation for usage.
Diffusers
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. It is currently experimental and supports only image generation, so you might encounter some issues with models that weren't tested yet. Check out the Diffusers documentation for usage.
API Keys
Thanks to community contributions, it's now possible to specify a list of API keys that can be used to gate API requests.
API Keys can be specified with the API_KEY environment variable as a comma-separated list of keys.
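A minimal sketch, assuming the configured keys are then passed by clients as a standard OpenAI-style bearer token (the key value is illustrative):

# Start LocalAI with one or more comma-separated keys
API_KEY=my-secret-key ./local-ai

# Clients then send the key in the Authorization header
curl http://localhost:8080/v1/models -H "Authorization: Bearer my-secret-key"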
Galleries
The model-gallery repositories are now configured by default in the container images.
New project
LocalAGI is a simple agent that uses LocalAI functions to have a full locally runnable assistant (with no API keys needed).
See it here in action planning a trip for San Francisco!
feat(llama2): add template for chat messages by dave-gray101 in #782
Note
From this release, to use the OpenAI functions you need to use the llama-grammar backend. A llama backend has been added for tracking llama.cpp master, and a llama-grammar backend for the grammar functionalities that have not yet been merged upstream. See also OpenAI functions. Until the feature is merged we will have two llama backends.
Huggingface embeddings
In this release it is now possible to specify external gRPC backends that LocalAI can use for inferencing (#778). It is now possible to write backends in any language, and a huggingface-embeddings backend is now available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also Embeddings.
LLaMa 2 has been released!
Thanks to the community effort, LocalAI now supports templating for LLaMa 2! More at #782, until we update the model gallery with LLaMa 2 models!
The former, ggml-based backend has been renamed to falcon-ggml.
Default pre-compiled binaries
From this release the default behavior of the images has changed. Compilation is not triggered on start automatically; to recompile local-ai from scratch on start and switch back to the old behavior, set REBUILD=true in the environment variables. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the build section for more information.
Add Text-to-Audio generation with go-piper by mudler in #649
See API endpoints in our documentation.
Add gallery repository by mudler in #663. See models for documentation.
Container images
Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.20.0
FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-ffmpeg
CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-cublas-cuda11-ffmpeg
CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.20.0-cublas-cuda12-ffmpeg
Updates
Updates to llama.cpp, go-transformers, gpt4all.cpp and rwkv.cpp.
The NUMA option was enabled by mudler in #684, along with many new parameters (mmap, mmlock, ...). See advanced for the full list of parameters.
Gallery repositories
In this release there is support for gallery repositories. These are repositories that contain models and can be used to install models. The default gallery, which contains only freely licensed models, is on GitHub: https://github.com/go-skynet/model-gallery, but you can use your own gallery by setting the GALLERIES environment variable. An automatic index of Hugging Face models is available as well.
For example, now you can start LocalAI with the following environment variable to use both galleries:
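For instance, if you are running with docker compose, the variable can be set in the .env file; the value below is the same one used in the Easy Setup section later in this document:

GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]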
Now LocalAI uses piper and go-piper to generate audio from text. This is an experimental feature, and it requires GO_TAGS=tts to be set during build. It is enabled by default in the pre-built container images.
Full CUDA GPU offload support ( PR by mudler. Thanks to chnyda for handing over the GPU access, and lu-zero to help in debugging )
Full GPU Metal support is now fully functional. Thanks to Soleblaze for ironing out the Metal Apple Silicon support!
Container images:
Standard (GPT + stablediffusion): quay.io/go-skynet/local-ai:v1.19.2
FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-ffmpeg
CUDA 11+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda11-ffmpeg
CUDA 12+FFmpeg: quay.io/go-skynet/local-ai:v1.19.2-cublas-cuda12-ffmpeg
06-06-2023: v1.18.0
This LocalAI release is full of new features, bugfixes and updates! Thanks to the community for the help, this was a great community release!
We now support a vast variety of models while staying backward compatible with prior quantization formats: this new release can still load older formats as well as the new k-quants!
New features
Added support for falcon-based model families (7b) ( mudler )
Experimental support for Metal Apple Silicon GPU ( mudler, and thanks to Soleblaze for testing! ). See the build section.
Support for token stream in the /v1/completions endpoint ( samm81 )
Bloomz has been updated to the latest ggml changes, including the new quantization format ( mudler )
RWKV has been updated to the new quantization format ( mudler )
k-quants format support for the llama models ( mudler )
gpt4all has been updated, incorporating upstream changes allowing it to load older models and to use different CPU instruction sets (AVX-only, AVX2) from the same binary! ( mudler )
Generic
Fully static Linux binary releases ( mudler )
Stablediffusion has been enabled on container images by default ( mudler )
Note: You can disable container image rebuilds with REBUILD=false
llama.cpp models can now also automatically save the prompt cache state, by specifying it in the model YAML configuration file:
# Enable prompt caching
# This is a file that will be used to save/load the cache. Relative to the models directory.
prompt_cache_path: "alpaca-cache"
# Always enable prompt cache
prompt_cache_all: true
23-05-2023: v1.15.0 released. The go-gpt2.cpp backend was renamed to go-ggml-transformers.cpp and updated, including https://github.com/ggerganov/llama.cpp/pull/1508, which breaks compatibility with older models. This impacts RedPajama, GptNeoX, MPT (not gpt4all-mpt), Dolly, GPT2 and Starcoder based models. Binary releases available, various fixes, including #341.
21-05-2023: v1.14.0 released. Minor updates to the /models/apply endpoint, llama.cpp backend updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. gpt4all is still compatible with the old format.
19-05-2023: v1.13.0 released! Updates to the gpt4all and llama backends, consolidated CUDA support (#310, thanks to @bubthegreat and @Thireus), preliminary support for installing models via API.
17-05-2023: v1.12.0 released! Minor fixes, plus CUDA (#258) support for llama.cpp-compatible models and image generation (#272).
16-05-2023: Experimental support for CUDA (#258) in the llama.cpp backend and Stable Diffusion CPU image generation (#272) in master.
13-05-2023: v1.11.0 released! Updated llama.cpp bindings: this update includes a breaking change in the model files ( https://github.com/ggerganov/llama.cpp/pull/1405 ). Old models should still work with the gpt4all-llama backend.
12-05-2023: v1.10.0 released! Updated gpt4all bindings. Added support for GPTNeox (experimental), RedPajama (experimental), Starcoder (experimental), Replit (experimental), MosaicML MPT. The embeddings endpoint now also supports token arrays. See the langchain-chroma example! Note: this update does NOT include https://github.com/ggerganov/llama.cpp/pull/1405, which makes models incompatible.
11-05-2023: v1.9.0 released! Important whisper updates (#233, #229) and extended gpt4all model families support (#232). Redpajama/dolly experimental (#214).
10-05-2023: v1.8.0 released! Added support for fast and accurate embeddings with bert.cpp (#222).
09-05-2023: Added experimental support for the transcriptions endpoint (#211).
08-05-2023: Support for embeddings with models using the llama.cpp backend (#207).
02-05-2023: Support for rwkv.cpp models (#158) and for the /edits endpoint.
01-05-2023: Support for SSE stream of tokens in llama.cpp backends (#152).
Features
This section contains the documentation for the features supported by LocalAI.
To generate an image you can send a POST request to the /v1/images/generations endpoint with the instruction as the request body:
# 512x512 is supported too
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "A cute baby sea otter",
"size": "256x256"
}'
Available additional parameters: mode, step.
Note: To set a negative prompt, you can split the prompt with |, for instance: a cute baby sea otter|malformed.
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "floating hair, portrait, ((loli)), ((one girl)), cute face, hidden hands, asymmetrical bangs, beautiful detailed eyes, eye shadow, hair ornament, ribbons, bowties, buttons, pleated skirt, (((masterpiece))), ((best quality)), colorful|((part of the head)), ((((mutated hands and fingers)))), deformed, blurry, bad anatomy, disfigured, poorly drawn face, mutation, mutated, extra limb, ugly, poorly drawn hands, missing limb, blurry, floating limbs, disconnected limbs, malformed hands, blur, out of focus, long neck, long body, Octane renderer, lowres, bad anatomy, bad hands, text",
"size": "256x256"
}'
stablediffusion-cpp
mode=0
mode=1 (winograd/sgemm)
Note: the image generator supports images up to 512x512. You can however use other tools to upscale the image, for instance https://github.com/upscayl/upscayl.
Setup
Note: In order to use the images/generation endpoint with the stablediffusion C++ backend, you need to build LocalAI with GO_TAGS=stablediffusion. If you are using the container images, it is already enabled.
While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:
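A sketch of such a call, assuming the galleries are enabled and the gallery exposes the model under the name stablediffusion (adjust the id to the actual entry in your gallery):

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "id": "model-gallery@stablediffusion"
   }'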
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
Model setup
The models will be downloaded automatically from Hugging Face the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a
Text generation (GPT)
LocalAI supports generating text with GPT-style models using llama.cpp and other backends (such as rwkv.cpp); see also Model compatibility for an up-to-date list of the supported model families.
Note:
You can also specify the model name as part of the OpenAI token.
If only one model is available, the API will use it for all the requests.
To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as per the request body:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens
List models
You can list all the models available with:
curl http://localhost:8080/v1/models
Audio to text
The transcription endpoint allows converting audio files to text. The endpoint is based on whisper.cpp, a C++ library for audio transcription. The endpoint supports the audio formats supported by ffmpeg.
Usage
Once LocalAI is started and whisper models are installed, you can use the /v1/audio/transcriptions API endpoint.
The transcriptions endpoint then can be tested like so:
## Get an example audio file
wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg

## Send the example audio file to the transcriptions endpoint
curl http://localhost:8080/v1/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@$PWD/gb1.ogg" -F model="whisper-1"

## Result
{"text":"My fellow Americans, this day has brought terrible news and great sadness to our country.At nine o'clock this morning, Mission Control in Houston lost contact with our Space ShuttleColumbia.A short time later, debris was seen falling from the skies above Texas.The Columbia's lost.There are no survivors.One board was a crew of seven.Colonel Rick Husband, Lieutenant Colonel Michael Anderson, Commander Laurel Clark, Captain DavidBrown, Commander William McCool, Dr. Kultna Shavla, and Elon Ramon, a colonel in the IsraeliAir Force.These men and women assumed great risk in the service to all humanity.In an age when spaceflight has come to seem almost routine, it is easy to overlook thedangers of travel by rocket and the difficulties of navigating the fierce outer atmosphere ofthe Earth.These astronauts knew the dangers, and they faced them willingly, knowing they had a highand noble purpose in life.Because of their courage and daring and idealism, we will miss them all the more.All Americans today are thinking as well of the families of these men and women who havebeen given this sudden shock and grief.You're not alone.Our entire nation agrees with you, and those you loved will always have the respect andgratitude of this country.The cause in which they died will continue.Mankind has led into the darkness beyond our world by the inspiration of discovery andthe longing to understand.Our journey into space will go on.In the skies today, we saw destruction and tragedy.As farther than we can see, there is comfort and hope.In the words of the prophet Isaiah, \"Lift your eyes and look to the heavens who createdall these, he who brings out the starry hosts one by one and calls them each by name.\"Because of his great power and mighty strength, not one of them is missing.The same creator who names the stars also knows the names of the seven souls we mourntoday.The crew of the shuttle Columbia did not return safely to Earth yet we can pray that all aresafely home.May God bless the grieving families and may God continue to bless America.[BLANK_AUDIO]"}
OpenAI functions
LocalAI supports running OpenAI functions with llama.cpp compatible models.
Check out also LocalAGI for an example on how to use LocalAI functions.
Setup
OpenAI functions are available only with ggml models compatible with llama.cpp.
Specify the llama backend in the model YAML configuration file:
name: openllama
parameters:
  model: ggml-openllama.bin
  top_p: 0.9
  top_k: 80
  temperature: 0.1
backend: llama # Set the `llama` backend
Usage example
To use the functions with the OpenAI client in python:
import openai

# ...

# Send the conversation and available functions to GPT
messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    functions=functions,
    function_call="auto",
)
# ...
Note
When running the python script, be sure to:
Set OPENAI_API_KEY environment variable to a random string (the OpenAI api key is NOT required!)
Set OPENAI_API_BASE to point to your LocalAI service, for example OPENAI_API_BASE=http://localhost:8080
Advanced
It is possible to also specify the full function signature (for debugging, or to use with other clients).
The chat endpoint accepts the grammar_json_functions additional parameter which takes a JSON schema object.
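A minimal sketch of such a request; the schema below is illustrative and simply describes the shape of the object you want the model to emit:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "What is the weather like in Boston?"}],
     "grammar_json_functions": {
       "type": "object",
       "properties": {
         "function": {"type": "string"},
         "arguments": {"type": "object"}
       },
       "required": ["function", "arguments"]
     }
   }'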
The huggingface backend is an optional backend of LocalAI and uses Python. If you are running LocalAI from the container images, you are good to go: it should already be configured for use. If you are running LocalAI manually, you must install the Python dependencies (pip install -r /path/to/LocalAI/extra/requirements) and specify the extra backend in the EXTERNAL_GRPC_BACKENDS environment variable (EXTERNAL_GRPC_BACKENDS="huggingface-embeddings:/path/to/LocalAI/extra/grpc/huggingface/huggingface.py").
The huggingface backend supports only embeddings of text, not of tokens. If you need to embed tokens you can use the bert backend or llama.cpp.
No models are required to be downloaded before using the huggingface backend. The models will be downloaded automatically the first time the API is used.
Llama.cpp embeddings
Embeddings with llama.cpp are supported with the llama backend.
Example that uses LLamaIndex and LocalAI as embedding: here.
Constrained grammars
The chat endpoint accepts an additional grammar parameter which takes a BNF-defined grammar.
This allows the LLM to constrain the output to a user-defined schema, allowing it to generate JSON, YAML, and anything else that can be defined with a BNF grammar.
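A minimal sketch of a request using the grammar parameter; the grammar below is an illustrative BNF that only allows the model to answer "yes" or "no":

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
     "grammar": "root ::= (\"yes\" | \"no\")"
   }'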
LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
Hardware requirements
Depending on the model you are attempting to run, you might need more RAM or CPU resources. Check out also here for ggml-based backends. rwkv is less expensive on resources.
Model compatibility table
Besides llama-based models, LocalAI is also compatible with other architectures. The table below lists all the compatible model families and the associated binding repository.
Note: You might need to convert some models from older formats to the new format; for instructions, see for example the README in llama.cpp on how to run gpt4all.
Subsections of Model compatibility
RWKV
A full example on how to run a rwkv model is in the examples.
Note: rwkv models need to specify the rwkv backend in the YAML config file, and have an associated tokenizer that needs to be provided along with them:
36464540 -rw-r--r-- 1 mudler mudler 1.2G May 3 10:51 rwkv_small
36464543 -rw-r--r-- 1 mudler mudler 2.4M May 3 10:51 rwkv_small.tokenizer.json
π¦ llama.cpp
llama.cpp is a popular port of Facebook’s LLaMA model in C/C++.
Note
The ggml file format has been deprecated. If you are using ggml models and configuring your model with a YAML file, use the llama-stable backend instead. If you are relying on automatic detection of the model, you should be fine.
Features
The llama.cpp model supports the following features:
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml models.
For instance, if you have the galleries enabled, you can just start chatting with models in huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can be also preloaded or downloaded on demand. To learn about model galleries, check out the model gallery documentation.
YAML configuration
To use the llama.cpp backend, specify llama as the backend in the YAML model config file:
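A minimal sketch of such a config file (the model file name is illustrative):

name: my-model
parameters:
  model: my-model.gguf
backend: llama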
Exllama
Exllama is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights".
Prerequisites
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
If you are building LocalAI locally, you need to install exllama manually first.
Model setup
Download the model as a folder inside the model directory and create a YAML file specifying the exllama backend. For instance with the TheBloke/WizardLM-7B-uncensored-GPTQ model:
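A minimal sketch of such a config file, assuming the model was downloaded into a folder named WizardLM-7B-uncensored-GPTQ inside the models directory:

name: WizardLM-7B-uncensored-GPTQ
parameters:
  model: WizardLM-7B-uncensored-GPTQ
backend: exllama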
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Prerequisites
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
If you are building LocalAI locally, you need to install AutoGPTQ manually.
Model setup
The models are automatically downloaded from Hugging Face the first time if not present. It is possible to define models via a YAML config file, or just by querying the endpoint with the Hugging Face repository model name. For example, create a YAML config file in models/:
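A minimal sketch of such a config file (the name and repository are illustrative), assuming the backend is selected with backend: autogptq:

name: my-gptq-model
parameters:
  model: TheBloke/WizardLM-7B-uncensored-GPTQ
backend: autogptq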
Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the diffusers library.
Note: currently only image generation is supported. It is experimental, so you might encounter some issues with models that weren't tested yet.
Setup
This is an extra backend: it is already available in the container images and there is nothing to do for the setup.
Model setup
The models will be downloaded automatically from Hugging Face the first time you use the backend.
Create a model configuration file in the models directory, for instance to use Linaqruf/animagine-xl with CPU:
name: animagine-xl
parameters:
  model: Linaqruf/animagine-xl
backend: diffusers
# Force CPU usage - set to true for GPU
f16: false
diffusers:
  pipeline_type: StableDiffusionXLPipeline
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a
Local models
You can also use local models, or modify some parameters like clip_skip, scheduler_type, for instance:
name: stablediffusion-depth
parameters:
  model: stabilityai/stable-diffusion-2-depth
backend: diffusers
step: 50
# Force CPU usage
f16: true
diffusers:
  pipeline_type: StableDiffusionDepth2ImgPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps,image"
  cfg_scale: 6
In order to build the LocalAI container image locally you can use docker:
# build the image
docker build -t localai .
docker run localai
Or you can build the binary manually with make:
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make build
To run: ./local-ai
Note
CPU flagset compatibility
LocalAI uses different backends based on ggml and llama.cpp to run models. If your CPU doesn’t support common instruction sets, you can disable them during build:
CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF" make build
To have effect on the container image, you need to set REBUILD=true:
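A sketch of how this could look with docker run, assuming the container honors the same CMAKE_ARGS used for local builds (image tag illustrative):

docker run -p 8080:8080 -v $PWD/models:/models \
  -e REBUILD=true \
  -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF" \
  quay.io/go-skynet/local-ai:latest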
Building on Mac (M1 or M2) works, but you may need to install some prerequisites using brew.
The below has been tested by one mac user and found to work. Note that this doesn’t use Docker to run the server:
# install build dependencies
brew install cmake
brew install go
# clone the repo
git clone https://github.com/go-skynet/LocalAI.git
cd LocalAI
# build the binary
make build
# Download gpt4all-j to models/
wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
# Use a template from the examples
cp -rf prompt-templates/ggml-gpt4all-j.tmpl models/
# Run LocalAI
./local-ai --models-path ./models/ --debug
# Now API is accessible at localhost:8080
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-gpt4all-j",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
Build with Image generation support
Requirements: OpenCV, Gomp
Image generation is experimental and requires GO_TAGS=stablediffusion to be set during build:
make GO_TAGS=stablediffusion build
Build with Text to audio support
Requirements: piper-phonemize
Text to audio support is experimental and requires GO_TAGS=tts to be set during build:
make GO_TAGS=tts build
Acceleration
List of the variables available to customize the build:
| Variable | Default | Description |
| --- | --- | --- |
| BUILD_TYPE | None | Build type. Available: cublas, openblas, clblas, metal |
| GO_TAGS | tts stablediffusion | Go tags. Available: stablediffusion, tts |
| CLBLAST_DIR | | Specify a CLBlast directory |
| CUDA_LIBPATH | | Specify a CUDA library path |
OpenBLAS
Software acceleration.
Requirements: OpenBLAS
make BUILD_TYPE=openblas build
CuBLAS
Nvidia Acceleration.
Requirement: Nvidia CUDA toolkit
Note: CuBLAS support is experimental and has not been tested on real hardware. Please report any issues you find!
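Following the same pattern as the OpenBLAS example above:

make BUILD_TYPE=cublas build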
Metal (Apple Silicon)
GPU acceleration on Apple Silicon via Metal.
make BUILD_TYPE=metal build
# Set `gpu_layers: 1` to your YAML model config file and `f16: true`
# Note: only models quantized with q4_0 are supported!
In order to define default prompts and model parameters (such as a custom default top_p or top_k), LocalAI can be configured to serve user-defined models with a set of default parameters and templates.
You can create multiple YAML files in the models path, or specify a single YAML configuration file.
Consider the following models folder in the example/chatbot-ui:
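An illustrative sketch of what such a folder could contain (file names chosen to match the configuration shown below, not necessarily the actual example):

models/
  gpt-3.5-turbo.yaml
  luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  chat.tmpl
  completion.tmpl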
The gpt-3.5-turbo.yaml file defines the gpt-3.5-turbo model, which is an alias to use luna-ai-llama2 with pre-defined options.
For instance, consider the following that declares gpt-3.5-turbo backed by the luna-ai-llama2 model:
name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..

# Default context size
context_size: 512
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj rwkv
# Enable prompt caching
prompt_cache_path: "alpaca-cache"
prompt_cache_all: true

# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# define chat roles
roles:
  assistant: '### Response:'
  system: '### System Instruction:'
  user: '### Instruction:'
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: chat
Specifying a config file via the CLI allows declaring models in a single file as a list, for instance:
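A sketch of such a config file, assuming it is passed with the --config-file flag (model entries are illustrative):

- name: gpt-3.5-turbo
  parameters:
    model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
    temperature: 0.3
  context_size: 512
  backend: llama-stable
- name: another-model
  parameters:
    model: another-model.bin
  context_size: 1024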
See also chatbot-ui as an example on how to use config files.
Full config model file reference
# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo

# Default model parameters.
# These options can also be specified in the API calls
parameters:
  # Relative to the models path
  model: luna-ai-llama2-uncensored.ggmlv3.q5_K_M.bin
  # temperature
  temperature: 0.3
  # all the OpenAI request options here..
  top_k:
  top_p:
  max_tokens:
  batch:
  f16: true
  ignore_eos: true
  n_keep: 10
  seed:
  mode:
  step:
  negative_prompt:
  typical_p:
  tfz:
  frequency_penalty:
  mirostat_eta:
  mirostat_tau:
  mirostat:
  rope_freq_base:
  rope_freq_scale:
  negative_prompt_scale:

# Default context size
context_size: 512
# Default number of threads
threads: 10
# Define a backend (optional). By default it will try to guess the backend the first time the model is interacted with.
backend: llama-stable # available: llama, stablelm, gpt2, gptj rwkv

# stopwords (if supported by the backend)
stopwords:
- "HUMAN:"
- "### Response:"
# string to trim space to
trimspace:
- string
# Strings to cut from the response
cutstrings:
- "string"

# Directory used to store additional assets
asset_dir: ""

# define chat roles
roles:
  user: "HUMAN:"
  system: "GPT:"
  assistant: "ASSISTANT:"
template:
  # template file ".tmpl" with the prompt template to use by default on the endpoint call. Note there is no extension in the files
  completion: completion
  chat: chat
  edit: edit_template
  function: function_template

function:
  disable_no_action: true
  no_action_function_name: "reply"
  no_action_description_name: "Reply to the AI assistant"

system_prompt:
rms_norm_eps:
# Set it to 8 for llama2 70b
ngqa: 1

## LLAMA specific options
# Enable F16 if backend supports it
f16: true
# Enable debugging
debug: true
# Enable embeddings
embeddings: true
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: 22
# Enable memory lock
mmlock: true
# GPU setting to split the tensor in multiple parts and define a main GPU
# see llama.cpp for usage
tensor_split: ""
main_gpu: ""
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Enable mmap
mmap: true
# Enable low vram mode (GPU only)
low_vram: true
# Set NUMA mode (CPU only)
numa: true
# Lora settings
lora_adapter: "/path/to/lora/adapter"
lora_base: "/path/to/lora/base"
# Disable mulmatq (CUDA)
no_mulmatq: true
You can use a default template for every model present in your model path, by creating a corresponding file with the `.tmpl` suffix next to your model. For instance, if the model is called `foo.bin`, you can create a sibling file, `foo.bin.tmpl` which will be used as a default prompt and can be used with alpaca:
The below instruction describes a task. Write a response that appropriately completes the request.
### Instruction:
{{.Input}}
### Response:
See the prompt-templates directory in this repository for templates for some of the most popular models.
For the edit endpoint, an example template for alpaca-based models can be:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{{.Instruction}}

### Input:
{{.Input}}

### Response:
Install models using the API
Instead of installing models manually, you can use the LocalAI API endpoints and a model definition to install models programmatically at runtime.
A curated collection of model files is in the model-gallery (work in progress!). The files of the model gallery are different from the model files used to configure LocalAI models: the model gallery files contain information about the model setup and the files necessary to run the model locally.
To install for example lunademo, you can send a POST call to the /models/apply endpoint with the model definition URL (url) and the name the model should have in LocalAI (name, optional):
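A sketch of such a call; the model definition URL below is a placeholder for the lunademo definition in your gallery:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "url": "<MODEL_DEFINITION_URL>",
     "name": "lunademo"
   }'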
PRELOAD_MODELS (or --preload-models) takes a JSON list with the same parameters as the API calls to the /models/apply endpoint.
Similarly, a path to a YAML configuration file containing a list of models can be specified with PRELOAD_MODELS_CONFIG (or --preload-models-config):
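A sketch of both options; the model URL and the YAML path below are placeholders:

# Preload a model from a gallery configuration URL at start
PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]' ./local-ai

# Or point to a YAML file containing the same list
PRELOAD_MODELS_CONFIG=/path/to/models.yaml ./local-ai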
LocalAI can automatically cache prompts for faster loading of the prompt. This can be useful if your model needs a prompt template with prefixed text before the input.
To enable prompt caching, you can control the settings in the model config YAML file:
prompt_cache_path is relative to the models folder. You can enter a name for the file that will be automatically created during the first load if prompt_cache_all is set to true.
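A minimal sketch of the relevant settings, mirroring the prompt caching example earlier in this document (the cache file name is illustrative):

# Enable prompt caching (file is relative to the models directory)
prompt_cache_path: "my-model-cache"
# Cache all the prompts
prompt_cache_all: true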
Configuring a specific backend for the model
By default LocalAI will try to autoload the model by trying all the backends. This works for most models, but some backends are NOT configured to autoload.
In order to specify a backend for your models, create a model config file in your models directory specifying the backend:
name: gpt-3.5-turbo
# Default model parameters
parameters:
  # Relative to the models path
  model: ...
backend: llama-stable
# ...
Connect external backends
LocalAI backends are internally implemented using gRPC services. This also allows LocalAI to connect to external gRPC services on start and extend LocalAI functionalities via third-party binaries.
The --external-grpc-backends parameter in the CLI can be used either to specify a local backend (a file) or a remote URL. The syntax is <BACKEND_NAME>:<BACKEND_URI>. Once LocalAI is started with it, the new backend name will be available for all the API endpoints.
So for instance, to register a new backend which is a local file:
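A sketch of the command, using the same <BACKEND_NAME>:<BACKEND_URI> syntax described above (the backend name and path are illustrative):

./local-ai --external-grpc-backends "my-backend:/path/to/my/backend.py"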
When LocalAI runs in a container, there are additional environment variables available that modify the behavior of LocalAI on startup:
| Environment variable | Default | Description |
| --- | --- | --- |
| REBUILD | true | Rebuild LocalAI on startup |
| BUILD_TYPE | | Build type. Available: cublas, openblas, clblas |
| GO_TAGS | | Go tags. Available: stablediffusion |
| HUGGINGFACEHUB_API_TOKEN | | Special token for interacting with the HuggingFace Inference API, required only when using the langchain-huggingface backend |
Model gallery
The model gallery is an (experimental!) collection of model configurations for LocalAI.
To ease model installation, LocalAI provides a way to preload models on start, and to download and install them at runtime. You can install models manually by copying them over to the models directory, or use the API to configure, download and verify the model assets for you. As the UI is still a work in progress, you will find here the documentation about the API endpoints.
Note
The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.
Note
GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.
Useful Links and resources
Open LLM Leaderboard - here you can find a list of the most performing models on the Open LLM benchmark. Keep in mind models compatible with LocalAI must be quantized in the ggml format.
Model repositories
You can install a model at runtime, while the API is running and already started, or before starting the API by preloading the models.
To install a model in runtime you will need to use the /models/apply LocalAI API endpoint.
To enable the model-gallery repository you need to start local-ai with the GALLERIES environment variable:
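A sketch with the model-gallery repository alone, running the binary directly (adapt to your deployment):

GALLERIES='[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]' ./local-ai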
where github:go-skynet/model-gallery/index.yaml will be expanded automatically to https://raw.githubusercontent.com/go-skynet/model-gallery/main/index.yaml.
Note
As this feature is experimental, you need to run local-ai with a list of GALLERIES. Currently there are two galleries:
An official one, containing only definitions and models with a clear LICENSE to avoid any DMCA infringement. As I'm not sure what the best action is in this case, I'm not going to include in this repository, which is officially linked to LocalAI, any model that is not clearly licensed.
A “community” one that contains an index of huggingface models that are compatible with the ggml format and lives in the localai-huggingface-zoo repository.
To enable the two repositories, start LocalAI with the GALLERIES environment variable:
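For instance, reusing the same two gallery definitions that appear in the .env example later in this document:

GALLERIES='[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"name":"huggingface", "url":"github:go-skynet/model-gallery/huggingface.yaml"}]' ./local-ai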
If running with docker-compose, simply edit the .env file and uncomment the GALLERIES variable, and add the one you want to use.
Note
You might not find all the models in this gallery. Automated CI updates the gallery automatically. However, you can find most of the models on Hugging Face (https://huggingface.co/); generally a model should be available ~24h after upload.
Under no circumstances are LocalAI and its developers responsible for the models in this gallery: the CI just indexes them and provides a convenient way to install them with an automatic configuration and a consistent API. Don't install models from authors you don't trust, and check the appropriate license for your use case. Models are automatically indexed and hosted on Hugging Face (https://huggingface.co/). For any issue with the models, please open an issue on the model gallery repository if it's a LocalAI misconfiguration; otherwise refer to the Hugging Face repository. If you think a model should not be listed, please reach out to us and we will remove it from the gallery.
Note
There is no documentation yet on how to build a gallery or a repository - but you can find an example in the model-gallery repository.
List Models
To list all the available models, use the /models/available endpoint:
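For example, against a local instance:

curl http://localhost:8080/models/available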
Models can be installed by passing either the full URL of the YAML config file or an identifier of the model in the gallery. The gallery is a repository of models that can be installed by passing the model name.
To install a model from the gallery repository, you can pass the model name in the id field. For instance, to install the bert-embeddings model, you can use the following command:
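Using the <GALLERY>@<MODEL_NAME> identifier form explained below:

LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
     "id": "model-gallery@bert-embeddings"
   }'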
model-gallery is the repository; it is optional and can be omitted. If the repository is omitted, LocalAI will search for the model by name in all the repositories. If the same model name is present in multiple galleries, the first match wins.
bert-embeddings is the model name in the gallery
(read its config here).
Note
If the huggingface model gallery is enabled (it’s enabled by default),
and the model has an entry in the model gallery’s associated YAML config
(for huggingface, see model-gallery/huggingface.yaml),
you can install models by specifying directly the model’s id.
For example, to install wizardlm superhot:
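A sketch of such a call; the id below is a placeholder for the actual Hugging Face repository and file name as indexed by the huggingface gallery:

curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{
     "id": "huggingface@<REPOSITORY>/<MODEL_FILE>"
   }'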
Note that the id can be used similarly when pre-loading models at start.
How to install a model (without a gallery)
If you don’t want to set any gallery repository, you can still install models by loading a model configuration file.
In the body of the request you must specify the model configuration file URL (url), optionally a name to install the model (name), extra files to install (files), and configuration overrides (overrides). When calling the API endpoint, LocalAI will download the models files and write the configuration to the folder used to store models.
LOCALAI=http://localhost:8080
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"url": "<MODEL_CONFIG_FILE>"
}'

# or if from a repository
curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
"id": "<GALLERY>@<MODEL_NAME>"
}'
To preload models on start instead, use the PRELOAD_MODELS environment variable by setting it to a JSON array of model URIs:
PRELOAD_MODELS='[{"url": "<MODEL_URL>"}]'
Note: url or id must be specified. url points to a model gallery configuration file, while id refers to a model inside a repository. If both are specified, the id will be used.
While the API is running, you can install the model by using the /models/apply endpoint and point it to the stablediffusion model in the models-gallery:
LocalAI will create a batch process that downloads the required files from a model definition and automatically reload itself to include the new model.
Input: url or id (required), name (optional), files (optional)
An optional list of additional files to download can be specified with files. The name allows overriding the model name. Finally, it is possible to override the model config file with overrides.
The url is a full URL, or a github url (github:org/repo/file.yaml), or a local file (file:///path/to/file.yaml).
The id is a string in the form <GALLERY>@<MODEL_NAME>, where <GALLERY> is the name of the gallery, and <MODEL_NAME> is the name of the model in the gallery. Galleries can be specified during startup with the GALLERIES environment variable.
Returns a uuid and a url to follow up on the state of the process:
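The response is roughly shaped as follows (values are illustrative; the exact fields may differ between versions):

{
  "uuid": "<JOB_UUID>",
  "status": "http://localhost:8080/models/jobs/<JOB_UUID>"
}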
Feel free to open up a PR to get your project listed!
FAQ
Frequently asked questions
Here are answers to some of the most common questions.
How do I get models?
Most ggml-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open up issues. However, be cautious about downloading models from the internet and directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=ggml, and models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
What’s the difference with Serge, or XXX?
LocalAI is a multi-model solution that doesn't focus on a specific model type (e.g., llama.cpp or alpaca.cpp); it handles all of these internally for faster inference, and is easy to set up locally and deploy to Kubernetes.
Everything is slow, how come?
There are a few situations in which this could occur. Some tips are:
Don't use an HDD to store your models; prefer an SSD. If you are stuck with an HDD, disable mmap in the model config file so the whole model is loaded in memory.
Watch out for CPU overbooking. Ideally, --threads should match the number of physical cores. For instance, if your CPU has 4 cores, you would ideally allocate at most 4 threads to a model.
Run LocalAI with DEBUG=true. This gives more information, including stats on the token inference speed.
Check that you are actually getting an output: run a simple curl request with "stream": true to see how fast the model is responding.
Can I use it with a Discord bot, or XXX?
Yes! If the client uses the OpenAI API and supports setting a different base URL for requests, you can use the LocalAI endpoint. This allows using LocalAI with every application that was built to work with OpenAI, without changing the application!
Can this leverage GPUs?
There is partial GPU support, see build instructions above.
Where is the webUI?
localai-webui and chatbot-ui are available in the examples section and can be set up as per the instructions. However, as LocalAI is an API, you can already plug it into existing projects that provide UI frontends to OpenAI's APIs. There are several on GitHub that should already be compatible with LocalAI (as it mimics the OpenAI API).
Enable the debug mode by setting DEBUG=true in the environment variables. This will give you more information on what’s going on.
You can also specify --debug in the command line.
I’m getting ‘invalid pitch’ error when running with CUDA, what’s wrong?
This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.
I’m getting a ‘SIGILL’ error, what’s wrong?
Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting REBUILD=true and disable the CPU instructions that are not compatible with your CPU. For instance: CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" make build
How-tos
This section includes LocalAI end-to-end examples, tutorials and how-tos curated by the community and maintained by lunamidori5.
In the "lunademo.yaml" file (If you want to see advanced yaml configs - Link)
backend:llamacontext_size:2000f16:true## If you are using cpu set this to falsegpu_layers:4name:lunademoparameters:model:luna-ai-llama2-uncensored.Q4_0.gguftemperature:0.2top_k:40top_p:0.65roles:assistant:'ASSISTANT:'system:'SYSTEM:'user:'USER:'template:chat:lunademo-chatcompletion:lunademo-completion
Now that we have that fully set up, we need to reboot the docker. Go back to the localai folder and run
docker-compose restart
Now that we got that setup, lets test it out but sending a request by using Curl Or use the Openai Python API!
The below command requires the Docker container to already be running, and uses the Model Gallery to download the model. It may also set up a model YAML config file, but we will need to override that for this how-to. A sketch of such a command follows.
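This sketch assumes the huggingface gallery from the .env used in the setup sections; the id is illustrative (based on the model file used below), and the exact entry can be looked up in /models/available:
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{
    "id": "huggingface@TheBloke/Luna-AI-Llama2-Uncensored-GGUF/luna-ai-llama2-uncensored.Q4_0.gguf"
  }'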
In the "lunademo.yaml" file (If you want to see advanced yaml configs - Link)
backend: llama
context_size: 2000
f16: true ## If you are using cpu set this to false
gpu_layers: 4
name: lunademo
parameters:
  model: luna-ai-llama2-uncensored.Q4_0.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.65
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
template:
  chat: lunademo-chat
  completion: lunademo-completion
Now that we have that fully set up, we need to restart the Docker container. Go back to the LocalAI folder and run
docker-compose restart
Now that we have that set up, let's test it out by sending a request with curl (shown below), or by using the OpenAI Python API!
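A minimal curl sketch, assuming the container from this how-to is listening on port 8080:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lunademo",
    "messages": [{"role": "user", "content": "How are you?"}],
    "temperature": 0.2
  }'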
See OpenAI API for more info!
Have fun using LocalAI!
Easy Request - Openai
Now we can make an OpenAI request!
OpenAI Chat API Python -
import os
import openai

openai.api_base = "http://localhost:8080/v1"
openai.api_key = "sx-xxx"
OPENAI_API_KEY = "sx-xxx"
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

completion = openai.ChatCompletion.create(
    model="lunademo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How are you?"}
    ]
)

print(completion.choices[0].message.content)
See OpenAI API for more info!
Have fun using LocalAI!
Easy Setup - CPU Docker
We are going to run LocalAI with docker-compose for this setup.
Let's clone LocalAI with git.
git clone https://github.com/go-skynet/LocalAI
Then we will cd into the LocalAI folder.
cd LocalAI
At this point we want to set up our .env file; here is a copy for you to use if you wish. Please make sure it matches the docker-compose file you will use later.
## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=2

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Default models context size
# CONTEXT_SIZE=512

## Define galleries.
## Models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]

## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*

## Default path for models
# MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Specify a build type. Available: cublas, openblas, clblas.
## Do not uncomment this as we are using CPU:
# BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper
## (requires REBUILD=true)
# GO_TAGS=tts

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

# HUGGINGFACEHUB_API_TOKEN=Token here
Now that we have the .env set up, let's set up our docker-compose file.
It will use a container from quay.io.
Also note this docker-compose file is for CPU only.
version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:master
    tty: true # enable colorized logs
    restart: always # should this be on-failure ?
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models
    command: ["/usr/bin/local-ai"]
Make sure to save that in the root of the LocalAI folder. Then let's spin up the Docker container; run this in a CMD or BASH prompt:
docker-compose up -d --pull always
Now we let that finish setting up. Once it is done, let's check to make sure our huggingface / localai galleries are working (wait until the container has finished starting up before doing this).
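One way to verify this is to list the models the galleries expose, using the /models/available endpoint referenced in the .env comments above (a sketch):
curl http://localhost:8080/models/available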
There are Full_Auto installers compatible with some Linux distributions; feel free to use them, but note that they may not fully work. If you need to install something, please use the links at the top.
git clone https://github.com/lunamidori5/localai-lunademo.git
cd localai-lunademo
# Pick your type of Linux for the Full Autos. If you already have python, docker, and
# docker-compose installed, skip this chmod, but make sure you chmod the setup_linux file.
chmod +x Full_Auto_setup_Debian.sh
# or
chmod +x Full_Auto_setup_Ubutnu.sh
chmod +x Setup_Linux.sh
# Make sure to install CUDA on your host OS and in Docker if you plan on using GPU
./(the setup file you wish to run)
Windows Hosts:
REM Make sure you have git, docker-desktop, and python 3.11 installed
git clone https://github.com/lunamidori5/localai-lunademo.git
cd localai-lunademo
call Setup.bat
MacOS Hosts:
I need some help working on a MacOS setup file. If you are willing to help out, please contact Luna Midori on Discord or put in a PR on Luna Midori’s GitHub.
Video How Tos
Ubuntu - COMING SOON
Debian - COMING SOON
Windows - COMING SOON
MacOS - PLANNED - NEED HELP
Enjoy localai! (If you need help contact Luna Midori on Discord)
Trying to run Setup.bat or Setup_Linux.sh from Git Bash on Windows does not work.
Running over SSH or other remote command line based apps may bug out, load slowly, or crash.
There seems to be a bug with docker-compose not running. (Main.py workaround added)
Easy Setup - Embeddings
To install an embedding model, run the following command:
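A sketch of such a command, assuming the model-gallery configured in the setup sections provides a bert-embeddings entry (the id and the name override are illustrative; check /models/available for the exact entry). The name matches the model used in the request below:
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{
    "id": "model-gallery@bert-embeddings",
    "name": "text-embedding-ada-002"
  }'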
When you would like to request the model from the CLI, you can do:
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json"\
-d '{
"input": "The food was delicious and the waiter...",
"model": "text-embedding-ada-002"
}'
We are going to run LocalAI with docker-compose for this setup.
Let's clone LocalAI with git.
git clone https://github.com/go-skynet/LocalAI
Then we will cd into the LocalAI folder.
cd LocalAI
At this point we want to set up our .env file; here is a copy for you to use if you wish. Please make sure it matches the docker-compose file you will use later.
## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=2

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Default models context size
# CONTEXT_SIZE=512

## Define galleries.
## Models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]

## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*

## Default path for models
# MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Specify a build type. Available: cublas, openblas, clblas.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper
## (requires REBUILD=true)
# GO_TAGS=tts

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

# HUGGINGFACEHUB_API_TOKEN=Token here
Now that we have the .env set up, let's set up our docker-compose file.
It will use a container from quay.io.
Also note this docker-compose file is for CUDA only.
Make sure to save that in the root of the LocalAI folder. Then let's spin up the Docker container; run this in a CMD or BASH prompt:
docker-compose up -d --pull always
Now we let that finish setting up. Once it is done, let's check to make sure our huggingface / localai galleries are working (wait until the container has finished starting up before doing this).