Run even larger AI models locally with LM Studio

A few days ago, I wrote a post about how to run large language AI models on your local PC using Ollama. I am a big fan of Ollama, but I have been using a new tool that is even better for interactive use.

LM Studio offers many of the same features and ease of use as Ollama and a lot more. It runs on Windows, Mac, and Linux and can be used interactively, or as a server that mimics OpenAI's API.

I've been liking LM Studio so much, I think I am going to remove Ollama from my machine. I have been using Ollama interactively as well as a server for other processes. I even have Ollama linked into VS Code to act as my own version of Github Copilot.

Installing LM Studio

Super easy, barely an inconvience. Just go to https://lmstudio.ai/ and choose your operating system.

Using LM Studio

Once you have LM Studio installed, you are going to first need to download some models. Depending on your system, and how much VRAM you have, you choices may be limited. One of the great things about LM Studio is you can use VRAM along with your system ram (at a big performance penalty). This will allow you to use much larger and less quantized models.

I have an AMD 5950X with 64G DDR4 and an nVidia 3090 in my main system. This gives me 24G VRAM and 64G of system ram. One of the models I have been playing with lately is Dolphin-Mixtral, which is a MoE model. I'm not going to get into a MoE model in this post, but it is a newer approach to LLM that uses multiple smaller models to provide fine tuned experts to break up responses.

Let's look at my options for this model and my hardware.

First we got to go to the search models tab, and then select the 2.7 Mixtral version. This is the latest release for Dolphin-Mixtral.

At the top of LM Studio, you can see my resources available. Which is the amount I said above but with the current overhead factored in.

Select the model, and on the right you will see all the files associated with it.

You can see I have two installed which I chose specifically. The first one will fit entirely on my GPU and run at top performance, the other is a 4 bit quantized version which is a lot better but requires a lot of ram. If you look at the first model, it is a 2 bit quantized model, which means the precision is highly reduced resulting in more potentially inaccurate choices as you go through the neural network. I would recommend using a 4 bit model if possible, the source model is typically 16 bit, so anything less than this will have reduced accuracy. 4 bit is generally a good compromise while making it accessable with consumer hardware.

Let's try the 2 bit version, and see how that goes.

First you need to go into the chat tab, and then load the model up top.

On the right you will see some choices, for this model we are going to want to set GPU Layers to -1, this will force all layers onto the GPU, this is ideal for this model as it will fit on my 24G 3090. If your GPU can't fit it, you will get an error. You will also want to set the context window, this is how much data the model can reference. 2048 is the default, but the more tokens you have the further back the conversation can go. 2048 is a good starting point for most tasks, if you are consuming more information you may need to increase this.

The first prompt I am going to use to test, is:

I have a hot dog on a plate in the kitchen, I take the plate into the living room and sit down. Where is the hot dog?

Even the 2 bit model is able to answer this, despite many other models failing.

On the bottom, we can see some performance numbers to see how fast we are generating responses. 48 tokens per second is a very acceptable speed. In fact, this is faster than you can read.

I'm going to switch to the 4 bit quantized version, this version requires 26GB, just over my available VRam, so before I load it, I need to change the GPU layers parameter. I found through some testing I can offload 20 layers to the GPU and use most of the available VRAM.

This configuration though I lost a lot of performance, dropping down just under 8 tokens per second. This is still usable, and not as fast as most poeple can read, but not slow enough that you are waiting forever. Most of the model is fitted on the GPU, with a few layers done on the CPU and system ram.

I can tweak the settings a little bit, and get 22 layers on the GPU for a slight improvement but I can't get the last couple layers on the GPU due to the ram requirements. This gave me a slight increase in performance, but nothing major. Depending on how much VRAM you have, your results will vary. I can also increase the CPU threads to 12 ( I have 16 native cores on my CPU) to get similar performance without increasing layers.

Just as important as your prompt, is the system prompt you give the model before asking it a question. The default prompt is very simple and can be modified to suit your needs.

For example, you can give it a prompt "You are an expert lawyer, and your client gives you a call and asks a question. Please answer their questions to the best of your ability". You can save this as a preset "Lawyer".

LM Studio also exposes a lot more advanced settings you can use to tweak your experience. From my experience, LM Studio is a bit buggy, at least on the Linux beta and you may have better results from Ollama if the bugs creep up in your use.

For most people, the Dolphin Mixtral may be too big of a model to work with, and you might want to look at someting like Open Orca 7B. As always, you can explore models on Hugging Face, the goto stop for Open LLM models.