Running AI Locally and Privately for Free
No need to sacrifice your privacy in order to take advantage of AI and LLMs
I went down a rabbit hole recently after learning that many LLMs can run on local hardware, as long as that hardware is powerful enough. The last time I tried this was running Stable Diffusion on my poor GTX 1060, which took about a minute to generate a single image. I was happy to find out that the latest MacBooks have the support and hardware to run LLMs locally without being too taxing.
The biggest issue with AI tools such as ChatGPT is the lack of guarantees around data privacy. You have no say in whether the AI will be trained on your prompts and inputs unless you are a higher-end customer who pays for that benefit. OpenAI has shown itself to be a questionable steward of the AI movement as a “nonprofit” company, and we should always question the motives of Big Tech and their censorship of models. Thankfully, open source LLMs have improved significantly thanks to the efforts of many contributors across the internet. One such model aimed specifically at coding work is OpenCodeInterpreter, which has benchmarks showing it competing with GPT-4 in coding scenarios. You can see more info on their site here.
For context, my hardware is an M3 Pro MacBook with 18GB of unified memory. You’ll need to adjust which model sizes you use depending on your hardware; models come in smaller and larger variants, and your available RAM will largely determine which ones you can run.
We are going to use Ollama, which downloads and runs models locally and offers additional functionality such as API endpoints, code libraries, base prompt customization, and OpenAI compatibility. The software is reminiscent of Docker in many ways: you can download multiple models and run them at will, all with a few simple commands.
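To give a feel for that Docker-like workflow, here are the commands you’ll use most (the model name here is just an example; pick sizes and tags that fit your RAM):
ollama pull llama2      # download a model, similar to docker pull
ollama list             # show the models you have locally
ollama run llama2       # start an interactive chat with a model
ollama rm llama2        # remove a model you no longer need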
Executive Summary
Download ollama
In terminal run:
ollama run pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M
You now have a local, private chatbot in the terminal
See integrations you can now use to get a native chat UI and more
Demo and Install
To run models locally you will need to install Ollama, which you can download here: https://ollama.com/. I’m using it on a Mac and ran into no issues during the process, but if you’re on Windows, note that support is recent and still in preview.
After installation you can verify it worked by running ollama -v
and confirming the application can be found.
Now you can execute ollama run llama2
in the terminal and you will be dropped into a chat prompt. I asked it to generate all prime numbers up to 100 in Go, and it successfully output the program, which I ran locally in my terminal with no issue. You can see the performance below in real time.
Here is the code that was generated:
package main

import "fmt"

func main() {
    for i := 2; i < 100; i++ {
        if !isPrime(i) {
            continue
        }
        fmt.Println(i)
    }
}

func isPrime(n int) bool {
    if n < 2 {
        return false
    }
    for i := 2; i*i <= n; i++ {
        if n%i == 0 {
            return false
        }
    }
    return true
}
One thing to emphasize is that this was all done locally! No external API calls, no privacy breaches, no personal data leaks. You can even run censorship-free models, all on hardware you own, using open source code and models!
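A small aside on the interactive session: it has a handful of slash commands of its own (the exact set may vary by version), the two I reach for most being:
/?      list the available in-chat commands
/bye    exit the chat and return to your shell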
While the terminal prompt is useful, we also have access to an API endpoint.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Write me code for the classic fizz buzz problem",
  "stream": false
}'
If we pipe the response to jq we get:
{
  "model": "llama2",
  "created_at": "2024-04-01T00:48:41.657866Z",
  "response": "Of course! Here is some sample Python code that prints out the numbers from 1 to 100, replacing multiples of 3 with \"Fizz\" and multiples of 5 with \"Buzz\":\n```\nfor num in range(1, 101):\n if num % 3 == 0:\n print(\"Fizz\")\n elif num % 5 == 0:\n print(\"Buzz\")\n else:\n print(num)\n```\nLet me know if you have any questions or need further clarification!",
  "done": true,
  "context": [...],
  "total_duration": 4301712292,
  "load_duration": 2555167,
  "prompt_eval_count": 16,
  "prompt_eval_duration": 268586000,
  "eval_count": 124,
  "eval_duration": 4029886000
}
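As an aside, the duration fields in the response are reported in nanoseconds, so you can estimate generation speed straight from them; using the numbers above:
eval_count / eval_duration ≈ 124 tokens / 4.03 s ≈ 31 tokens per second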
A better look at the code:
for num in range(1, 101):
    if num % 3 == 0:
        print("Fizz")
    elif num % 5 == 0:
        print("Buzz")
    else:
        print(num)
You can see more of the API capabilities for integration here. When using the API, use the keep_alive parameter to avoid the model shutting down between queries, since by default it only stays loaded for 5 minutes. Querying the API doesn’t seem to reuse the model already loaded in a terminal session, even if it’s the same model, so you will see longer response times, up to 15 seconds, because your CPU and RAM are already taken up. If you kill the terminal session you should get faster API responses, though I have still seen them take around 10 seconds, and others come back in as little as 2 seconds. I’m not sure if this is an API-only quirk, as the terminal interface has always been very fast for me.
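For example, here is the same request again with keep_alive set; it accepts duration strings like "30m" (or -1 to keep the model loaded indefinitely):
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Write me code for the classic fizz buzz problem",
  "stream": false,
  "keep_alive": "30m"
}'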
They also have Python and JavaScript libraries if you’d rather not make raw HTTP calls yourself. Ollama also has partial support for OpenAI’s API format, meaning you can substitute your local model’s endpoint into any OpenAI-based integration.
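As a rough sketch of that compatibility, the same chat request can be sent to the /v1/chat/completions path that OpenAI clients expect, served from localhost:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Write me code for the classic fizz buzz problem"}]
  }'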
Using the Coding Models
Now that we have a model running, we can switch to more targeted models trained specifically for coding.
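As shown in the executive summary, pulling and running a coding-focused model works exactly the same way; for example, the OpenCodeInterpreter build mentioned earlier:
ollama run pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M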