We’ve all seen the news reports, stories, and videos about AI development and the sprawling data centers that host large language models (LLMs). These facilities can draw as much power as a small city, and they keep multiplying worldwide as AI and chatbots work their way further into our daily lives. As a tool, an LLM is great at offloading mundane tasks, quickly looking something up, or even interacting with other parts of your home. Throw in a smart home platform like Home Assistant and your own self-hosted LLM, and you’ve got one powerful local setup.
The problem with self-hosting your own LLMs is the system resource requirements. These models, notably the larger ones with better context, need incredible amounts of memory. That’s fine in principle, since you can load a motherboard with 128 GB of RAM or more, but system RAM is slow compared to the VRAM on a GPU or the CPU’s cache. VRAM is the best place to run LLMs, which is why Nvidia GPUs and their tool sets lead the way.
But how much do you need to spend on a GPU to comfortably run an LLM with decent results? Turns out, it’s likely considerably less than you’d assume. There’s a myth that you need to spend four figures on a GPU to make the most of a locally hosted LLM, but that couldn’t be further from the truth. Don’t believe what you see on social media and in videos of systems running RTX 5090 GPUs. All you need is a budget card, or even just a CPU, if you don’t mind using smaller models with slightly longer wait times.
Bigger is better
But you don’t always need the best
Look, before you sound off in the comments about how I’m wrong and how using the best GPU you can reasonably afford is the right way to host LLMs at home, I agree with you. An RTX 5090 is the best card for the job, with plenty of memory and beefy internals to handle processing. The RTX 3090 is considered the best bang for your buck, but even that GPU costs a small fortune, especially in today’s market. The thing is, you don’t need to run 32B models to make the most of what LLMs can do.
You don’t need a beefy GPU to run a local LLM
Think you know your way around local AI? Test your knowledge of running LLMs without breaking the bank.
Which popular open-source tool is widely used to run large language models locally on consumer hardware without writing any code?
That’s right! Ollama is a lightweight, easy-to-use tool that lets you download and run LLMs locally with simple terminal commands. It handles model management, hardware detection, and even exposes a local API — making it one of the most accessible entry points for local AI.
Not quite — the answer is Ollama. While TensorFlow Serving and CUDA Toolkit are real AI infrastructure tools, they require significantly more setup. Ollama is purpose-built for running LLMs locally and works on Mac, Linux, and Windows with minimal friction.
Meta’s open-weight model family, commonly run on consumer hardware, is known by what name?
Correct! Meta’s Llama (Large Language Model Meta AI) series — including Llama 2 and Llama 3 — has become a cornerstone of the local AI movement. Because Meta releases the weights openly, the community has built countless quantized versions optimized for consumer hardware.
The correct answer is Llama. While Falcon, Gemma (Google), and Mistral are all legitimate open-weight models you can run locally, Meta’s Llama series is arguably the most widely adopted and has the largest ecosystem of community tools and fine-tuned variants.
When running an LLM locally without a dedicated GPU, which hardware component becomes the primary bottleneck for inference speed?
Exactly right! When a GPU isn’t available, LLMs run entirely in system RAM. Both the capacity (you need enough to hold the model) and the memory bandwidth (how fast data moves to the CPU) directly determine inference speed. DDR5 and multi-channel configurations can make a meaningful difference.
The answer is system RAM capacity and bandwidth. While CPU clock speed matters, the real constraint is getting model weights in and out of memory fast enough. A model that doesn’t fit in RAM will either fail to load or spill to disk, causing dramatically slower performance regardless of CPU speed.
What does ‘quantization’ mean in the context of running LLMs on consumer hardware?
Spot on! Quantization reduces the bit-width used to store model weights — for example, from 32-bit floats down to 4-bit integers. This can shrink a model’s memory footprint by 4–8x with only a modest drop in output quality, making billion-parameter models runnable on everyday laptops.
Not quite — quantization means reducing numerical precision of model weights. A 7-billion parameter model at full 32-bit precision might need over 28GB of RAM, but a 4-bit quantized version can fit in around 4–5GB. It’s one of the most important techniques enabling local AI on affordable hardware.
A ‘7B’ model like Llama 3 7B refers to what specification of the model?
Correct! The ‘B’ in model names like 7B, 13B, or 70B stands for billions of parameters — the individual numerical weights that define the model’s behavior. More parameters generally means greater capability, but also higher memory requirements. 7B models strike a sweet spot for consumer hardware.
The answer is 7 billion parameters. Parameters are the learned numerical values inside the neural network that encode everything the model knows. A 7B model has 7 billion of them, which is why even quantized versions need several gigabytes of RAM — and why 70B models remain a challenge for most consumer setups.
Apple Silicon chips like the M1, M2, and M3 are considered exceptionally well-suited for local LLM inference primarily because of what architectural advantage?
That’s right! Apple Silicon’s unified memory architecture means the CPU and GPU share the same high-bandwidth memory pool. A MacBook with 16GB or 32GB of unified RAM can load and run LLMs at speeds that rival or exceed systems with discrete GPUs, making Apple laptops surprisingly competitive for local AI.
The correct answer is unified memory. Apple Silicon doesn’t support CUDA (that’s NVIDIA-specific), but its unified memory design eliminates the bottleneck of transferring data between system RAM and a separate GPU’s VRAM. This lets models run fast even without a discrete graphics card, which is why tools like Ollama and LM Studio perform so well on Macs.
LM Studio is a graphical desktop application for running local LLMs. What is one of its most useful features for beginners?
Exactly! LM Studio gives users a polished GUI to browse, download, and chat with local models — no terminal or coding needed. Its built-in local server also mimics the OpenAI API format, so you can point compatible apps at your own machine instead of the cloud.
The correct answer is its built-in chat interface and local API server. LM Studio is entirely offline and free to use — it doesn’t connect to OpenAI or require a subscription. Its approachable design has made it one of the most popular on-ramps for people exploring local AI for the first time.
If you want to run a quantized 13B parameter LLM locally at a usable speed on a CPU-only system, what is the generally recommended minimum amount of system RAM?
Correct! A 4-bit quantized 13B model typically requires around 8–10GB of RAM just to load, which means 16GB is the practical minimum for a usable experience — leaving some memory for your OS and other processes. Going below that often results in the model using slow disk swap, making inference painfully sluggish.
The answer is 16GB. While a 4-bit quantized 13B model can technically fit in under 10GB, you still need headroom for your operating system and background tasks. With only 8GB total, your system would likely resort to swapping to disk, turning a response that should take seconds into one that takes minutes.
Higher-end GPUs unlock more performance and better models, but even 7B or smaller options can prove useful when applied appropriately. With the right hardware configuration, you can run genuinely useful local LLMs. Before you host your own, though, it’s important to understand that running a model locally doesn’t mean the latest GPT- or Claude-level reasoning. It’s about getting a chatbot to respond in a reasonable time, somewhere up to 10 tokens per second, with coherent, context-aware replies that can help you with everyday tasks.
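If you want to see where your own machine lands, Ollama’s local API reports token counts and generation time for each response. Here’s a minimal sketch, assuming Ollama is running on its default port and you’ve already pulled the model named below; swap in whatever you actually use.

```python
# Rough tokens-per-second check against a local Ollama instance.
# Assumes Ollama is listening on its default port (11434) and the model
# named below has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:4b",  # swap in whichever model you run
        "prompt": "Explain what quantization does to an LLM in two sentences.",
        "stream": False,  # return a single JSON object with timing fields
    },
    timeout=300,
).json()

# Ollama reports how many tokens it generated and how long generation took
# (in nanoseconds), which is enough for a ballpark tokens-per-second figure.
tokens = resp["eval_count"]
seconds = resp["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s ≈ {tokens / seconds:.1f} tokens/sec")
```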
It’s precisely how I have my own LLMs configured, using Open WebUI, Ollama, and a Minisforum U850. This mini PC has an Intel Core i5-10210U processor with four cores and eight threads, a maximum turbo speed of 4.2GHz, and plenty of cache to run some self-hosted apps and containers. So, I decided to launch two LLMs on the hardware: Qwen3:4b and Qwen2.5-coder:7b. These are small models, but they’ve proven useful in handling the queries I throw at them. Turns out, I don’t need a 70B-parameter model. Even an RTX 4060 Ti with 16GB of VRAM was largely overkill.
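If you want to sanity-check what a box like this actually has installed, Ollama exposes a small local API for listing pulled models. A quick sketch, assuming the default local endpoint:

```python
# List the models a local Ollama instance has pulled, with approximate sizes.
# Assumes Ollama is running locally on its default port.
import requests

models = requests.get("http://localhost:11434/api/tags", timeout=10).json()["models"]
for model in models:
    print(f"{model['name']}: ~{model['size'] / 1e9:.1f} GB on disk")
```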
That system alone pulled more than 400 watts when running some of these workloads. That’s fine if you want quick responses and solid context for the job, but it also raises the power bill considerably. The U850 mini PC draws just 50 watts, so I’m saving almost 90% on power while sacrificing token speed and context. Instead of around 25 tokens per second, I’m pushing around 5. It’s not fast by any stretch, but it works surprisingly well, and that’s without a dedicated GPU.
Use what you already have
Old GPUs can work surprisingly well
There’s a whole culture within the community of chasing the highest leaderboard scores with overkill hardware that most people can only dream of owning. That widens the gap between what’s possible with the right parts and what’s actually useful for daily tasks. It’s why laptops and other devices often advertise AI capabilities with lower figures than dedicated GPUs; those tests run the largest models possible for the best accuracy scores, but all that noise doesn’t reflect daily usage. Influencers, media, and other outlets are likely running enthusiast-grade hardware.
This skews the perception of baseline requirements. You don’t even need a GPU to run an LLM. As long as your CPU is supported, you have enough RAM, and you choose a model wisely, you can enjoy a locally hosted chatbot without spending a penny more. Training these LLMs takes far more power than inference, and inference is all that happens locally. There are also countless tweaks you can make to get more from your hardware, which is something I continue to play with to see how far I can push the mini PC. But the real magic for local LLMs is quantization.
By reducing model precision, moving from FP16 to 8-bit, for example, you can save a massive amount of memory for a slight accuracy hit. The 7B coder model I run would typically need 14GB or more of my 16GB of RAM at full precision, but at Q4 (4-bit quantization), it takes up around 5GB, and qwen3:4b needs even less. That changes everything. Even with a powerful GPU at hand, quantization can be the difference between running a smaller model and a larger one with far better capabilities. Integrated graphics can help you avoid a dedicated card altogether, and even an old GTX 1080 with 8GB of VRAM can make a huge difference.
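The numbers are easy to sanity-check yourself. This back-of-the-envelope sketch only counts the weights; real usage adds the KV cache and runtime overhead, which is why a Q4 7B model lands closer to 5GB in practice.

```python
# Rough weights-only memory estimate for a model at different precisions.
# Real-world usage is a bit higher once you add the KV cache and runtime overhead.

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the raw weights in gigabytes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"7B model at {label}: ~{weight_memory_gb(7, bits):.1f} GB")

# Prints roughly: 14.0 GB at FP16, 7.0 GB at Q8, 3.5 GB at Q4
```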
Offloading as much as you can to any GPU can really transform your self-hosted LLM experience, but CPU-only setups are surprisingly capable with optimized runtimes. Old GPUs can handle smaller models with great response times, and choosing the right model can be the difference between diminishing returns on general tasks and something fast, efficient, and largely good enough. It’s also worth remembering that LLMs are continuously evolving. Small models are punching well above their weight these days, thanks to vast improvements in training data, fine-tuning, and instruction-following.
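If you do have any GPU at all, even an aging one, partial offload is the trick. Here’s a hedged sketch using llama-cpp-python, the same engine family Ollama builds on; the GGUF file path is a placeholder for whatever quantized model you have on disk, and the layer count depends on your VRAM.

```python
# Partial GPU offload with llama-cpp-python: push as many layers as your VRAM
# allows to the GPU and leave the rest on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen2.5-coder-7b-q4_k_m.gguf",  # placeholder path to a quantized GGUF
    n_gpu_layers=20,  # number of layers to offload; 0 means CPU-only
    n_ctx=4096,       # context window; bigger windows cost more memory
)

out = llm("Write a Python one-liner that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])
```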
Picking the right models
Use specific LLMs for different tasks
Take my setup here with the U850. I have qwen3:4b for general usage and integration with other platforms, and qwen2.5-coder:7b specifically for coding and related queries. General models can handle most tasks, but LLMs designed for a specific area can outshine larger general-purpose counterparts, unlocking more performance without touching the hardware or making a single tweak.
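In practice, routing between the two can be as simple as a keyword check before the request goes out. This is just a sketch against Ollama’s local chat endpoint, assuming both models are already pulled; Open WebUI or a small classifier could do the same job more gracefully.

```python
# Minimal task-based routing: coding-flavored prompts go to the coder model,
# everything else goes to the general model. Assumes a local Ollama instance
# with both models already pulled.
import requests

CODE_HINTS = ("code", "python", "function", "bug", "regex", "script")

def ask(prompt: str) -> str:
    model = "qwen2.5-coder:7b" if any(hint in prompt.lower() for hint in CODE_HINTS) else "qwen3:4b"
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=300,
    ).json()
    return resp["message"]["content"]

print(ask("Write a Python function that deduplicates a list."))
print(ask("Give me a three-day packing list for a ski trip."))
```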
It’s like a slightly slower Claude or ChatGPT, but nothing leaves my network. I can lock the LLMs down completely, so no external access is allowed. That’s what makes this setup special: I can tap into these models without an active internet connection, and no data is shared with any third party without my consent. It’s also great for travelling, especially if your laptop has enough performance to run a model locally. Host an LLM within the OS, and you’ve got all that power at your fingertips, even several thousand feet up in the air.
That’s not to say you shouldn’t aspire to run powerful GPUs for LLMs, and I’ll likely build a dedicated platform for Open WebUI and Ollama with a discrete card for running larger models. Without going overboard, I could more than double the response speed, which would make heavy coding or research conversations far more enjoyable. But there’s no right or wrong setup for locally hosted LLMs; it all comes down to what you have available and which models you plan on running.