
Running Ollama on an Old Dell XPS With an NVIDIA 3070: A Practical Local LLM Test Against Fireworks, Together.ai, and OVHcloud

April 17, 2026

Colin Smillie

Founder, Developer, AI Researcher

Key finding

A Dell XPS desktop with an NVIDIA RTX 3070 and 8GB of VRAM, running Ollama with qwen2.5:7b-instruct, responded in 2.8 seconds. That was faster than Fireworks.ai (3.8s) and OVHcloud (9.9s), though still slower than Together.ai (1.2s). Older consumer hardware can deliver competitive inference latency when paired with a right-sized model and a configuration that keeps the model resident in VRAM.

There is something appealing about taking an aging desktop and turning it into a usable private AI node.

In this case, I used my son's old Dell XPS desktop with an NVIDIA RTX 3070 and 8GB of VRAM to see how far I could push a local inference setup with Ollama. The goal was not to build the fastest AI stack in the world. It was to test whether older consumer hardware could still deliver acceptable latency for real work, while giving me more control over cost, privacy, and deployment.

After testing a few models, I settled on qwen2.5:7b-instruct as the best fit for the machine. With only 8GB of VRAM available, model size becomes a constraint almost immediately. Larger models could be made to run, but they pushed the system harder, increased the chance of slow responses, and made the setup feel less practical for day-to-day use. Qwen2.5 7B hit the right balance between capability and feasibility on this hardware.

Why this experiment mattered

There are already excellent hosted inference providers. Services like Together.ai, Fireworks.ai, and OVHcloud make it easy to call powerful models through an API without worrying about local hardware, drivers, or runtime management.

But local inference still has real advantages. Lower marginal cost once the hardware is available. More control over where data goes. The ability to experiment without depending entirely on third-party endpoints. And a path toward lightweight private infrastructure for niche workloads.

I also wanted to see how close a local setup on old hardware could get to commercial API latency.

The hardware and setup

The machine was an older Dell XPS desktop with an NVIDIA RTX 3070 and 8GB of VRAM. That makes it a good test case for a lot of people who have a gaming PC or a retired workstation sitting around.

The local inference stack used:

  • Ollama for local model serving
  • qwen2.5:7b-instruct as the selected model
  • Tailscale to connect the machine securely to my VPS
  • A persistence approach in Ollama to keep the model loaded and avoid unnecessary cold start delays

That last point matters. One of the biggest issues with local inference is that if the model unloads between requests, latency gets worse fast. If the goal is to use the machine as a real service endpoint, not just a one-off experiment, you need to reduce the reload penalty. Keeping the model resident made the system much more usable.
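
The post above doesn't pin down the exact persistence mechanism, so here is a minimal sketch of one common approach: passing keep_alive on Ollama's HTTP API, where -1 asks the server not to unload the model between requests. The host, port, and model name are the defaults assumed in this article.

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama port
MODEL = "qwen2.5:7b-instruct"

# Generate a response and keep the model resident afterwards.
# keep_alive accepts a duration string like "30m", or -1 to keep the
# model loaded indefinitely.
resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": MODEL,
        "prompt": "What is a zoning certificate in Toronto?",
        "stream": False,
        "keep_alive": -1,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```

The same effect can be applied server-wide by setting the OLLAMA_KEEP_ALIVE environment variable before starting the Ollama service, so every request inherits the longer residency window.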

Why Tailscale helped

Rather than expose the old Dell directly, I used Tailscale to connect it into my existing infrastructure through a private mesh network. That made it much easier to reach the Ollama instance securely from my VPS without opening ports to the public internet or building a more complicated VPN setup.

This pattern is especially useful if you want to experiment with hybrid infrastructure:

  • Public VPS for orchestration, APIs, or routing
  • Private local GPU machine for inference
  • Secure connection between the two without a lot of networking overhead

For small-scale projects, that is a surprisingly effective architecture.
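
As a rough illustration of that pattern, here is a minimal sketch of the VPS side, assuming the Dell is reachable on the tailnet under a hypothetical MagicDNS name (dell-xps) and that Ollama on that machine has been bound to its Tailscale address rather than localhost only.

```python
import requests

# Hypothetical tailnet hostname of the GPU box. Ollama listens on
# 127.0.0.1 by default, so OLLAMA_HOST must be set to 0.0.0.0 (or the
# machine's Tailscale IP) on the Dell for this to be reachable.
OLLAMA_NODE = "http://dell-xps:11434"

def ask_local_gpu(prompt: str, model: str = "qwen2.5:7b-instruct") -> str:
    """Send a prompt from the VPS to the Ollama node over the tailnet."""
    resp = requests.post(
        f"{OLLAMA_NODE}/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

if __name__ == "__main__":
    print(ask_local_gpu("What is a zoning certificate in Toronto?"))
```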

Model testing and the decision to use qwen2.5:7b-instruct

With only 8GB of VRAM, model choice is constrained by reality, not ambition.

I tested a few options, but qwen2.5:7b-instruct came out as the practical winner for this box. It was small enough to run comfortably, capable enough to produce strong answers, and a better operational fit than trying to force a larger model into limited VRAM.

This is an important lesson in local AI infrastructure: the best model is not always the biggest one. It is the one that fits the hardware well enough to produce consistent, usable results.

Latency test

To get a rough comparison, I ran the same question across four providers: Fireworks.ai, Together.ai, OVHcloud, and Ollama running locally on the Dell XPS. The prompt was: “What is a zoning certificate in Toronto?”
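
The article doesn't describe the measurement harness, but a simple wall-clock timing of a non-streaming request is enough for a rough comparison like this one. The sketch below times the local Ollama node; the hosted providers can be timed the same way through their own endpoints.

```python
import time
import requests

PROMPT = "What is a zoning certificate in Toronto?"

def time_ollama(url: str = "http://localhost:11434",
                model: str = "qwen2.5:7b-instruct") -> None:
    """Time a single non-streaming completion against a local Ollama node."""
    start = time.perf_counter()
    resp = requests.post(
        f"{url}/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    elapsed_ms = (time.perf_counter() - start) * 1000
    body = resp.json()
    # eval_count is the number of output tokens Ollama generated.
    print(f"{elapsed_ms:.0f} ms, {body.get('eval_count', '?')} output tokens, "
          f"{len(body['response'])} chars")

if __name__ == "__main__":
    time_ollama()
```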

Provider                 Model                  Latency   Tokens   Answer Length
Together.ai (fastest)    Qwen2.5-7B-Turbo       1.2s      2,098    633 chars
Ollama (local)           qwen2.5:7b-instruct    2.8s      2,148    942 chars
Fireworks.ai             Qwen3-8B               3.8s      2,438    1,157 chars
OVHcloud                 Qwen3-32B              9.9s      2,368    829 chars

What the results suggest

The first takeaway is that Together.ai was the fastest in this test at 1206 ms.

The second is that local Ollama performed surprisingly well at 2793 ms. It was slower than Together.ai, but faster than Fireworks.ai in this run, and dramatically faster than OVHcloud.

That is a useful result. A recycled desktop with a 3070 is not supposed to beat the convenience of hosted inference. But it does not need to. It only needs to be good enough to justify itself for certain workloads. In this case, it looked viable.

A few observations stand out:

  • Together.ai delivered the best latency in this sample
  • Ollama on local hardware was competitive enough to be interesting
  • Fireworks.ai was slower than the local setup in this specific run
  • OVHcloud showed the highest latency by a wide margin

This does not prove that one provider is always better than another. It only shows what happened in this particular test, with this prompt, at this moment. But it does reinforce the idea that a modest local GPU machine can hold its own better than many people expect.

The hidden value of local inference

The value of a setup like this is not just latency. It is also about control.

A local Ollama node gives you:

  • Predictable access to a model you choose
  • No per-request API billing
  • More privacy for sensitive prompts
  • Flexibility to experiment with deployment patterns
  • A better understanding of what modern inference actually requires

It also changes the economics for hobby projects, internal tools, and research environments. If you already have the hardware, the question is no longer “can I afford to test this?” but “is the operational complexity worth it?”

For many experiments, the answer is yes.

Avoiding cold starts matters more than people think

One of the more important practical lessons was the need to keep the model loaded.

If Ollama has to reload the model repeatedly, the user experience degrades fast. That makes any local setup feel slower than it really is. Persisting the model in memory reduced that issue and made the machine behave more like a steady service than a hobby box.

That is a key difference between a demo and an actual usable endpoint.
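
One way to confirm the model is actually staying resident is Ollama's /api/ps endpoint, which lists the models currently loaded along with their VRAM footprint and scheduled unload time. A minimal check might look like this (localhost and the exact field names assume a reasonably recent Ollama release).

```python
import requests

# List the models currently loaded into memory on the Ollama node.
# Each entry includes the model name, VRAM usage, and an expires_at
# timestamp showing when the server plans to unload it.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()
for m in resp.json().get("models", []):
    print(m["name"], m.get("size_vram"), m.get("expires_at"))
```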

Final thoughts

This experiment made one thing clear: older consumer hardware still has real value in the local AI stack.

A Dell XPS with an RTX 3070 and 8GB of VRAM is not a datacenter server. But paired with Ollama, connected through Tailscale, and running a right-sized model like qwen2.5:7b-instruct, it can become a credible inference node.

For anyone exploring private AI infrastructure, hybrid deployments, or just trying to stretch the value of hardware they already own, this kind of setup is worth testing.

Hosted inference will still win on convenience, scale, and often latency. But local inference is no longer just a novelty. With the right model and a practical architecture, it can be good enough to matter.

Frequently asked questions

Can you run Ollama on an NVIDIA RTX 3070?

Yes. The RTX 3070 with 8GB of VRAM can run 7B parameter models like qwen2.5:7b-instruct comfortably through Ollama. Larger models are possible but push the hardware harder and produce less consistent response times.

How fast is local Ollama compared to cloud inference providers?

In our test, Ollama on a Dell XPS with an RTX 3070 responded in 2.8 seconds, faster than Fireworks.ai at 3.8 seconds and OVHcloud at 9.9 seconds. Together.ai was the fastest at 1.2 seconds. Local inference was competitive with commercial APIs.

What LLM model works best with 8GB of VRAM?

qwen2.5:7b-instruct was the best fit for 8GB of VRAM in our testing. It ran comfortably without pushing the GPU too hard, produced strong answers, and offered a good balance between model capability and hardware feasibility.

How do you avoid cold starts in Ollama?

Configure Ollama to keep the model loaded in memory between requests. If the model unloads and has to reload for each request, latency increases significantly. Persisting the model in VRAM makes the local setup behave like a steady service rather than a one-off experiment.

Is local LLM inference practical for production use?

For small-scale projects, internal tools, and research workloads, yes. A local Ollama node gives you predictable access, no per-request billing, and more privacy. Hosted providers still win on convenience and scale, but local inference on consumer hardware is no longer just a novelty.