How fast is Leap 16?

How fast is Leap 16 compared to Slowroll?

Next week, I will be installing a new laptop and I need speed. It’s for running ollama with gemma3 27b at 8-bit quantization (q8_0).

The laptop will be: (no inxi, no laptop yet)

  • Intel Core Ultra 9
  • 32 GB RAM
  • NVIDIA RTX 4060, 8 GB
  • 1 TB + 2 TB = 3 TB SSD

Right now, on a slightly slower laptop, my “standard” ollama query takes between 17 and 22 minutes, with the CPU at around 95°C and the GPU at around 55°C.

Ollama reports 75% cpu and 25% gpu.
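
(For what it’s worth, that split is what ollama ps reports while a model is loaded - the PROCESSOR column shows the CPU/GPU ratio. Roughly like this; the size and timing below are illustrative:)

$ ollama ps
NAME                  ID    SIZE     PROCESSOR          UNTIL
gemma3:27b-it-q8_0    …     30 GB    75%/25% CPU/GPU    4 minutes from now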

Would Leap 16 run faster than Slowroll, or the other way around?

Thanks

@elfroggio That setup should be rocking on the 4060 at 100%, sounds like you have some misconfigured hardware. I believe @hendersj is using Ollama on a similar device. My experience on Leap 15 with Ollama (k3s) and the NVIDIA container runtime was that it rocked the Nvidia Tesla P4 at full speed, both RAM and CUDA cores.

Yeah, I’m on TW rather than Leap, but running ollama with a 3090ti with 24gb of vram.

I’m not using that particular model, though - what’s the name of the model you pulled into ollama?

The issue may be that it’s too big for the vram - you can use nvtop to see what the GPU memory utilization looks like. I use the non-quantized model, and that one is 17 gb in size (the 27b-it-qat model is 18 gb):

$ ollama ls
NAME                   ID              SIZE      MODIFIED   
gemma3:27b             a418f5838eaf    17 GB     8 days ago    
gemma3:1b              8648f39daa8f    815 MB    8 days ago    
gemma3:12b             f4031aab637d    8.1 GB    8 days ago    
gemma3:27b-it-qat      29eb0b9aeda3    18 GB     8 days ago    
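
If you don’t have nvtop installed, nvidia-smi gives the same quick read on VRAM; the numbers below are just an illustration of what a nearly-full 8 GB card would look like:

$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv
memory.used [MiB], memory.total [MiB]
7890 MiB, 8188 MiB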

Yes, I know that. My current video card also has 8 GB (it’s another laptop). With a laptop, I don’t get to choose the video card or the amount of GPU RAM.

I use gemma3:27b-it-q8_0, which is 30 GB. So most of it runs on the CPU, at about 5 tokens/sec.
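
(The serve log confirms the split when the model loads: it prints an “offload” line saying how many layers ended up on the GPU. The exact wording changes between ollama versions, and the layer counts below are only illustrative:)

$ ollama serve 2>&1 | grep -i offload
... offloaded 12/63 layers to GPU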

I’ve tried other llms and other quantizations of the gemmas but gemma3:27b-it-q8_0 gives me the best results for my usage.

I use it for writing and I have very extensive instructions: persona, tasks, audience, constraints, and output format.

So, since I’m getting this new laptop and was going to install openSUSE, I was wondering which would be faster. Leap 15.6 was compiled with an old version of GCC, and Leap 16 is built with GCC 13. Isn’t Tumbleweed compiled with GCC 14?
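
(I suppose I can check that myself after each install - both the userspace compiler, if gcc is installed, and the compiler the running kernel was built with. The version strings below are placeholders:)

$ gcc --version | head -n1
gcc (SUSE Linux) 13.x ...
$ cat /proc/version
Linux version 6.x ... (gcc (SUSE Linux) 13.x ...) ...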

As I understand it, it’s either all or nothing. When I run out of vram (for example, if I try to run two models at the same time), the performance tanks.

You’re entirely CPU-bound here, even though ollama says it’s using some of the GPU (which would really surprise me).

You’re not going to see a significant difference regardless of which Linux distro you use - CPU-bound is going to always be pretty slow with a model that large - there really isn’t a way around that in my experience.

You’ll best be served, if you’re getting a new laptop, to either see if you can find one with a better GPU (i.e., more vram) or look for one with a faster CPU - but CPU performance vs. GPU performance is going to be staggeringly different. I have an i9-10980XE with 36 threads, and the performance of ollama when it’s CPU-bound is still quite slow compared to running on the GPU (not quite 17-22 minutes per query, but it’s definitely multiple minutes vs. seconds).

Slowroll (which is based on TW) uses a newer kernel, so any performance improvements from the kernel will be in TW first, then in Slowroll, and then eventually will make their way to Leap.

You might shave 1-2 minutes off the average query time against this LLM with a better CPU and different compiler optimizations. Maybe. I don’t expect that you’ll see a noticeable difference given that the workload you’re trying to run isn’t optimized for running on the CPU.


I don’t think you will find any significant difference between Leap 16 and Slowroll unless it is GPU related.

I tried:

$ ollama run --verbose gemma3:27b-it-q8_0
>>> How fast will gemma3:27b-it-q8_0 on a i7-9700K with 64GB RAM?

It has very verbose output but at the end it gives:

<snip>
**In summary:** Your i7-9700K and 64GB of RAM are capable of running Gemma 3B 27B-it-q8_0, 
and you should see reasonable performance (10-25 t/s). The exact speed will depend on your 
software setup and the task. Be prepared to experiment with different settings to optimize performance.

Let me know if you'd like more detailed instructions on setting up llama.cpp or Ollama, or if you have any other questions!

total duration:       27m7.26034626s
load duration:        584.455373ms
prompt eval count:    43 token(s)
prompt eval duration: 5.529993241s
prompt eval rate:     7.78 tokens/s
eval count:           1429 token(s)
eval duration:        27m1.145177416s
eval rate:            0.88 tokens/s

So while it predicts 10-25 t/s, I get 0.88 t/s.

I am reasonably happy with deepseek-coder-v2:latest running locally.

@elfroggio and is resizable bar set in the BIOS? You can also check with journalctl -b | grep Resizable
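
lspci will also show it from the device side, once you know the GPU’s PCI address (the address, capability offset and sizes below are just an example):

$ sudo lspci -vv -s 01:00.0 | grep -A2 -i "resizable bar"
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 8GB, supported: ...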

Have you looked at the /etc/sysconfig/ollama environment variables?
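
The variables in there are the standard upstream ollama environment variables; as a sketch of the kind of thing you can tune (example values only, not a recommendation):

# /etc/sysconfig/ollama - example values only
OLLAMA_HOST="127.0.0.1:11434"       # address:port the API listens on
OLLAMA_KEEP_ALIVE="5m"              # how long a model stays loaded after a request
OLLAMA_NUM_PARALLEL="1"             # concurrent requests per loaded model
OLLAMA_MAX_LOADED_MODELS="1"        # how many models may sit in memory at once
OLLAMA_CONTEXT_LENGTH="8192"        # default context window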

This is from last year with a k3s setup…

As I thought more about this last night: what the kernel is compiled with is going to have minimal impact on application performance. The two ways in which it might affect anything are in how the kernel handles threading/multitasking, and in the points at which ollama (or any app) interfaces with the hardware.

I said it before, but it bears repeating: The bottleneck isn’t the kernel - it’s the hardware. Kernel optimization isn’t going to make a lot of difference.

Yes, I realize that, but a 5% kernel improvement would save at least one minute. It doesn’t look like much, but…

I never do just a single query. I follow up with questions and refine the prompt - it could be 3 rounds; once it was 10… and it adds up. That’s why I was asking the question.

To afford a desktop with an A100, I would have to sell both kidneys…

sudo journalctl -b | grep Resizable

Nothing. My current laptop, a Dell, does not have this BIOS setting.

I have disabled the service. I use the command line:

OLLAMA_CONTEXT_LENGTH=8192 ollama serve

With the new laptop, I will be making some model files.
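
Roughly what I have in mind is a Modelfile that bakes in the context length and my standing instructions, so I don’t have to paste them every time (the system prompt below is just a placeholder):

# Modelfile (sketch)
FROM gemma3:27b-it-q8_0
PARAMETER num_ctx 8192
SYSTEM """
You are <persona>. Write for <audience>.
Constraints: ...
Output format: ...
"""

$ ollama create gemma3-writing -f Modelfile
$ ollama run gemma3-writing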

Then I’d go with the latest possible kernel for the most possible kernel optimization.

But I’d probably also look to do a custom build of ollama that uses those optimizations as well. Or I’d look to a hosted solution, renting the equipment for what I needed and only what I needed.

So I have done a setup with k3s/nvidia/ollama/open-webui on Leap 16.0. Smaller models, e.g. llama3, run fine and serve up responses using just the GPU. Running the gemma3:27b-it-q8_0 one, GPU usage is very erratic (it maxes out at 32%) and it’s more core-bound (only 8 cores)… it uses 6.5 GiB of the 7.5 GiB of VRAM and needs more than 32 GB of RAM…
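
If anyone wants to poke at the same thing on a k3s node: the device plugin exposes the card as a node resource, and you can run nvidia-smi inside the ollama pod (the namespace and deployment names below are placeholders for whatever your manifests use):

$ kubectl describe node | grep -A2 "nvidia.com/gpu"
$ kubectl -n ollama exec deploy/ollama -- nvidia-smi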

My needs are simple, so for me the kubernetes solution works great…
