Comparison Between Models
Let's compare the models registered in Ollama.
I plan to build all of my projects with ChatOllama(), so it is useful to keep a record of the VRAM usage of the models registered in Ollama. In this post, I keep a running log of the usage I observe with different models; a minimal sketch of how I call them follows below. The post will be updated continuously, so feel free to refer back to it.
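For context, this is roughly how each model gets invoked (a minimal sketch; it assumes `langchain-ollama` is installed, a local Ollama server is running, and the model tag has already been pulled — the tag shown is just an example):

```python
# Minimal sketch: swapping the `model` tag is all it takes to compare models.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1:latest",  # example tag; any model from the table below works
    temperature=0,
)

response = llm.invoke("Briefly explain what quantization does to an LLM.")
print(response.content)
```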
Comparison Between Models
Attribute | Model | Params | Size (GB) | VRAM Usage (GB) | Agent Tools | Vision | Thinking | Architecture | Quantization |
---|---|---|---|---|---|---|---|---|---|
General | Llama-3.2-3B-Instruct | - | - | 9.6 | - | - | - | - | - |
General | gemma-2-2b-it | - | - | 9.8 | - | - | - | - | - |
General | polyglot-ko-1.3b | - | - | 6.1 | - | - | - | - | - |
Ollama | llama3.1:latest | 8b | 4.9 | 6.0 | ● | - | - | llama | Q4_K_M |
Ollama | llama3.2:latest | 3b | 2.0 | 3.2 | ● | - | - | llama | Q4_K_M |
Ollama | llama3.2-vision:latest | 11b | 6.0 | 10.0 | - | ● | - | llama | Q4_K_M |
Ollama | gemma2:latest | 9b | 5.4 | 8.4 | - | - | - | gemma2 | Q4_0 |
Ollama | gemma3:latest | 4b | 3.3 | 4.5 | - | ● | - | gemma3 | Q4_K_M |
Ollama | PetrosStav/gemma3-tools:12b | 12b | 8.1 | 7.0 | - | ● | - | gemma3 | Q4_K_M |
Ollama | gemma3:12b | 12b | 8.1 | 7.0 | - | ● | - | gemma3 | Q4_K_M |
Ollama | deepseek-r1:14b | 14b | 9.0 | 8.9 | ● | - | ● | qwen2 | Q4_K_M |
Ollama | deepseek-r1:7b | 7b | 4.7 | 5.2 | ● | - | ● | qwen2 | Q4_K_M |
Ollama | qwen3:latest | 8b | 5.2 | 6.4 | ● | - | ● | qwen3 | Q4_K_M |
Ollama | qwen3:14b | 14b | 9.3 | 10 | ● | - | ● | qwen3 | Q4_K_M |
You can check key information about a model using the `ollama show` command. The quantization details are especially important.
As the use of Large Language Models (LLMs) rapidly increases, model compression and optimization have become critical. In edge environments with limited resources, or in services that require real-time responses, it is often not feasible to run large-scale models as-is. In such cases, quantization can significantly reduce memory usage and improve inference speed while preserving as much model accuracy as possible. I'll update this post with further insights after trying it out myself.
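As a rough sanity check, the sizes in the table line up with a simple bits-per-parameter estimate. Note the ~5 bits per weight used for Q4_K_M below is an approximation on my part; the format actually mixes quantization levels across tensors:

```python
# Rough estimate of model size from parameter count and quantization level.
# Assumption: Q4_K_M averages roughly 4.5-5 bits per weight (mixed-precision format).
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(approx_size_gb(12.2, 16))   # FP16 baseline: ~24.4 GB
print(approx_size_gb(12.2, 5.0))  # Q4_K_M estimate: ~7.6 GB (table shows 8.1 GB)
```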
```
PS C:\Users\ycjang> ollama show gemma3:12b
  Model
    architecture        gemma3
    parameters          12.2B
    context length      8192
    embedding length    3840
    quantization        Q4_K_M

  Parameters
    stop           "<end_of_turn>"
    temperature    0.1

  License
    Gemma Terms of Use
    Last modified: February 21, 2024

PS C:\Users\ycjang>
```
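The same metadata can also be read programmatically. Here is a sketch using the official `ollama` Python package (assuming `pip install ollama` and a running server; the field names follow the `/api/show` response and may differ in other client versions):

```python
# Sketch: read model metadata programmatically instead of via the CLI.
import ollama

info = ollama.show("gemma3:12b")
print(info.details.parameter_size)      # e.g. "12.2B"
print(info.details.quantization_level)  # e.g. "Q4_K_M"
```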