imarch.dev · 7 min read

qwen3 vs nomic: swapping embedding models with real numbers

ai · RAG · architecture · product

This is a follow-up to the previous article, where I implemented Hybrid RAG in the chatbot on my website. RAG was working, tokens were being saved, but one problem kept nagging me: Russian queries were retrieving the wrong chunks.


The problem

After launching RAG, I checked how the bot responds to the question “Tell me about the Hybrid RAG article”. The bot answered: “Hybrid RAG: the best of Agile and Waterfall”. An article about token optimization had turned into a project management methodology.

I looked at the logs. The blog_hybrid_rag chunk with the correct text was sitting in Qdrant. But when searching for “Hybrid RAG token optimization” (in Russian), it landed at 44th place out of 48 with a score of 0.583. The first 7 positions were taken by random articles about IT transformations and BPM.

The problem was nomic-embed-text. The model is trained predominantly on English data. Russian text gets converted into a vector, but the semantics are lost: “token optimization” and “management evolution” look roughly the same to it.
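Qdrant with COSINE distance scores every chunk by the angle between the query vector and the chunk vector, so if the model maps two semantically different phrases to nearby vectors, nothing at the search layer can fix it. A minimal illustration of how that score is computed (the 4-dim vectors here are toy numbers, not real embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings": a query and two candidate chunks.
query          = [0.9, 0.1, 0.2, 0.1]
relevant_chunk = [0.8, 0.2, 0.3, 0.1]  # similar direction -> high score
random_chunk   = [0.1, 0.9, 0.1, 0.8]  # different direction -> low score

print(cosine(query, relevant_chunk))
print(cosine(query, random_chunk))
```

An embedding model that "loses" Russian semantics effectively collapses both candidates toward the same direction, and the scores stop discriminating.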

Choosing a replacement

Criteria:

  • Native support for Russian, English, and Kazakh
  • Runs in Ollama (everything is local, no external APIs)
  • Fits in 4 GB RAM on the server alongside Qdrant and the bot

qwen3-embedding was a perfect fit: #1 on the MTEB multilingual leaderboard, three sizes (0.6B / 4B / 8B), available in Ollama. I picked 0.6B - 639 MB, not much heavier than nomic (274 MB).

Model                  Size     Dimensions
nomic-embed-text       274 MB   768
qwen3-embedding:0.6b   639 MB   1024
qwen3-embedding:4b     2.5 GB   2560
qwen3-embedding:8b     4.7 GB   4096

Migration

Swapping an embedding model is not just changing a name. The vector dimension changes (768 -> 1024), which means all Qdrant collections need to be recreated and data re-indexed.

I wrote the code with env vars from the start:

import os

EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
VECTOR_DIM = int(os.environ.get("VECTOR_DIM", "768"))
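Since EMBED_MODEL and VECTOR_DIM must change together, a startup sanity check catches the classic mistake of swapping the model but forgetting the dimension. This guard is hypothetical (not part of the original code); the dimensions come from the table above:

```python
# Hypothetical startup guard: known output dimensions per model,
# taken from the size/dimension table in this post.
KNOWN_DIMS = {
    "nomic-embed-text": 768,
    "qwen3-embedding:0.6b": 1024,
    "qwen3-embedding:4b": 2560,
    "qwen3-embedding:8b": 4096,
}

def check_embed_config(model: str, dim: int) -> None:
    # Fail fast if VECTOR_DIM contradicts the configured model.
    expected = KNOWN_DIMS.get(model)
    if expected is not None and expected != dim:
        raise ValueError(
            f"{model} produces {expected}-dim vectors, but VECTOR_DIM={dim}"
        )

check_embed_config("qwen3-embedding:0.6b", 1024)  # OK, passes silently
```

Unknown models pass through unchecked, so the guard never blocks experimenting with a new model that is not in the dictionary yet.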

Added automatic collection recreation when the dimension changes:

from qdrant_client.models import Distance, VectorParams

def ensure_collection():
    client = get_client()
    collections = [c.name for c in client.get_collections().collections]
    if COLLECTION in collections:
        info = client.get_collection(COLLECTION)
        existing_dim = info.config.params.vectors.size
        if existing_dim != VECTOR_DIM:
            logger.info("Dimension changed %d -> %d, recreating",
                        existing_dim, VECTOR_DIM)
            client.delete_collection(COLLECTION)
        else:
            return
    client.create_collection(
        COLLECTION,
        VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
    )

Deploy:

# On the server
docker exec ollama ollama pull qwen3-embedding:0.6b

# In .env
EMBED_MODEL=qwen3-embedding:0.6b
VECTOR_DIM=1024

# Restart (CI/CD or manually)
docker compose up -d --build chatbot

In the logs on startup:

Vector dimension changed 768 -> 1024, recreating collection
Created Qdrant collection: knowledge (dim=1024)
Knowledge ingestion complete: 48/48 chunks

All 48 chunks re-indexed automatically. Downtime - a few seconds.

Benchmark

12 queries across three languages. For each query - the expected chunk and the actual rank/score from Qdrant. Same 48 chunks, same search_text, only the embedding model differs.
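The benchmark loop itself is simple: embed each query, run the same Qdrant search, and record where the expected chunk lands. The rank extraction can be sketched like this (search_hits is a stand-in for Qdrant's scored results; the names and the example data are illustrative, not the real benchmark output):

```python
def rank_of_expected(search_hits, expected_id):
    """Return (rank, score) of the expected chunk, 1-based,
    or (None, None) if it is missing from the results."""
    for rank, (chunk_id, score) in enumerate(search_hits, start=1):
        if chunk_id == expected_id:
            return rank, score
    return None, None

# Illustrative: the target chunk sitting behind unrelated articles.
hits = [("blog_it_transformation", 0.672),
        ("blog_bpm_intro", 0.655),
        ("blog_hybrid_rag", 0.583)]

print(rank_of_expected(hits, "blog_hybrid_rag"))  # -> (3, 0.583)
```

The "#N score" pairs in the tables below are exactly this: the position and cosine score of the expected chunk in the full ranked list of 48.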

Russian

Query (asked in Russian)            Expected chunk      nomic       qwen3
Tell me about the bank experience   career_bank_head    #5  0.636   #1  0.658
Work experience at a large bank     career_bank_head    #1  0.715   #1  0.683
What services does he offer         services_overview   #2  0.688   #1  0.612
Where did Ilyas study               education_certs     #2  0.666   #1  0.561
Hybrid RAG optimization             blog_hybrid_rag     #44 0.583   #1  0.719
How is the chatbot built            blog_hybrid_rag     #14 0.613   #5  0.636
Where did Ilyas work                career_overview     #11 0.661   #1  0.663
The article about the 4 bugs        blog_four_bugs      #19 0.599   #3  0.553

The key row: “Hybrid RAG optimization” (in Russian). nomic ranked the target chunk 44th. qwen3 ranked it 1st.

English

Query                            Expected chunk      nomic       qwen3
What services does Ilyas offer   services_overview   #1  0.719   #1  0.825
banking career experience        career_bank_head    #2  0.589   #1  0.733
AI agents and LLM                service_ai_agents   #1  0.715   #1  0.823

In English, both models find the right chunk. But qwen3 scores the target at 0.82-0.83 vs nomic's 0.59-0.72: a higher score means a wider margin over the noise, and less chance of an irrelevant chunk slipping into the context.

Kazakh

Query (asked in Kazakh)          Expected chunk      nomic       qwen3
Ilyas's experience at the bank   career_bank_head    #14 0.575   #6  0.503

Kazakh remains the hardest language for both models. But qwen3 at least lands in the top-7 (our retrieve_top_k), while nomic does not.

Summary

Metric                              nomic     qwen3
Chunk at #1                         3 / 12    9 / 12
Chunk in top-7 (retrieval window)   5 / 12    12 / 12

nomic found the target chunk within the retrieval window 42% of the time. qwen3 - 100%.

Speed

Metric                                   nomic-embed-text   qwen3-embedding:0.6b
Embedding latency (avg over 5 queries)   144 ms             134 ms
Ingestion (48 chunks)                    5.4 sec            9.5 sec
Model size                               274 MB             639 MB

Per-query latency is nearly identical, qwen3 is even slightly faster. Ingestion is slower (9.5 vs 5.4 sec), but that is a one-time operation on container startup.

Why nomic loses on Russian

nomic-embed-text is trained on English data. It “understands” Russian words but cannot distinguish their semantics. For it, “token optimization” and “management evolution” (in Russian) are roughly the same thing: a set of Cyrillic characters with a similar structure.

qwen3-embedding is trained on 100+ languages, including Russian. It understands that “token optimization” (in Russian) is closer to “token cost reduction” than to “project management evolution”.

Visually, it looks like this: nomic gives scores of 0.58-0.72 for all 48 chunks - a narrow corridor where the signal is lost in noise. qwen3 gives 0.50-0.83 - wider spread, the target chunk stands out clearly.
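That separation can be quantified as a margin: how far the target chunk's score sits above the bulk of the corpus. A rough sketch, using illustrative score lists shaped like the ranges above (not the actual 48 scores):

```python
from statistics import median

def target_margin(target_score, other_scores):
    # How far the target chunk stands above the median of the noise.
    return target_score - median(other_scores)

# Illustrative distributions matching the observed corridors.
nomic_others = [0.58, 0.60, 0.61, 0.63, 0.65, 0.68, 0.70, 0.72]
qwen3_others = [0.50, 0.52, 0.54, 0.55, 0.57, 0.60, 0.62, 0.66]

print(target_margin(0.583, nomic_others))  # negative: target buried in noise
print(target_margin(0.719, qwen3_others))  # positive: target stands out
```

A negative margin is exactly the "Hybrid RAG at rank 44" failure mode: the target scores below the median of unrelated chunks, so no retrieval window short of the whole corpus will catch it.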

What was not measured

This benchmark is not MTEB. 48 chunks, 12 queries, one use case. The results show a specific improvement for a specific product: a multilingual chatbot on a personal website handling Russian, English, and Kazakh queries.

For a purely English RAG, nomic may be sufficient. For any project with Russian or other non-English languages, qwen3-embedding is the clear winner.

How to try it yourself

If you already have a RAG setup on Ollama + Qdrant (or you built one following the previous article):

# Pull the model
docker exec ollama ollama pull qwen3-embedding:0.6b

# Add to .env
EMBED_MODEL=qwen3-embedding:0.6b
VECTOR_DIM=1024

# Restart
docker compose up -d --build

If your code recreates the collection when the dimension changes, data will be re-indexed automatically. If not, delete the collection manually via the Qdrant API.
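For the manual route, Qdrant exposes collection deletion over its REST API (a DELETE on the collection endpoint). Assuming the default port and a collection named knowledge, the fragment looks like this; restart the bot afterwards so ingestion rebuilds the collection:

```shell
# Delete the stale collection (replace "knowledge" with your collection name)
curl -X DELETE http://localhost:6333/collections/knowledge

# Restart so the app recreates the collection and re-ingests
docker compose up -d --build
```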


The bot is running on qwen3-embedding right now - chat button in the bottom right corner. Try asking something in Russian and compare with how it worked before. Or get in touch if you want a similar system for your product.
