imarch.dev · 7 min read

qwen3 vs nomic: swapping embedding models with real numbers

ai · RAG · architecture · product

This is a follow-up to the previous article, where I implemented Hybrid RAG in the chatbot on my website. RAG was working, tokens were being saved, but one problem kept nagging me: Russian queries were retrieving the wrong chunks.


The problem

After launching RAG, I checked how the bot responds to the question “Tell me about the Hybrid RAG article”. The bot answered: “Hybrid RAG: the best of Agile and Waterfall”. An article about token optimization had turned into a project management methodology.

I looked at the logs. The blog_hybrid_rag chunk with the correct text was sitting in Qdrant. But when searching for “Hybrid RAG token optimization” (in Russian), it landed at 44th place out of 48 with a score of 0.583. The first 7 positions were taken by random articles about IT transformations and BPM.

The problem was nomic-embed-text. The model is trained predominantly on English data. Russian text gets converted into a vector, but the semantics are lost: “token optimization” and “management evolution” look roughly the same to it.
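Qdrant with COSINE distance scores every chunk by the angle between the query vector and the chunk vector, so if the model maps two semantically different phrases to nearby vectors, nothing at the search layer can fix it. A minimal illustration of how that score is computed (the 4-dim vectors here are toy numbers, not real embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings": a query and two candidate chunks.
query          = [0.9, 0.1, 0.2, 0.1]
relevant_chunk = [0.8, 0.2, 0.3, 0.1]  # similar direction -> high score
random_chunk   = [0.1, 0.9, 0.1, 0.8]  # different direction -> low score

print(cosine(query, relevant_chunk))
print(cosine(query, random_chunk))
```

An embedding model that "loses" Russian semantics effectively collapses both candidates toward the same direction, and the scores stop discriminating.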

Choosing a replacement

Criteria:

  • Native support for Russian, English, and Kazakh
  • Runs in Ollama (everything is local, no external APIs)
  • Fits in 4 GB RAM on the server alongside Qdrant and the bot

qwen3-embedding was a perfect fit: #1 on the MTEB multilingual leaderboard, three sizes (0.6B / 4B / 8B), available in Ollama. I picked 0.6B - 639 MB, not much heavier than nomic (274 MB).

Model                  Size     Dimensions
nomic-embed-text       274 MB   768
qwen3-embedding:0.6b   639 MB   1024
qwen3-embedding:4b     2.5 GB   2560
qwen3-embedding:8b     4.7 GB   4096

Migration

Swapping an embedding model is not just changing a name. The vector dimension changes (768 -> 1024), which means all Qdrant collections need to be recreated and data re-indexed.

I wrote the code with env vars from the start:

import os

EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
VECTOR_DIM = int(os.environ.get("VECTOR_DIM", "768"))
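Since EMBED_MODEL and VECTOR_DIM must change together, a startup sanity check catches the classic mistake of swapping the model but forgetting the dimension. This guard is hypothetical (not part of the original code); the dimensions come from the table above:

```python
# Hypothetical startup guard: known output dimensions per model,
# taken from the size/dimension table in this post.
KNOWN_DIMS = {
    "nomic-embed-text": 768,
    "qwen3-embedding:0.6b": 1024,
    "qwen3-embedding:4b": 2560,
    "qwen3-embedding:8b": 4096,
}

def check_embed_config(model: str, dim: int) -> None:
    # Fail fast if VECTOR_DIM contradicts the configured model.
    expected = KNOWN_DIMS.get(model)
    if expected is not None and expected != dim:
        raise ValueError(
            f"{model} produces {expected}-dim vectors, but VECTOR_DIM={dim}"
        )

check_embed_config("qwen3-embedding:0.6b", 1024)  # OK, passes silently
```

Unknown models pass through unchecked, so the guard never blocks experimenting with a new model that is not in the dictionary yet.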

Added automatic collection recreation when the dimension changes:

from qdrant_client.models import Distance, VectorParams

def ensure_collection():
    client = get_client()
    collections = [c.name for c in client.get_collections().collections]
    if COLLECTION in collections:
        info = client.get_collection(COLLECTION)
        existing_dim = info.config.params.vectors.size
        if existing_dim != VECTOR_DIM:
            logger.info("Dimension changed %d -> %d, recreating",
                        existing_dim, VECTOR_DIM)
            client.delete_collection(COLLECTION)
        else:
            return
    client.create_collection(
        COLLECTION,
        VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
    )

Deploy:

# On the server
docker exec ollama ollama pull qwen3-embedding:0.6b

# In .env
EMBED_MODEL=qwen3-embedding:0.6b
VECTOR_DIM=1024

# Restart (CI/CD or manually)
docker compose up -d --build chatbot

In the logs on startup:

Vector dimension changed 768 -> 1024, recreating collection
Created Qdrant collection: knowledge (dim=1024)
Knowledge ingestion complete: 48/48 chunks

All 48 chunks re-indexed automatically. Downtime - a few seconds.

Benchmark

12 queries across three languages. For each query - the expected chunk and the actual rank/score from Qdrant. Same 48 chunks, same search_text, only the embedding model differs.
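The benchmark loop itself is simple: embed each query, run the same Qdrant search, and record where the expected chunk lands. The rank extraction can be sketched like this (search_hits is a stand-in for Qdrant's scored results; the names and the example data are illustrative, not the real benchmark output):

```python
def rank_of_expected(search_hits, expected_id):
    """Return (rank, score) of the expected chunk, 1-based,
    or (None, None) if it is missing from the results."""
    for rank, (chunk_id, score) in enumerate(search_hits, start=1):
        if chunk_id == expected_id:
            return rank, score
    return None, None

# Illustrative: the target chunk sitting behind unrelated articles.
hits = [("blog_it_transformation", 0.672),
        ("blog_bpm_intro", 0.655),
        ("blog_hybrid_rag", 0.583)]

print(rank_of_expected(hits, "blog_hybrid_rag"))  # -> (3, 0.583)
```

The "#N score" pairs in the tables below are exactly this: the position and cosine score of the expected chunk in the full ranked list of 48.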

Russian

Query (asked in Russian)            Expected chunk      nomic       qwen3
Tell me about the bank experience   career_bank_head    #5  0.636   #1  0.658
Work experience at a large bank     career_bank_head    #1  0.715   #1  0.683
What services does he offer         services_overview   #2  0.688   #1  0.612
Where did Ilyas study               education_certs     #2  0.666   #1  0.561
Hybrid RAG optimization             blog_hybrid_rag     #44 0.583   #1  0.719
How is the chatbot built            blog_hybrid_rag     #14 0.613   #5  0.636
Where did Ilyas work                career_overview     #11 0.661   #1  0.663
The article about the 4 bugs        blog_four_bugs      #19 0.599   #3  0.553

The key row: “Hybrid RAG optimization” (in Russian). nomic ranked the target chunk 44th. qwen3 ranked it 1st.

English

Query                            Expected chunk      nomic       qwen3
What services does Ilyas offer   services_overview   #1  0.719   #1  0.825
banking career experience        career_bank_head    #2  0.589   #1  0.733
AI agents and LLM                service_ai_agents   #1  0.715   #1  0.823

In English, both models find the right chunk. But qwen3 scores the target at 0.82-0.83 vs nomic's 0.59-0.72: a higher score means a wider margin over the noise, and less chance of an irrelevant chunk slipping into the context.

Kazakh

Query (asked in Kazakh)          Expected chunk      nomic       qwen3
Ilyas's experience at the bank   career_bank_head    #14 0.575   #6  0.503

Kazakh remains the hardest language for both models. But qwen3 at least lands in the top-7 (our retrieve_top_k), while nomic does not.

Summary

Metric                              nomic     qwen3
Chunk at #1                         3 / 12    9 / 12
Chunk in top-7 (retrieval window)   5 / 12    12 / 12

nomic found the target chunk within the retrieval window 42% of the time. qwen3 - 100%.

Speed

Metric                                   nomic-embed-text   qwen3-embedding:0.6b
Embedding latency (avg over 5 queries)   144 ms             134 ms
Ingestion (48 chunks)                    5.4 sec            9.5 sec
Model size                               274 MB             639 MB

Per-query latency is nearly identical, qwen3 is even slightly faster. Ingestion is slower (9.5 vs 5.4 sec), but that is a one-time operation on container startup.

Why nomic loses on Russian

nomic-embed-text is trained on English data. It “understands” Russian words but cannot distinguish their semantics. For it, “token optimization” and “management evolution” (in Russian) are roughly the same thing: a set of Cyrillic characters with a similar structure.

qwen3-embedding is trained on 100+ languages, including Russian. It understands that “token optimization” (in Russian) is closer to “token cost reduction” than to “project management evolution”.

Visually, it looks like this: nomic gives scores of 0.58-0.72 for all 48 chunks - a narrow corridor where the signal is lost in noise. qwen3 gives 0.50-0.83 - wider spread, the target chunk stands out clearly.
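That separation can be quantified as a margin: how far the target chunk's score sits above the bulk of the corpus. A rough sketch, using illustrative score lists shaped like the ranges above (not the actual 48 scores):

```python
from statistics import median

def target_margin(target_score, other_scores):
    # How far the target chunk stands above the median of the noise.
    return target_score - median(other_scores)

# Illustrative distributions matching the observed corridors.
nomic_others = [0.58, 0.60, 0.61, 0.63, 0.65, 0.68, 0.70, 0.72]
qwen3_others = [0.50, 0.52, 0.54, 0.55, 0.57, 0.60, 0.62, 0.66]

print(target_margin(0.583, nomic_others))  # negative: target buried in noise
print(target_margin(0.719, qwen3_others))  # positive: target stands out
```

A negative margin is exactly the "Hybrid RAG at rank 44" failure mode: the target scores below the median of unrelated chunks, so no retrieval window short of the whole corpus will catch it.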

What was not measured

This benchmark is not MTEB. 48 chunks, 12 queries, one use case. The results show a specific improvement for a specific product: a multilingual chatbot on a personal website handling Russian, English, and Kazakh queries.

For a purely English RAG, nomic may be sufficient. For any project with Russian or other non-English languages, qwen3-embedding is the clear winner.

How to try it yourself

If you already have a RAG setup on Ollama + Qdrant (or you built one following the previous article):

# Pull the model
docker exec ollama ollama pull qwen3-embedding:0.6b

# Add to .env
EMBED_MODEL=qwen3-embedding:0.6b
VECTOR_DIM=1024

# Restart
docker compose up -d --build

If your code recreates the collection when the dimension changes, data will be re-indexed automatically. If not, delete the collection manually via the Qdrant API.
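For the manual route, Qdrant exposes collection deletion over its REST API (a DELETE on the collection endpoint). Assuming the default port and a collection named knowledge, the fragment looks like this; restart the bot afterwards so ingestion rebuilds the collection:

```shell
# Delete the stale collection (replace "knowledge" with your collection name)
curl -X DELETE http://localhost:6333/collections/knowledge

# Restart so the app recreates the collection and re-ingests
docker compose up -d --build
```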


The bot is running on qwen3-embedding right now - chat button in the bottom right corner. Try asking something in Russian and compare with how it worked before. Or get in touch if you want a similar system for your product.
