qwen3 vs nomic: swapping embedding models with real numbers
This is a follow-up to the previous article, where I implemented Hybrid RAG in the chatbot on my website. RAG was working, tokens were being saved, but one problem kept nagging me: Russian queries were retrieving the wrong chunks.

The problem
After launching RAG, I checked how the bot responds to the question “Tell me about the Hybrid RAG article”. The bot answered: “Hybrid RAG: the best of Agile and Waterfall”. An article about token optimization had turned into a project management methodology.
I looked at the logs. The blog_hybrid_rag chunk with the correct text was sitting in Qdrant. But when searching for “Hybrid RAG token optimization” (in Russian), it landed at 44th place out of 48 with a score of 0.583. The first 7 positions were taken by random articles about IT transformations and BPM.
The problem was nomic-embed-text. The model is trained predominantly on English data. Russian text gets converted into a vector, but the semantics are lost: “token optimization” and “management evolution” look roughly the same to it.
Choosing a replacement
Criteria:
- Native support for Russian, English, and Kazakh
- Runs in Ollama (everything is local, no external APIs)
- Fits in 4 GB RAM on the server alongside Qdrant and the bot
qwen3-embedding was a perfect fit: #1 on the MTEB multilingual leaderboard, three sizes (0.6B / 4B / 8B), available in Ollama. I picked 0.6B - 639 MB, not much heavier than nomic (274 MB).
Migration
Swapping an embedding model is not just changing a name. The vector dimension changes (768 -> 1024), which means all Qdrant collections need to be recreated and data re-indexed.
I wrote the code with env vars from the start:
```python
EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
VECTOR_DIM = int(os.environ.get("VECTOR_DIM", "768"))
```
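With the model name and dimension in the environment, the embed call itself can stay model-agnostic. A minimal sketch against Ollama's `/api/embed` endpoint - the `OLLAMA_URL` variable and helper names here are my own illustration, not the bot's actual code:

```python
import json
import os
import urllib.request

EMBED_MODEL = os.environ.get("EMBED_MODEL", "nomic-embed-text")
# Assumed Ollama address; adjust to your docker-compose network.
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434")

def build_embed_request(text: str) -> dict:
    # Payload shape for Ollama's /api/embed endpoint.
    return {"model": EMBED_MODEL, "input": text}

def embed(text: str) -> list[float]:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=json.dumps(build_embed_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # /api/embed returns {"embeddings": [[...]]} - one vector per input.
        return json.load(resp)["embeddings"][0]
```

Because the model name comes from the environment, swapping nomic for qwen3 never touches this code path.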
Added automatic collection recreation when the dimension changes:
```python
def ensure_collection():
    client = get_client()
    collections = [c.name for c in client.get_collections().collections]
    if COLLECTION in collections:
        info = client.get_collection(COLLECTION)
        existing_dim = info.config.params.vectors.size
        if existing_dim != VECTOR_DIM:
            logger.info("Vector dimension changed %d -> %d, recreating collection",
                        existing_dim, VECTOR_DIM)
            client.delete_collection(COLLECTION)
        else:
            return
    client.create_collection(
        COLLECTION,
        vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
    )
```
Deploy:
```shell
# On the server
docker exec ollama ollama pull qwen3-embedding:0.6b

# In .env
EMBED_MODEL=qwen3-embedding:0.6b
VECTOR_DIM=1024

# Restart (CI/CD or manually)
docker compose up -d --build chatbot
```
In the logs on startup:
```
Vector dimension changed 768 -> 1024, recreating collection
Created Qdrant collection: knowledge (dim=1024)
Knowledge ingestion complete: 48/48 chunks
```
All 48 chunks re-indexed automatically. Downtime - a few seconds.
Benchmark
12 queries across three languages. For each query - the expected chunk and the actual rank/score from Qdrant. Same 48 chunks, same search_text, only the embedding model differs.
Russian
| Query | Expected chunk | nomic | qwen3 |
|---|---|---|---|
| Расскажи про опыт в банке (tell me about the bank experience) | career_bank_head | #5 0.636 | #1 0.658 |
| опыт работы в крупном банке (experience at a large bank) | career_bank_head | #1 0.715 | #1 0.683 |
| какие услуги предлагает (what services does he offer) | services_overview | #2 0.688 | #1 0.612 |
| где учился Ильяс (where did Ilyas study) | education_certs | #2 0.666 | #1 0.561 |
| Hybrid RAG оптимизация (Hybrid RAG optimization) | blog_hybrid_rag | #44 0.583 | #1 0.719 |
| как устроен чатбот (how the chatbot works) | blog_hybrid_rag | #14 0.613 | #5 0.636 |
| где работал Ильяс (where did Ilyas work) | career_overview | #11 0.661 | #1 0.663 |
| статья про 4 бага (the article about 4 bugs) | blog_four_bugs | #19 0.599 | #3 0.553 |
The key row: “Hybrid RAG optimization” (in Russian). nomic ranked the target chunk 44th. qwen3 ranked it 1st.
English
| Query | Expected chunk | nomic | qwen3 |
|---|---|---|---|
| What services does Ilyas offer | services_overview | #1 0.719 | #1 0.825 |
| banking career experience | career_bank_head | #2 0.589 | #1 0.733 |
| AI agents and LLM | service_ai_agents | #1 0.715 | #1 0.823 |
In English, both models find the right chunk. But qwen3 gives scores of 0.82-0.83 vs nomic's 0.59-0.72. A higher score means a larger margin over the noise floor and less chance of an irrelevant chunk slipping into the context.
Kazakh
| Query | Expected chunk | nomic | qwen3 |
|---|---|---|---|
| Ильястың банктегі тәжірибесі (Ilyas's experience at the bank) | career_bank_head | #14 0.575 | #6 0.503 |
Kazakh remains the hardest language for both models. But qwen3 at least lands in the top-7 (our retrieve_top_k), while nomic does not.
Summary
nomic found the target chunk within the retrieval window 42% of the time. qwen3 - 100%.
Speed
Per-query latency is nearly identical; if anything, qwen3 is slightly faster. Ingestion is slower (9.5 vs 5.4 sec), but that is a one-time operation at container startup.
Why nomic loses on Russian
nomic-embed-text is trained on English data. It “understands” Russian words but cannot distinguish their semantics. For it, “token optimization” and “management evolution” (in Russian) are roughly the same thing: a set of Cyrillic characters with a similar structure.
qwen3-embedding is trained on 100+ languages, including Russian. It understands that “token optimization” (in Russian) is closer to “token cost reduction” than to “project management evolution”.
Visually, it looks like this: nomic gives scores of 0.58-0.72 for all 48 chunks - a narrow corridor where the signal is lost in noise. qwen3 gives 0.50-0.83 - wider spread, the target chunk stands out clearly.
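That separation can be put into a number: the gap between the top-scoring chunk and the average of the rest. A toy illustration - the two score lists below only mimic the shapes described, they are not the real measurements:

```python
from statistics import mean

def separation(scores: list[float]) -> float:
    """Gap between the best-scoring chunk and the mean of the rest.

    A larger gap means the target stands out from the noise floor.
    """
    top, *rest = sorted(scores, reverse=True)
    return top - mean(rest)

# Rough shape of the two distributions described above (synthetic data):
nomic_like = [0.72] + [0.62] * 47   # narrow corridor
qwen3_like = [0.83] + [0.55] * 47   # wide spread
print(round(separation(nomic_like), 2))  # 0.1
print(round(separation(qwen3_like), 2))  # 0.28
```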
What was not measured
This benchmark is not MTEB. 48 chunks, 12 queries, one use case. The results show a specific improvement for a specific product: a multilingual chatbot on a personal website handling Russian, English, and Kazakh queries.
For a purely English RAG, nomic may be sufficient. For any project with Russian or other non-English languages, qwen3-embedding is the clear winner.
How to try it yourself
If you already have a RAG setup on Ollama + Qdrant (or you built one following the previous article):
```shell
# Pull the model
docker exec ollama ollama pull qwen3-embedding:0.6b

# Add to .env
EMBED_MODEL=qwen3-embedding:0.6b
VECTOR_DIM=1024

# Restart
docker compose up -d --build
```
If your code recreates the collection when the dimension changes, the data will be re-indexed automatically. If not, delete the collection manually via the Qdrant REST API (`DELETE /collections/{collection_name}`).
The bot is running on qwen3-embedding right now - chat button in the bottom right corner. Try asking something in Russian and compare with how it worked before. Or get in touch if you want a similar system for your product.


