From $32,000 to $0 with Small Models and CTranslate2
It’s Friday and I have some time to continue working on open source tasks. I had an idea that requires translating a dataset. It’s not large - only 2GB - and consists of English paragraphs: around 8 million rows averaging about 400 tokens each :)
I looked into the OpenAI models to calculate how much it would cost to translate all of this with GPT-4.1 and GPT-4.1-mini. My total tokens => 8,000,000 rows * 400 tokens = 3,200,000,000. For ease of calculation I will assume input and output tokens are equal, and I will use the normal API, not the Batch API:
- total input tokens => 3,200,000,000
- total output tokens => 3,200,000,000
The calculation is based on pricing as of 2025-10-03, per 1M tokens (a small script that reproduces the math follows the tables):
Cost calculation (millions of tokens × $ per 1M) | GPT-4.1 | GPT-4.1-mini | GPT-3.5 Turbo |
---|---|---|---|
Input tokens | 3,200 × $2.00 | 3,200 × $0.40 | 3,200 × $0.50 |
Output tokens | 3,200 × $8.00 | 3,200 × $1.60 | 3,200 × $1.50 |

Cost | GPT-4.1 | GPT-4.1-mini | GPT-3.5 Turbo |
---|---|---|---|
Input tokens | $6,400 | $1,280 | $1,600 |
Output tokens | $25,600 | $5,120 | $4,800 |
Total $ cost | $32,000 | $6,400 | $6,400 |
Total EGP cost | 1,527,680 | 305,536 | 305,536 |
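If you want to reproduce the math, it’s just millions of tokens times the price per million; the EGP figures use the ~47.74 EGP/USD rate implied by the totals above:

```python
# Cost = (millions of tokens) x (price per 1M tokens), input and output separately.
TOTAL_TOKENS_M = 8_000_000 * 400 / 1_000_000   # 3,200 million tokens each way
EGP_PER_USD = 47.74                            # rate implied by the table above

prices = {  # (input $, output $) per 1M tokens, as of 2025-10-03
    "GPT-4.1": (2.00, 8.00),
    "GPT-4.1-mini": (0.40, 1.60),
    "GPT-3.5 Turbo": (0.50, 1.50),
}

for name, (p_in, p_out) in prices.items():
    total_usd = TOTAL_TOKENS_M * (p_in + p_out)
    print(f"{name}: ${total_usd:,.0f} (~{total_usd * EGP_PER_USD:,.0f} EGP)")
```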
It’s interesting that GPT-4.1-mini ends up costing the same as GPT-3.5 Turbo ^ ^
Also, this is insane - I can afford 1,000 EGP, or about $50, at most :)
How poor am I? Very GPU poor! 💻
I wish I had a local GPU so I could test strong open source LLMs like Cohere’s 70B models, and then I would not care about time!
Then I remembered amazing work for translation that is not even an LLM, created by Ahmed Wasfy from the NAMAA community.
It’s a small 240M-parameter model trained to translate English into Egyptian Arabic. It was trained on more than 150,000 rows with more than 10 million Arabic tokens, and it competes with closed LLMs like GPT-4o and Claude 3.5 Sonnet.
I tested it on my laptop: a GTX 1660 Ti mobile GPU with 6GB of VRAM and an Intel Core i7 (9th gen) CPU with 32GB of DDR4 RAM. The model ran with PyTorch very fast and very efficiently! The translations are very close to what the big models produce, and that is enough for the task I want to build on this translated dataset! Let’s do the math on how much this will cost on my laptop :)
PyTorch Pipeline with HF Inference through Transformers
I created the translation pipeline and tested it in float16 with batch sizes [2, 4, 6, 8, 16]. I found that even though I had enough memory to load larger batches, the optimal batch size was 4 because of RAM and CPU limits. I was able to process 100 examples with these results (a minimal sketch of the pipeline follows the table):
Method | Number of examples | Batch size | GPU setup | Full time | Days needed (8M rows) |
---|---|---|---|---|---|
PyTorch + float16 | 100 | 8 | 1660 Ti laptop GPU | 60s | 55.56 days |
PyTorch + float16 | 100 | 4 | 1660 Ti laptop GPU | 45s | 41.67 days |
PyTorch + float16 + optimized version | 100 | 4 | 1660 Ti laptop GPU | 35s | 32.41 days |
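For reference, here is a minimal sketch of the kind of pipeline behind these numbers. The checkpoint name is the one I convert later in the post; the padding, truncation, and generation settings are my assumptions, not the exact script:

```python
# Rough reconstruction of the batched Transformers pipeline (not the exact script).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "NAMAA-Space/masrawy-english-to-egyptian-arabic-translator-v2.9"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
model.eval()

def translate_batch(texts, batch_size=4, max_new_tokens=512):
    outputs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to("cuda")
        with torch.inference_mode():
            generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```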
By "optimized version" I mean:

```python
model = model.eval()
model = torch.compile(model, mode="max-autotune", fullgraph=True)
```
torch.compile is a feature introduced in PyTorch 2 to accelerate models. I tried multiple settings and different backends, but there was no huge difference. Convert to ONNX? It’s a nightmare. I did it before and the results weren’t worth the headache! I may try it again when I have time.
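If I revisit ONNX, it will probably be through Hugging Face Optimum rather than a manual export. Something roughly like this, which I have not verified on this model (it needs onnxruntime-gpu for the CUDA provider):

```python
# Untested sketch: export the seq2seq checkpoint to ONNX via Optimum and run it
# with ONNX Runtime's CUDA execution provider.
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_name = "NAMAA-Space/masrawy-english-to-egyptian-arabic-translator-v2.9"
tokenizer = AutoTokenizer.from_pretrained(model_name)
ort_model = ORTModelForSeq2SeqLM.from_pretrained(
    model_name,
    export=True,                       # convert the PyTorch weights to ONNX on the fly
    provider="CUDAExecutionProvider",  # run on GPU instead of the default CPU provider
)

inputs = tokenizer("I hope this works out of the box.", return_tensors="pt").to("cuda")
outputs = ort_model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```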
Let’s Do Quantization
I tried torchao, which makes quantization easier and more stable to apply. I tried it a lot with my GPU, but because of my GPU’s compute capability it always throws this error:
AssertionError: Float8 dynamic activation quantization is only supported on CUDA>=8.9 and MI300+
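For context, this is roughly the call that triggered it; torchao’s API moves between releases, so treat it as a sketch rather than a recipe:

```python
# Hypothetical reconstruction of the failing call; `model` is the HF model from earlier.
# Float8 dynamic activation quantization needs compute capability >= 8.9 (Ada) or MI300+,
# and the 1660 Ti is only 7.5, hence the AssertionError.
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

quantize_(model, float8_dynamic_activation_float8_weight())
```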
The library code is not straightforward and I don’t have time - I have just 2 days to finish this before I return to my main work.
I also faced some errors because the model I’m using is an older one that isn’t optimized for these methods. It’s based on OPUS-MT-en-ar, created with Marian NMT, an efficient NMT implementation written in pure C++; the models were converted to PyTorch using Hugging Face’s Transformers library.
vLLM and SGLang for the HF translation pipeline
I searched for an SGLang solution for this model and didn’t find any help. I also went through the vLLM documentation and found these pages: bring_your_own_model and this. The speed was worse than the plain HF Transformers pipeline - about 5 seconds slower. I suspect there is a better way to write the vLLM version than mine.
More searching and CTranslate2 magic 🎩
I wanted to give up, but I decided to do one final search on how to serve a Marian model. In an old forum answer I found something called CTranslate2. People say it’s around 4-6x faster than HF Transformers for the specific architectures it supports.
Definition: CTranslate2 is a C++ and Python library for efficient inference with Transformer models. The following model types are currently supported:
- Encoder-decoder models: Transformer base/big, M2M-100, NLLB, BART, mBART, Pegasus, T5, Whisper
- Decoder-only models: GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT, Llama, Mistral, Gemma, CodeGen, GPTBigCode, Falcon, Qwen2
- Encoder-only models: BERT, DistilBERT, XLM-RoBERTa
Compatible models must first be converted into an optimized model format; the library includes converters for multiple frameworks.
Key features of CTranslate2
- Fast and efficient execution on CPU and GPU: the execution is significantly faster and requires fewer resources than general-purpose deep learning frameworks on supported models and tasks, thanks to many advanced optimizations: layer fusion, padding removal, batch reordering, in-place operations, caching mechanisms, etc.
- Quantization and reduced precision: model serialization and computation support weights with reduced precision: 16-bit floating points (FP16), 16-bit brain floating points (BF16), 16-bit integers (INT16), 8-bit integers (INT8), and AWQ quantization (INT4).
- Multiple CPU architectures support: the project supports x86-64 and AArch64/ARM64 processors and integrates multiple backends optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate.
- Automatic CPU detection and code dispatch: one binary can include multiple backends (e.g. Intel MKL and oneDNN) and instruction set architectures (e.g. AVX, AVX2) that are automatically selected at runtime based on the CPU information.
- Parallel and asynchronous execution: multiple batches can be processed in parallel and asynchronously using multiple GPUs or CPU cores.
- Dynamic memory usage: the memory usage changes dynamically depending on the request size while still meeting performance requirements, thanks to caching allocators on both CPU and GPU.
- Lightweight on disk: quantization can make the models 4 times smaller on disk with minimal accuracy loss.
- Simple integration: the project has few dependencies and exposes simple APIs in Python and C++ to cover most integration needs.
- Configurable and interactive decoding: advanced decoding features allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence.
- Tensor parallelism for distributed inference: very large models can be split across multiple GPUs.

Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.

I followed this documentation to set up the required environment.
Let’s try it!
I used the following command to convert the HF model into the CTranslate2 format:
```bash
ct2-transformers-converter --model NAMAA-Space/masrawy-english-to-egyptian-arabic-translator-v2.9 --output_dir ct2_model_masrawy
```
Then I loaded the converted model like this:
```python
import ctranslate2

translator = ctranslate2.Translator(
    "ct2_model_masrawy",
    device="cuda",
    compute_type="float16",
)
```
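To actually translate, CTranslate2 works on subword tokens rather than raw strings, so I tokenize with the original HF tokenizer, call translate_batch, and decode the best hypothesis back to text. This follows the pattern shown in the CTranslate2 Transformers documentation; the batch size here is just an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "NAMAA-Space/masrawy-english-to-egyptian-arabic-translator-v2.9"
)

def translate(texts, max_batch_size=4):
    # CTranslate2 expects lists of subword tokens, not raw strings.
    sources = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    results = translator.translate_batch(sources, max_batch_size=max_batch_size)
    # Each result holds token-level hypotheses; decode the best one back to text.
    return [
        tokenizer.decode(
            tokenizer.convert_tokens_to_ids(r.hypotheses[0]),
            skip_special_tokens=True,
        )
        for r in results
    ]

print(translate(["How are you today?"]))
```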
It worked and was very fast - much, much faster!
Method | Number of examples | Batch size | GPU setup | Full time | Days needed (8M rows) |
---|---|---|---|---|---|
CTranslate2 + float16 | 100 | 8 | 1660 Ti laptop GPU | 11s | 10.19 days |
CTranslate2 + float16 | 100 | 4 | 1660 Ti laptop GPU | 16s | 14.81 days |
CTranslate2 | 100 | 6 | 1660 Ti laptop GPU | 13s | 12.04 days |
We moved from 32.4 days to 10 days!!!
Getting help from the Titan RTX 24GB
One of my friends offered me access to his workstation, which has dual Titan RTX GPUs. They’re old GPUs, but far better than my little kobo.
After some tries I found that the optimal batch size is 64, so let’s use that and see the results. I will also increase the sample size from 100 to 1000.
Method | Number of examples | Batch size | GPU setup | Full time | Days needed (8M rows) |
---|---|---|---|---|---|
PyTorch + float16 + optimized version | 1000 | 64 | Titan RTX | 240s | 22.22 days |
CTranslate2 + int8 | 1000 | 64 | Titan RTX | 7.89s | 0.73 days |
CTranslate2 + int8 + dual GPU | 1000 | 64 | Dual Titan RTX | 4.42s | 0.41 days |
CTranslate2 + float16 | 1000 | 64 | Titan RTX | 6.46s | 0.60 days |
CTranslate2 + float16 + dual GPU | 1000 | 64 | Dual Titan RTX | 3.72s | 0.34 days |
CTranslate2 + int8_float16 | 1000 | 64 | Titan RTX | 6.81s | 0.63 days |
CTranslate2 + int8_float16 + dual GPU | 1000 | 64 | Dual Titan RTX | 3.85s | 0.36 days |
==We have now moved from 22.22 days on a single Titan RTX to 0.60 days!==
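For the dual-GPU rows, nothing exotic is needed: CTranslate2’s Translator accepts a list of device indices and replicates the model across them. A sketch, with the compute_type changed per row of the table:

```python
import ctranslate2

# Replicate the model on both Titan RTX cards; incoming batches are
# dispatched across the two replicas.
translator = ctranslate2.Translator(
    "ct2_model_masrawy",
    device="cuda",
    device_index=[0, 1],
    compute_type="float16",  # "int8" or "int8_float16" for the other rows
)
```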
Why float16 is faster than int8
Float16 Version (faster!):
- Dual GPU: 269.2 docs/sec
- Single GPU: 154.7 docs/sec
- Time for 1000 docs: 3.72 seconds
Int8 Version (slower):
- Dual GPU: 225.8 docs/sec
- Single GPU: 126.7 docs/sec
- Time for 1000 docs: 4.43 seconds
Result: float16 is ~19% faster (269.2 vs 225.8 docs/sec). Lower precision ≠ always faster! On this setup the int8 path most likely loses because of the extra quantize/dequantize work around each matmul, while the Titan RTX’s FP16 tensor cores are already very fast. I need to try larger batch sizes for int8 and see which setting wins!
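To check whether int8 catches up at larger batches, a quick sweep is enough. This is my own sketch: `translate` is the helper from earlier and `docs` stands for the 1000 test paragraphs:

```python
import time

# Sweep batch sizes and report throughput for whichever compute_type
# the translator was loaded with.
for batch_size in [64, 128, 256]:
    start = time.perf_counter()
    translate(docs, max_batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size}: {len(docs) / elapsed:.1f} docs/sec")
```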
Next steps!
This is just the start. I will search more and investigate how to make this faster because the 8 million rows are only 2GB and the next task is to translate 500GB :) Every second will make a huge difference!
- Use larger batch size with int8
- Use different GPU with modern architecture and better CPU
- Deep dive into vLLM
- Try the ONNX version for GPU, not CPU
- Try again with torchao
Thanks for your time! Here is the converted version on Hugging Face: ctranslate_masrawy. Small models can save your life ^ ^