My Dream Job Working at Tarteel & muaalem El Quran | Part 2
The Journey Continues | Tarteel AI career
Last year, I wrote about my dream to work at Tarteel.
It was a heartfelt post about my passion for Quranic technology. No one from Tarteel reached out :) which was expected.
Feeling passionate isn’t enough; you must be a strong engineer and demonstrate your skills so companies see the value in hiring you.
Talking is easy. Acting is hard.
Some people messaged me on LinkedIn saying they shared the same feelings. Others mentioned they knew people at Tarteel and offered to help.
But honestly, I didn’t pursue this dream as actively as I should have.
Instead, I joined xbites, a real estate AI company where I learned invaluable lessons about building agentic systems, creating production-ready models, and meeting business requirements.
We had hard times and a lot of fun.
This is my last week at xbites — I’m moving to another opportunity. But what about Tarteel?
muaalem El Quran: A New Chapter
I’m building a habit of daily TIL (Today I Learned) posts to improve my learning and share knowledge with others. This blog is part of that journey.
I’m excited to share my work on muaalem El Quran, an open-source Quranic recitation model developed by Abdalla Amal.
It’s designed as an alternative to Tarteel for the developer community, enabling anyone to build their own solutions on top of it.
We have an ambitious roadmap ahead. I joined the team as an open-source contributor focusing on:
AI model optimization
Inference acceleration
Deployment infrastructure
El muaalem Before Ramadan
We have two months before Ramadan and several deliverables for both business partners and open-source projects like Quran Foundation.
The Challenge: How Should We Serve the Model?
I approached this systematically:
Model Optimization — Make inference faster
Preprocessing Optimization — Efficient audio processing
Serving & Batching — Production-ready API
Model Optimization (Wav2Vec2Bert)
Starting Point
| Metric | Value |
|---|---|
| Model | Wav2Vec2BertForMultilevelCTC |
| Architecture | Custom audio model with 11 output heads |
| Parameters | 605 million |
| Baseline RTF | 0.0303 |
| Test GPU | NVIDIA GTX 1660 Ti (6GB VRAM) |
What is RTF? Real-Time Factor = Processing Time ÷ Audio Duration. An RTF of 0.03 means 1 second of audio processes in 30ms. Lower is better.
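Concretely, this is how I measure it. A minimal sketch where run_inference is just a placeholder for whatever forward pass you’re timing:
import time

def run_inference(audio):
    # stand-in for the real model call
    time.sleep(0.3)

audio_duration = 10.0                       # seconds of input audio
start = time.perf_counter()
run_inference(None)
processing_time = time.perf_counter() - start

rtf = processing_time / audio_duration      # ~0.03 here: 10 s of audio in ~0.3 s
print(f"RTF: {rtf:.4f}")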
The model has 11 output heads, each trained with CTC to predict a different Tajweed attribute (a decoding sketch follows the list):
Phonemes (Arabic sounds)
Ghonna (nasalization)
Qalqala (echoing sounds)
Tafkheem/Tarqeeq (heavy/light letters)
And 7 more…
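Since every head is a CTC head, decoding means collapsing repeated frame-level predictions and dropping blanks. A minimal greedy-decode sketch for a single head, where the vocabulary size and blank id are assumptions for illustration, not the actual muaalem El Quran setup:
import torch

blank_id = 0                                 # assumed blank index
logits = torch.randn(1, 250, 45)             # (batch, frames, vocab) for one head
ids = logits.argmax(dim=-1).squeeze(0)       # best class per frame

decoded, prev = [], None
for i in ids.tolist():
    if i != prev and i != blank_id:          # collapse repeats, drop blanks
        decoded.append(i)
    prev = i
# map `decoded` through the head's tokenizer to get the actual symbols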
Optimization 1: PyTorch torch.compile
PyTorch 2.0 introduced torch.compile(), a one-line optimization that can sometimes achieve 2x speedups through graph optimization and kernel fusion.
model = torch.compile(model)
Result: RTF 0.0303 — No improvement
The GTX 1660 Ti lacks Tensor Cores, which limits how much torch.compile’s graph optimization and kernel fusion can help on this card.
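Worth checking your hardware before reaching for torch.compile. A quick sanity check:
import torch

print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA GeForce GTX 1660 Ti"
print(torch.cuda.get_device_capability(0))  # (7, 5) on Turing; still no Tensor Cores on the 1660 Ti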
Optimization 2: ONNX Runtime
ONNX (Open Neural Network Exchange) allows models to run on optimized inference engines. We exported the PyTorch model to ONNX format and tested different execution providers.
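The export itself looked roughly like this. A hedged sketch: the checkpoint name is a public stand-in for our custom multilevel-CTC model, and the 160-dim feature shape follows Wav2Vec2Bert’s usual stacked-filterbank input:
import torch
from transformers import AutoModel

# Stand-in for the custom Wav2Vec2BertForMultilevelCTC model
model = AutoModel.from_pretrained("facebook/w2v-bert-2.0").eval()

dummy = torch.randn(1, 200, 160)  # (batch, frames, filterbank features)
torch.onnx.export(
    model,
    (dummy,),
    "model.onnx",
    input_names=["input_features"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_features": {0: "batch", 1: "frames"}},
    opset_version=17,
)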
CUDA Execution Provider:
session = ort.InferenceSession(
"model.onnx",
providers=["CUDAExecutionProvider"]
)
Result: RTF 0.0263 — 1.15x faster
Graph Optimizations: We applied ONNX Runtime’s graph optimizer which fused 97 SkipLayerNormalization operations.
Result: RTF 0.0258 — 1.17x faster
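Enabling the optimizer is a session-options flag. A minimal sketch that also persists the fused graph so later sessions skip the optimization pass (file names are illustrative):
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_opt.onnx"  # save the fused graph

session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=["CUDAExecutionProvider"],
)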
FP16 Quantization: We attempted FP16 (half precision) to reduce memory and increase speed.
Result: RTF 0.1170 — 4x SLOWER!
Without Tensor Cores, FP16 operations fell back to CPU, dramatically hurting performance.
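For reference, one common way to produce the FP16 model is the onnxconverter-common package; a sketch under that assumption (the lesson is hardware-dependent either way):
import onnx
from onnxconverter_common import float16

model = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")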
Optimization 3: TensorRT
NVIDIA TensorRT is a high-performance inference optimizer. It analyzes the network and applies:
Layer fusion
Kernel auto-tuning
Precision calibration (FP16/INT8)
Memory optimization
trt_options = {
"trt_fp16_enable": True,
"trt_engine_cache_enable": True,
"trt_engine_cache_path": "trt_cache",
"trt_cuda_graph_enable": True,
}
session = ort.InferenceSession(
"model.onnx",
providers=[
("TensorrtExecutionProvider", trt_options),
"CUDAExecutionProvider"
]
)
Key options explained:
trt_fp16_enable: Use half precision where beneficial
trt_engine_cache_enable: Cache the optimized engine (first run is slow, subsequent runs are fast)
trt_cuda_graph_enable: Reduce CPU overhead between kernel launches
Result: RTF 0.0178 — 1.70x faster! 🚀
Optimization 4: CTranslate2
CTranslate2 is a C++ inference engine optimized for transformer models. It supports INT8 quantization which compresses weights to 8-bit integers.
A caveat: the model converted here is only the base Wav2Vec2Bert part. CTranslate2’s automatic conversion script failed on our modified architecture, so I implemented a class wrapper to make the modified model convertible.
# Convert model to CTranslate2 format with INT8 weights
import ctranslate2
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter("wav2vec2bert_base")
converter.convert("ct2_model", quantization="int8")

# Load and run on GPU
encoder = ctranslate2.models.Wav2Vec2Bert(
    "ct2_model",
    device="cuda",
    compute_type="int8"
)
Result: RTF 0.0189 — 1.60x faster
Final Results Summary
| Runtime | RTF | Speedup | Notes |
|---|---|---|---|
| PyTorch (baseline) | 0.0303 | 1.00x | Reference |
| PyTorch + torch.compile | 0.0303 | 1.00x | No gain on GTX 1660 Ti |
| ONNX CUDA | 0.0263 | 1.15x | Easy win |
| ONNX Optimized | 0.0258 | 1.17x | Graph fusion |
| ONNX FP16 | 0.1170 | 0.26x | ❌ Avoid without Tensor Cores |
| CTranslate2 INT8 | 0.0189 | 1.60x | Great for deployment |
| TensorRT FP16 | 0.0178 | 1.70x | Best result |
Serving the Model
With optimization complete, we needed a production-ready API. We chose LitServe, a lightweight Python framework that integrates seamlessly with our TensorRT-optimized model.
Server Architecture
import litserve as ls

class TensorRTAPI(ls.LitAPI):
    def setup(self, device):
        # Load the ONNX model with the TensorRT execution provider
        # and initialize tokenizers for all 11 output heads
        ...

    def decode_request(self, request):
        # Receive the uploaded audio file, load it with librosa,
        # and extract features with the HuggingFace processor
        ...

    def predict(self, input_features):
        # Run TensorRT inference and return logits for all heads
        ...

    def encode_response(self, outputs):
        # CTC-decode each head and return JSON with all predictions
        ...
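Launching the server is then a couple of lines. A minimal sketch (the accelerator flag is an assumption; port 8000 matches the request below):
if __name__ == "__main__":
    server = ls.LitServer(TensorRTAPI(), accelerator="gpu")
    server.run(port=8000)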
API Usage
curl -X POST http://localhost:8000/predict \
  -F "audio=@recitation.wav"
Response:
{
  "phonemes": "ءِنَلَاهَبِكُلِشَيءِنعَلِۦمُ",
  "ghonna": "[لا غنة][مغن][لا غنة]",
  "qalqla": "[لا قلقلة]",
  ...
}
The Dependency Hell with ONNX
Okay let me be real with you.
I spent more time fighting package versions than actually optimizing the model. Not joking.
So you install librosa because you need audio stuff. It pulls in numba. Numba wants numpy<=2.0. Fine, whatever.
Then you need onnxruntime-gpu. Guess what? It wants numpy>=2.1.
Everything breaks. You Google. Stack Overflow says “just downgrade”. You downgrade. Now onnxruntime breaks. Great.
And don’t get me started on ctranslate2 — the moment you install it, it downgrades like 5 packages you didn’t even know you needed.
The TensorRT disaster:
I ran uv pip install tensorrt. Waited. And waited. 12 hours later — still downloading. I gave up and went to sleep.
Next day I found out you need NVIDIA’s special index:
pip install tensorrt --extra-index-url https://pypi.nvidia.com
Took 30 mins. WHY IS THIS NOT THE DEFAULT.
Honestly? Just use different virtual environments. Don’t be like me trying to fit everything in one place. Your sanity is worth more.
Key Learnings
Know your hardware: GTX 1660 Ti lacks Tensor Cores, so FP16 without TensorRT hurts performance
TensorRT is powerful: Even on consumer GPUs, it provides significant speedups
Cache your engines: TensorRT engine building is slow; always enable caching
INT8 is viable: CTranslate2’s INT8 quantization offers good speedups with minimal accuracy loss
Profile first: Don’t assume — measure each optimization’s actual impact
There is more to come, and I’m still exploring the remaining parts of the optimization work… Thanks!