Evaluation
Benchmarks
Comparative evaluation of MULAMU models against standard multilingual baselines on four metrics — BLEU-4, BERTScore F1, clinical accuracy, and response relevance — across Luganda, Runyankore-Rukiga, Swahili, and English.
All scores are research estimates evaluated on the MULAMU Primary Dataset (held-out test split). Human ratings were collected from clinical staff at Mbarara University of Science and Technology (MUST), Uganda.
Best model — MULAMU 1.3B Medical
Top scores across all metrics
BLEU-4
31.4
N-gram overlap with reference answers. Higher = more faithful output.
BERTScore F1
83.2
Semantic similarity via contextual embeddings. Scale 0–100.
Clinical Accuracy
68.5
Human-rated correctness by clinical staff. Scale 0–100.
Response Relevance
74.7
Human-rated relevance to the question. Scale 0–100.
All models
Model comparison
Coloured bars = MULAMU models · Grey bars = baselines
[Bar charts: one panel per metric (BLEU-4, BERTScore F1, Clinical Accuracy, Response Relevance), each on a 0–100 scale. Models compared: GPT-2 (117M), BLOOM (560M), mT5 (580M), MULAMU 125M Multi (125M), MULAMU 125M Med (125M), and MULAMU 1.3B Med (1.3B).]
Language breakdown
MULAMU models by language
BLEU-4 and BERTScore F1 for the two MULAMU models fine-tuned on medical data, evaluated separately for each language.
[Table columns: Language · BLEU-4 (1.3B) · BLEU-4 (125M) · BERTScore F1 (1.3B) · BERTScore F1 (125M).]
Key finding
MULAMU models score highest on English and Swahili, the languages with the most training data, and thanks to the domain-specific fine-tuning their Luganda and Runyankore-Rukiga scores come closest to their English performance. The 1.3B model outperforms the 125M variant by roughly 6–8 BLEU points across all languages.
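For illustration, a per-language breakdown like the table above can be produced by grouping test examples by language and scoring each group. The sketch below assumes records with "lang", "hypothesis", and "reference" fields; these names are illustrative, not the dataset's actual schema.

```python
from collections import defaultdict

import sacrebleu  # pip install sacrebleu

def bleu_by_language(records):
    """Score model outputs per language with corpus-level BLEU-4.

    `records` is an iterable of dicts with "lang", "hypothesis", and
    "reference" keys -- illustrative field names only.
    """
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        hyps, refs = groups[rec["lang"]]
        hyps.append(rec["hypothesis"])
        refs.append(rec["reference"])

    # corpus_bleu takes the hypotheses plus a list of reference streams;
    # here each example has a single reference.
    return {lang: sacrebleu.corpus_bleu(hyps, [refs]).score
            for lang, (hyps, refs) in groups.items()}
```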
Methodology
How we evaluate
All models are evaluated on a held-out 15% split of the Primary Dataset that was not seen during training.
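For example, such a split can be produced with the Hugging Face datasets library; this is a minimal sketch under the assumption that the data is stored as JSON Lines, not a description of our exact tooling.

```python
from datasets import load_dataset

# Hypothetical file name: the real MULAMU Primary Dataset location is
# not given on this page.
dataset = load_dataset("json", data_files="mulamu_primary.jsonl")["train"]

# Hold out 15% for evaluation; a fixed seed keeps the split reproducible.
splits = dataset.train_test_split(test_size=0.15, seed=42)
train_set, test_set = splits["train"], splits["test"]
```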
BLEU-4 and BERTScore F1 are computed automatically against reference answers from the dataset.
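As a sketch, both automatic metrics can be computed with the sacrebleu and bert-score packages; these are common choices and an assumption on our part, not a statement of the exact pipeline.

```python
import sacrebleu               # pip install sacrebleu
from bert_score import score   # pip install bert-score

hypotheses = ["give the child paracetamol and plenty of fluids"]  # model outputs
references = ["give the child paracetamol and make sure they drink fluids"]

# BLEU-4: sacrebleu scores n-grams up to order 4 by default.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU-4: {bleu.score:.1f}")

# BERTScore F1 with a multilingual encoder, since the test set mixes
# Luganda, Runyankore-Rukiga, Swahili, and English.
# (The encoder choice here is an assumption, not the project's setup.)
P, R, F1 = score(hypotheses, references,
                 model_type="bert-base-multilingual-cased")
print(f"BERTScore F1: {F1.mean().item() * 100:.1f}")
```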
Clinical accuracy and response relevance are rated by clinical staff at MUST, Uganda, on a random 200-sample subset using a structured rubric (0–4 Likert scale, normalised to 0–100).
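The normalisation step itself is straightforward; a minimal sketch, assuming each sample receives one or more 0–4 ratings:

```python
def normalise_likert(ratings, scale_max=4):
    """Average 0-4 Likert ratings and map them onto a 0-100 scale."""
    return sum(ratings) / len(ratings) / scale_max * 100

# Three raters score one response 3, 4, and 2 for clinical accuracy:
print(normalise_likert([3, 4, 2]))  # 75.0
```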
Next steps
Use or improve our models
All models are freely available on Hugging Face. If you have data, annotations, or compute that could improve these results, we'd love to collaborate.
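A minimal usage sketch with the transformers library; the repository id below is a placeholder, not a real model path, and we assume a causal-LM head, so check each model card before loading.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id: substitute the actual model path from our
# Hugging Face page.
MODEL_ID = "your-org/mulamu-1.3b-medical"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "Question: What should I do if my child has a fever?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```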