Benchmarks

Comparative evaluation of MULAMU models against standard multilingual baselines on four metrics — BLEU-4, BERTScore F1, clinical accuracy, and response relevance — across Luganda, Runyankore-Rukiga, Swahili, and English.

All scores are research estimates evaluated on the MULAMU Primary Dataset (held-out test split). Human ratings were collected from clinical staff at MUST, Uganda.

Top scores across all metrics

| Metric | Best score | Description |
| --- | --- | --- |
| BLEU-4 | 31.4 | N-gram overlap with reference answers. Higher = more faithful output. |
| BERTScore F1 | 83.2 | Semantic similarity via contextual embeddings. Scale 0–100. |
| Clinical Accuracy | 68.5 | Human-rated correctness by clinical staff. Scale 0–100. |
| Response Relevance | 74.7 | Human-rated relevance to the question. Scale 0–100. |

All four best scores come from MULAMU 1.3B Medical (see the model comparison below).

Model comparison

MULAMU models vs. baselines. All metrics are on a 0–100 scale; higher is better.

| Model | Params | BLEU-4 | BERTScore F1 | Clinical Accuracy | Response Relevance |
| --- | --- | --- | --- | --- | --- |
| MULAMU 1.3B Medical | 1.3B | 31.4 | 83.2 | 68.5 | 74.7 |
| MULAMU 125M Medical | 125M | 24.1 | 78.6 | 56.8 | 63.4 |
| MULAMU 125M Multilingual | 125M | 18.7 | 74.1 | 38.4 | 47.6 |
| mT5-Base | 580M | 14.3 | 70.8 | 31.2 | 38.9 |
| BLOOM-560M | 560M | 9.8 | 67.2 | 24.7 | 31.5 |
| GPT-2 | 117M | 4.2 | 61.4 | 18.3 | 22.1 |

BLEU-4 measures n-gram overlap with reference answers; BERTScore F1 measures semantic similarity via contextual embeddings; clinical accuracy and response relevance are human-rated (see "How we evaluate" below).

MULAMU models by language

BLEU-4 and BERTScore F1 for the two medical-fine-tuned MULAMU models, evaluated separately per language.

| Language | BLEU-4 · 1.3B | BLEU-4 · 125M | BERTScore · 1.3B | BERTScore · 125M |
| --- | --- | --- | --- | --- |
| 🇺🇸 English | 38.2 | 29.4 | 86.1 | 81.2 |
| 🇹🇿 Swahili | 32.6 | 24.7 | 84.3 | 79.1 |
| 🇺🇬 Luganda | 28.4 | 21.3 | 81.7 | 77.4 |
| 🇺🇬 Runyankore-Rukiga | 26.3 | 20.8 | 80.8 | 76.3 |

Key finding

MULAMU models score highest on English and Swahili, the languages with the most training data, while domain-specific fine-tuning narrows the gap to English on Luganda and Runyankore-Rukiga. The 1.3B model outperforms the 125M variant by roughly 5–9 BLEU-4 points in every language.

How we evaluate

All models are evaluated on a held-out 15% split of the Primary Dataset that was not seen during training.
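
As a minimal sketch of how such a split can be made deterministically (the record schema and random seed are our assumptions, not the project's published configuration):

```python
from sklearn.model_selection import train_test_split

# Hypothetical QA records; the real dataset schema is not published here.
qa_pairs = [
    {"question": "What causes malaria?", "answer": "...", "lang": "eng"},
    {"question": "...", "answer": "...", "lang": "lug"},
    # ... remaining records
]

# Deterministic 85/15 split; the seed is an assumed value.
train_set, test_set = train_test_split(qa_pairs, test_size=0.15, random_state=42)
```

In practice one would likely also stratify the split by language so that each language keeps the same proportion in the training and test sets.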

BLEU-4 and BERTScore F1 are computed automatically against reference answers from the dataset.
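
A sketch of how these two automatic metrics are typically computed, using the common sacrebleu and bert-score packages (the example strings and the choice of multilingual encoder are assumptions, not the project's exact setup):

```python
import sacrebleu
from bert_score import score as bert_score

# Hypothetical model outputs paired with dataset reference answers.
hypotheses = ["Malaria is caused by Plasmodium parasites spread by mosquitoes."]
references = ["Malaria is caused by Plasmodium parasites transmitted by mosquito bites."]

# Corpus-level BLEU; sacrebleu's default max n-gram order is 4 (i.e. BLEU-4).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# BERTScore F1; a multilingual encoder is needed for Luganda and
# Runyankore-Rukiga (this particular model choice is an assumption).
_, _, f1 = bert_score(hypotheses, references,
                      model_type="bert-base-multilingual-cased")

print(f"BLEU-4: {bleu.score:.1f}")                   # 0-100 scale
print(f"BERTScore F1: {f1.mean().item() * 100:.1f}") # rescaled to 0-100
```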

Clinical accuracy and response relevance are rated by clinical staff at MUST, Uganda, on a random 200-sample subset using a structured rubric (0–4 Likert scale, normalised to 0–100).
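
The 0–4 to 0–100 normalisation is a linear rescaling; a minimal sketch (averaging across raters and samples is our assumption about the aggregation):

```python
def likert_to_score(ratings: list[int]) -> float:
    """Average 0-4 Likert ratings and rescale linearly to 0-100."""
    assert all(0 <= r <= 4 for r in ratings), "ratings must be on the 0-4 scale"
    return sum(ratings) / len(ratings) / 4 * 100

# E.g. four clinician ratings of a single response:
print(likert_to_score([3, 3, 2, 4]))  # 75.0
```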

Baseline details

GPT-2 (117M)

English-only causal LM with no multilingual or medical training.

BLOOM-560M

BigScience multilingual causal LM trained on 46 languages.

mT5-Base (580M)

Google seq2seq model pre-trained on 101 languages.

Use or improve our models

All models are freely available on Hugging Face. If you have data, annotations, or compute that could improve these results, we'd love to collaborate.
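
If the models follow the standard Hugging Face transformers loading pattern for causal LMs, usage might look like the sketch below; the repository ID is a hypothetical placeholder, so check the project's Hugging Face page for the actual model names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo ID for illustration only; the real MULAMU model
# names live on the project's Hugging Face page.
model_id = "mulamu/mulamu-1.3b-medical"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Question: What are the early symptoms of malaria?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```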