Evaluation
Benchmarks
Comparative evaluation of MULAMU models against standard multilingual baselines on four metrics — BLEU-4, BERTScore F1, clinical accuracy, and response relevance — across Luganda, Runyankore-Rukiga, Swahili, and English.
All scores are research estimates evaluated on the MULAMU Primary Dataset (held-out test split). Human ratings were collected from clinical staff at Mbarara University of Science and Technology (MUST), Uganda.
Best model — MULAMU 1.3B Medical
Top scores across all metrics
BLEU-4
31.4
N-gram overlap with reference answers. Higher = more faithful output.
BERTScore F1
83.2
Semantic similarity via contextual embeddings. Scale 0–100.
Clinical Accuracy
68.5
Human-rated correctness by clinical staff. Scale 0–100.
Response Relevance
74.7
Human-rated relevance to the question. Scale 0–100.
All models
Model comparison
Coloured bars = MULAMU models · Grey bars = baselines
[Bar charts: one panel per metric (BLEU-4, BERTScore F1, Clinical Accuracy, Response Relevance), each on a 0–100 scale. Models compared: GPT-2 (117M), BLOOM (560M), mT5 (580M), MULAMU 125M Multi (125M), MULAMU 125M Med (125M), and MULAMU 1.3B Med (1.3B).]
Language breakdown
MULAMU models by language
BLEU-4 and BERTScore F1 for the two MULAMU models fine-tuned on medical data, evaluated separately for each language.
[Table columns: Language · BLEU-4 (1.3B) · BLEU-4 (125M) · BERTScore F1 (1.3B) · BERTScore F1 (125M).]
Key finding
MULAMU models score highest on English and Swahili, the languages with the most training data, and thanks to the domain-specific fine-tuning their Luganda and Runyankore-Rukiga scores come closest to their English performance. The 1.3B model outperforms the 125M variant by roughly 6–8 BLEU points across all languages.
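For illustration, a per-language breakdown like the table above can be produced by grouping test examples by language and scoring each group. The sketch below assumes records with "lang", "hypothesis", and "reference" fields; these names are illustrative, not the dataset's actual schema.

```python
from collections import defaultdict

import sacrebleu  # pip install sacrebleu

def bleu_by_language(records):
    """Score model outputs per language with corpus-level BLEU-4.

    `records` is an iterable of dicts with "lang", "hypothesis", and
    "reference" keys -- illustrative field names only.
    """
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        hyps, refs = groups[rec["lang"]]
        hyps.append(rec["hypothesis"])
        refs.append(rec["reference"])

    # corpus_bleu takes the hypotheses plus a list of reference streams;
    # here each example has a single reference.
    return {lang: sacrebleu.corpus_bleu(hyps, [refs]).score
            for lang, (hyps, refs) in groups.items()}
```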
Methodology
How we evaluate
All models are evaluated on a held-out 15% split of the Primary Dataset that was not seen during training.
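For example, such a split can be produced with the Hugging Face datasets library; this is a minimal sketch under the assumption that the data is stored as JSON Lines, not a description of our exact tooling.

```python
from datasets import load_dataset

# Hypothetical file name: the real MULAMU Primary Dataset location is
# not given on this page.
dataset = load_dataset("json", data_files="mulamu_primary.jsonl")["train"]

# Hold out 15% for evaluation; a fixed seed keeps the split reproducible.
splits = dataset.train_test_split(test_size=0.15, seed=42)
train_set, test_set = splits["train"], splits["test"]
```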
BLEU-4 and BERTScore F1 are computed automatically against reference answers from the dataset.
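As a sketch, both automatic metrics can be computed with the sacrebleu and bert-score packages; these are common choices and an assumption on our part, not a statement of the exact pipeline.

```python
import sacrebleu               # pip install sacrebleu
from bert_score import score   # pip install bert-score

hypotheses = ["give the child paracetamol and plenty of fluids"]  # model outputs
references = ["give the child paracetamol and make sure they drink fluids"]

# BLEU-4: sacrebleu scores n-grams up to order 4 by default.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU-4: {bleu.score:.1f}")

# BERTScore F1 with a multilingual encoder, since the test set mixes
# Luganda, Runyankore-Rukiga, Swahili, and English.
# (The encoder choice here is an assumption, not the project's setup.)
P, R, F1 = score(hypotheses, references,
                 model_type="bert-base-multilingual-cased")
print(f"BERTScore F1: {F1.mean().item() * 100:.1f}")
```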
Clinical accuracy and response relevance are rated by clinical staff at MUST, Uganda, on a random 200-sample subset using a structured rubric (0–4 Likert scale, normalised to 0–100).
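The normalisation step itself is straightforward; a minimal sketch, assuming each sample receives one or more 0–4 ratings:

```python
def normalise_likert(ratings, scale_max=4):
    """Average 0-4 Likert ratings and map them onto a 0-100 scale."""
    return sum(ratings) / len(ratings) / scale_max * 100

# Three raters score one response 3, 4, and 2 for clinical accuracy:
print(normalise_likert([3, 4, 2]))  # 75.0
```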
Next steps
Use or improve our models
All models are freely available on Hugging Face. If you have data, annotations, or compute that could improve these results, we'd love to collaborate.
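A minimal usage sketch with the transformers library; the repository id below is a placeholder, not a real model path, and we assume a causal-LM head, so check each model card before loading.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id: substitute the actual model path from our
# Hugging Face page.
MODEL_ID = "your-org/mulamu-1.3b-medical"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "Question: What should I do if my child has a fever?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```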