Leyu on Hugging Face

Welcome to the official Leyu by gheero organization on Hugging Face!

Featured Datasets (Leyu Amharic Dialects)

Leyu-amharic-shewa-dialect
Leyu-amharic-wello-dialect
Leyu-amharic-gonder-dialect
Leyu-amharic-gojjam-dialect
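
As a minimal sketch of how one of the datasets above might be loaded for STT experiments, the snippet below uses the Hugging Face `datasets` library. The dataset names come from the list above. The organization namespace on the Hub, the `train` split name, and the helper names `repo_id` and `load_dialect` are assumptions made for illustration, so check each dataset card for the actual values.

```python
# Dataset names as listed in the Featured Datasets section.
DIALECT_DATASETS = {
    "shewa": "Leyu-amharic-shewa-dialect",
    "wello": "Leyu-amharic-wello-dialect",
    "gonder": "Leyu-amharic-gonder-dialect",
    "gojjam": "Leyu-amharic-gojjam-dialect",
}


def repo_id(dialect: str, namespace: str) -> str:
    """Build a Hub repo id of the form '<namespace>/<dataset-name>'.

    `namespace` is the organization's Hub namespace. It is not stated in
    this card, so the caller must supply it (assumption).
    """
    key = dialect.lower()
    if key not in DIALECT_DATASETS:
        known = ", ".join(sorted(DIALECT_DATASETS))
        raise ValueError(f"Unknown dialect {dialect!r}; expected one of: {known}")
    return f"{namespace}/{DIALECT_DATASETS[key]}"


def load_dialect(dialect: str, namespace: str, split: str = "train", streaming: bool = True):
    """Load one dialect dataset from the Hub.

    The default split name "train" is an assumption. Streaming avoids
    downloading the full audio corpus before iterating.
    """
    # Imported lazily so repo_id() works without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id(dialect, namespace), split=split, streaming=streaming)
```

For example, `load_dialect("gojjam", namespace="<org-namespace>")` would return an iterable over the Gojjam recordings once `<org-namespace>` is replaced with the organization's real Hub namespace.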

About the Datasets

Our datasets are a specialized collection of speech audio for low-resource African languages, currently emphasizing dialects of languages spoken in Ethiopia, such as the Amharic dialects featured above. Designed primarily for Speech-to-Text (STT) research, the corpus captures the phonetic nuances and rhythmic patterns that distinguish each dialect.

The audio was recorded in real-world environments by contributors using mobile devices, providing diverse acoustic conditions that help train robust models. Every recording undergoes rigorous manual review, where designated reviewers verify transcript alignment and audio clarity.

To support inclusive and representative AI systems, we prioritized speaker and recording diversity across the collection:

Gender Balance: balanced representation of male and female voices
Age Distribution: speakers aged 18–35 years
Regional Diversity: native speakers from the specific regional zones of each dialect
Technical Environment: mobile-recorded in real-world conditions (background noise, varied microphones)

gheero Blogs

Explore more about our work on low-resource languages, dialect research, and inclusive AI development: