Creating a speech corpus #1: Before you begin

Before you start collecting data, you need to do some due diligence. As important as speech data sets are, they are not trivial to create, and you need to balance what you want from the data against the time and resources you can access. I don’t mean to suggest that developing speech data sets isn’t important (it is), but rather that it needs to happen after careful consideration. This post gets at some of the things you’ll want to think about before you start planning your dream corpus. ...

20 Jul 2021 · 6 min · Khia A. Johnson

Creating a speech corpus: A new blog series

So you want to create a new speech dataset? There are a lot of things to consider at every stage of the process. This is the first (introductory) post in a series I’m starting on the topic, based on my experience developing the SpiCE corpus of Speech in Cantonese and English. There are undoubtedly things I could have done better, but in any case, I certainly learned a lot about speech data along the way. ...

29 Jun 2021 · 2 min · Khia A. Johnson