Creating a speech corpus: A new blog series

So you want to create a new speech dataset? There are a lot of things to consider at every stage of the process. This is the first (introductory) post in a series I’m starting on the topic, based on my experience developing the SpiCE corpus of Speech in Cantonese and English. There are undoubtedly things I could have done better, but in any case, I certainly learned a lot about speech data along the way.

What’s in the series?

Since this is the introductory post, there’s not going to be much in the way of actual advice here, but I do want to let you know what’s coming. Tips, musings, and resources fall into a few clear stages:

Before you begin
Planning and practicing
Data collection
Transcribing words
Annotations and beyond
Sharing, promoting, and the perks of #openaccess

That means there will be at least six posts, I suppose (no promises on the final count). And since I’m a busy dissertating Ph.D. student, no guarantees on timing either! I will edit this post to reflect what I end up writing and add links as appropriate.

Why does this matter?

Developing speech datasets is time-consuming and expensive. It’s also essential. Part of the before-you-begin post will deal with the question of whether the dataset you’re planning actually needs to be collected. Does something similar exist already? Could you build it from existing resources?

Putting aside questions of whether to record something new, I have no doubts about the importance of quality speech data. Linguistic research offers many examples of findings that differ depending on the kind of speech data you look at. Laboratory speech versus conversation in the wild is a classic comparison. Attending to the type and quality of the speech data informs our science.

If you look at the speech tech sector instead, the importance is still there, even if the goals don’t look the same as in academic linguistics. And there has been a lot of (well, at least some!) chatter about how the workhorse of speech and language technology is data. Knowing how to collect it. Knowing how to curate it. Knowing how to understand it. And then, of course, knowing what to do with it.

So with that, I leave you hanging until my next post. Feel free to drop a comment below if there’s something you’d really like me to cover!
