What is the tokenization and alignment approach (i.e. collation)?
As far as I can tell, the model aligns text tokens with each audio frame, puts <pad> for frames with no text token, and puts <unk> right before any series of non-pad tokens.
Is that correct?
So to collate inputs for training, does this mean I should have something like this for text:
- <bos><pad><pad><unk><hello><pad><pad><unk><how><_are>...
and I guess I need to shift these forward by 0.5s as well, if I want to collate for training?
That's right, the token 0 is used just before every new word and the token 3 should be used for padding between words.
Note that a word can be made of multiple tokens, in which case all of them are placed just after the 0.
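For example, a stream covering "hello, how ..." could look something like this (one entry per audio frame; the non-special token ids here are made up for illustration):

```python
# Rough sketch of a frame-aligned text stream, one entry per audio frame.
# 0 = marker placed just before each word, 3 = padding between words.
# The other ids (412, 9081, 530) are made-up text token ids for illustration.
text_stream = [
    3, 3, 3,        # padding: no word starting in these frames
    0, 412, 9081,   # "hello", tokenized into two pieces right after the 0
    3, 3,           # padding until the next word
    0, 530,         # "how", a single-token word
    3,              # ...
]
```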
Many thanks, that's very helpful.
One further question that would help a lot. Right now I'm using whisper-timestamped, which gives me word timestamps, but I then need to work out timestamps per token.
So I'm thinking I:
- Take the full string of text
- Use the word timestamps to apply timestamps to the full string of text based on characters.
- Tokenise the text (which should now also have some tokens with timestamps attached, mapped via characters).
Now, there will be tokens without timestamps. I suppose I should:
- Work left to right, take any tokens without timestamps and put them in a frame right before or after a token with a timestamp (there's a bit of nuance here if there are some words spanning multiple tokens).
- Use unk for any frames right before groups of tokens
- all other frames should then be pad?
Am I overcomplicating this, or is there a simpler policy? Thanks!
Sorry, I'm not sure I get exactly what you mean here. Maybe you can give an example? We also use whisper-timestamped to get word-level timestamps, use these to place the first 0 (with the appropriate text-audio delay), then put the word tokens and the 3s to fill the gaps until the 0 for the next word. The caveat is if there are too many tokens and they would run past the next 0: you can either discard these sequences, or just shift that 0 and the following tokens in that specific case (and then go back to the normal alignment when possible).
This answers my question. So I go:
- pad pad pad
- unk at the first word (delayed) timestamp
- tokens for the first word
- pad until the next word timestamp OR add the next word tokens if that next (delayed) timestamp is reached first
- unk at the next word (delayed) timestamp
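As a sanity check on my understanding, here is a rough sketch of that collation loop. Everything in it is my assumption: `tokenize` stands in for the real text tokenizer, 0/3 are the marker/pad ids from above, the 0.08 s frame hop and 0.5 s delay are the values I'm assuming, and the overflow case is handled by shifting the next 0 forward as you described.

```python
# Sketch of the collation policy above (assumptions only, not the reference code).
PAD, WORD_START = 3, 0
FRAME_HOP = 0.08   # seconds per audio frame (assumed)
DELAY = 0.5        # text-audio delay in seconds (assumed)

def align_words(words, n_frames, tokenize):
    """words: list of (word, start_s, end_s) from whisper-timestamped.
    Returns one text token per audio frame."""
    stream = [PAD] * n_frames
    cursor = 0  # first frame not yet used, so a long word can push the next 0 forward
    for word, start_s, _end_s in words:
        target = int(round((start_s + DELAY) / FRAME_HOP))
        frame = max(target, cursor)          # shift the 0 if the previous word ran long
        tokens = [WORD_START] + tokenize(word)
        if frame + len(tokens) > n_frames:
            break                            # or discard the whole sequence
        stream[frame:frame + len(tokens)] = tokens
        cursor = frame + len(tokens)
    return stream

# Toy usage with a fake tokenizer (roughly one token per three characters).
fake_tokenize = lambda w: [100 + i for i in range(len(w) // 3 + 1)]
words = [("hello", 0.20, 0.55), ("how", 0.70, 0.85), ("are", 0.90, 1.05)]
print(align_words(words, n_frames=25, tokenize=fake_tokenize))
```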
@lmz I have a few follow-up questions, if you don't mind. I'm trying to collate the data so that I can run a forward pass (i.e. model.forward(**batch_output)).
- Is the frame hop 0.08 seconds (1920 frame size / 24000 sample rate)?
- The text should have a text bos prepended, and the audio should have an audio bos prepended, correct?
- For inputs, I get the best results on a forward pass (of text and audio) when I set a delay of around 0.8 seconds on the text tokens. This makes me think either I have the wrong frame hop OR the wrong delay (I know 0.5 s should be the delay for the 1B param model); I've sketched the arithmetic I'm using below.
- Even with that 0.8s delay, the forward method does not exactly match the generate method. Is there any reason why running a forward method (greedy) would yield different results to the generate method?
- When I run the forward method, and compare side-by-side with the generate method, I find that the first non-pad token coming from the generate method appears to have a 0.8-0.9 second delay. Is there additional padding to be done beyond the 0.5 second delay?
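For reference, here is the frame/delay arithmetic I'm assuming when building the batch (the 1920/24000 figures are from my first question above; `frame_index` is just my own illustrative helper, not anything from the library):

```python
# Timing arithmetic I'm assuming when building batches for the forward pass.
frame_size = 1920
sample_rate = 24000
frame_hop = frame_size / sample_rate   # 0.08 s, i.e. 80 ms per audio frame
delay = 0.5                            # assumed text-audio delay in seconds

def frame_index(word_time_s: float) -> int:
    """Audio frame into which a delayed word timestamp falls."""
    return int((word_time_s + delay) / frame_hop)

print(frame_hop)          # 0.08
print(frame_index(1.23))  # a word at 1.23 s lands in frame 21
```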
Thanks!
BTW the generate method works well. My goal is to get model.forward working, because once that works I'll know how to set up training.
Howdy @lmz, if you have a few mins, any guidance you could share would help a lot. I'm preparing a video on the topic for YouTube.com/@trelisresearch.
A couple answers below:
- Frame hop is indeed 80ms.
- Indeed the text should start with a bos `text_initial_token_id` and the audio with token 2048 `initial_token_id`. This should be handled for you in `LMGen` and `LMModel`, see here.
- 0.5s should indeed be the delay, but it might be after the end of the word rather than the start in this model's case, I would have to double check (we had both variants at some point).
- Not sure which `forward` method you mean here. I guess the best way to line things up would be to add print statements on both sides and see what the model actually takes as input.
- Probably the time it takes to get to the first word + the potential word duration?
Interesting, so the delay might be on a word by word basis, and always be 0.5 + the word duration?
The forward method I mean is model.forward(), which I can run once I load the model with transformers.
And yes, I have added those print statements to compare, but I'll try to make a small repro example next time if I can't get things working, and share that here.
Yes I can confirm that for this model the delay is taken from the end of the word rather than the beginning, so it's indeed 0.5 + word-duration when taken from the start of the word.
Thanks very much, that helped a lot. To be clear, is it:
a) place the unk token for each word at its ending timestamp plus the delay, or
b) place the unk token for the first word at its ending timestamp plus the delay, and then delay all later unk tokens by that same amount.
My guess is a, correct?
I think it's more like (a): for each word we shift the end timestamp by the delay, the word tokens start at this point in time, and we put a unk token just before the word starts (except maybe for the first word).
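So per word, the placement looks roughly like this (a sketch only; the 0.08 s hop and 0.5 s delay are the values from this thread, and the exact rounding is illustrative):

```python
# Per-word placement under option (a): the delay is measured from the word's
# *end* timestamp; the word's text tokens start at that delayed frame and the
# word-start marker 0 goes one frame earlier. Constants as discussed above.
FRAME_HOP = 0.08  # seconds per audio frame
DELAY = 0.5       # text-audio delay in seconds

def place_word(word_end_s):
    """Frame indices for one word: (frame of the 0 marker, frame where its tokens start)."""
    first_token_frame = int(round((word_end_s + DELAY) / FRAME_HOP))
    return first_token_frame - 1, first_token_frame

# A word ending at 0.55 s: its 0 marker lands in frame 12, its tokens start in frame 13.
print(place_word(0.55))  # -> (12, 13)
```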