Fig. 1 illustrates an example of transcript-editing software in operation. In this example,
audio that includes a conversation is provided as input to the software. Figs. 1(a)-(d) show
consecutive snapshots of the user interface (UI) over time. Buttons (102) play or pause the
audio. Fig. 1(a) shows the initial (unedited) transcript, which includes
paragraph timestamps (104a).
In Fig. 1(b), as the audio plays, the user introduces an edit, e.g., inserts a paragraph
break. The newly formed paragraph is assigned its own paragraph-level timestamp
(104b), derived from the word-level timestamps. The word-level timestamps remain invisible to the user.
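The paragraph-break edit can be sketched in a few lines of Python. This is only an illustrative sketch, not the disclosed implementation: the `Word` and `Paragraph` types and the `insert_paragraph_break` helper are hypothetical names, and the rule that a paragraph's timestamp equals the start time of its first word is an assumption (the text says only that it is based on word-level timestamps).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # start of the utterance, in seconds
    end: float    # end of the utterance, in seconds


@dataclass
class Paragraph:
    speaker: str
    words: List[Word]

    @property
    def timestamp(self) -> float:
        # Assumption: the paragraph-level timestamp is the start time
        # of the paragraph's first word.
        return self.words[0].start


def insert_paragraph_break(p: Paragraph, index: int) -> Tuple[Paragraph, Paragraph]:
    """Split p so that the word at position `index` starts the new paragraph."""
    if not 0 < index < len(p.words):
        raise ValueError("break must fall strictly inside the paragraph")
    # Both halves keep the original speaker label; each half derives its own
    # timestamp from its first word, so nothing needs to be stored separately.
    return Paragraph(p.speaker, p.words[:index]), Paragraph(p.speaker, p.words[index:])
```

Because the paragraph timestamp is computed from the words rather than stored, the new paragraph picks up a correct timestamp as soon as the split is made.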
In Fig. 1(c), as the audio continues playing, the user makes another edit (106a), e.g.,
relabels the speaker of the second paragraph. Fig. 1(d) shows the paragraph with the
relabeled speaker (106b). Even as the audio continues playing and the user simultaneously makes
edits, the currently played word is highlighted karaoke-style (108, pink).
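One plausible way to find the word to highlight at a given playback position is a binary search over word start times. The sketch below is hedged accordingly: `current_word_index` is a hypothetical helper, and it assumes each word is represented as a (start, end) pair in seconds, sorted by start time and non-overlapping.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def current_word_index(spans: List[Tuple[float, float]], t: float) -> Optional[int]:
    """Return the index of the word being uttered at playback time t, or None.

    Assumes `spans` holds (start, end) times in seconds, sorted by start
    and non-overlapping.
    """
    starts = [start for start, _ in spans]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < spans[i][1]:
        return i
    # t falls before the first word, in a pause between words, or after the last word.
    return None
```

If the lookup runs against the current word list on every playback tick, edits made while the audio plays are reflected in the highlighting without any extra bookkeeping.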
Fig. 2: A word in a raw transcript is annotated with start and end timestamps
As shown in Fig. 2, speech-to-text software generally annotates each word in a transcript
with timestamps that indicate the start and the end of the word's utterance. Although word-level
timestamps do not appear in the user interface, they are critical to many transcript-editing
functions, e.g., karaoke-style highlighting; word, sentence, or paragraph insertion, replacement,
or deletion; paragraph (or other) break insertion; paragraph (or other) merging; playing or
sharing selected audio segments (sentences, paragraphs, etc.) of the transcript; etc.
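As an illustration of how word-level timestamps support playing or sharing a selected segment, the following sketch maps a selection of words to the audio interval it covers. The `TimedWord` record and the `selection_audio_span` helper are hypothetical; the field layout of an actual speech-to-text output may differ.

```python
from typing import List, NamedTuple, Tuple


class TimedWord(NamedTuple):
    text: str
    start: float  # seconds from the beginning of the audio
    end: float


def selection_audio_span(words: List[TimedWord], first: int, last: int) -> Tuple[float, float]:
    """Return the (start, end) audio interval covering words[first..last], inclusive."""
    if not 0 <= first <= last < len(words):
        raise IndexError("selection out of range")
    # The segment runs from the start of the first selected word
    # to the end of the last selected word.
    return words[first].start, words[last].end
```

The same interval could be handed to an audio player or a clip exporter; which of these the software does is outside the scope of this sketch.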
Wu: Automatic Timestamp Recalculation During Audio Transcript Editing
Published by Technical Disclosure Commons, 2021