Automatic Timestamp Recalculation During Audio Transcript Editing

Technical Disclosure Commons

Defensive Publications Series

19 Jan 2021

Hao Wu

Follow this and additional works at: https://www.tdcommons.org/dpubs_series

Recommended Citation

Wu, Hao, "Automatic Timestamp Recalculation During Audio Transcript Editing", Technical Disclosure Commons, (January 19, 2021)

https://www.tdcommons.org/dpubs_series/3984

This work is licensed under a Creative Commons Attribution 4.0 License.

This Article is brought to you for free and open access by Technical Disclosure Commons. It has been accepted for

inclusion in Defensive Publications Series by an authorized administrator of Technical Disclosure Commons.

Automatic Timestamp Recalculation During Audio Transcript Editing

ABSTRACT

Automatic speech-to-text software sometimes produces transcripts that include errors

which users correct manually, e.g., by inserting or deleting words. It is important that the edited

transcript bears word-level timestamps that are faithful to the raw (unedited) transcript. This

disclosure describes techniques that can be applied in real time to automatically and accurately

recalculate timestamps of words in a transcript even as a user makes edits to the transcript. With

immediate, real-time timestamp recalculation as described herein, users can play a recording and

edit the corresponding transcript while simultaneously enjoying transcript-editing utilities like

karaoke-style highlighting; word, sentence, or paragraph insertion, replacement, or deletion;

paragraph break insertion; paragraph merging; playing or sharing selected audio segments

(sentences, paragraphs, etc.) of the transcript; etc.

KEYWORDS

● Transcript editing

● Transcript timestamps

● Audio timestamp

● Speech-to-text

● Karaoke highlighting

● Longest common subsequence (LCS)

BACKGROUND

Automatic speech-to-text software, e.g., used for online or offline meetings or in other

contexts, sometimes produces transcripts that include errors. Users edit and correct transcripts for

various reasons, e.g., to have high-quality records; to reduce misunderstandings that may arise

from inaccuracy; to enable searches through transcripts; to restructure transcripts to make them

easy to read, to quote, or refer to; to enable accurate sharing of information; to generally parse

Wu: Automatic Timestamp Recalculation During Audio Transcript Editing

Published by Technical Disclosure Commons, 2021

information; etc. Transcript-editing software enables users to simultaneously replay the audio,

display the transcript by the side (highlighting karaoke-style the word that is currently being

played), and make corrections to the transcript.

Fig. 1: Transcript-editing software in operation

Defensive Publications Series, Art. 3984 [2021]

https://www.tdcommons.org/dpubs_series/3984

Fig. 1 illustrates an example of transcript-editing software in operation. In this example,

audio that includes a conversation is provided as input to the software. Figures 1(a)-(d) represent

consecutive snapshots of the user interface (UI) in time. Buttons (102) are provided to play or

pause the audio waveform. Fig. 1(a) represents an initial (unedited) transcript, which includes

paragraph timestamps (104a).

In Fig. 1(b), as the audio plays, the user introduces an edit, e.g., the insertion of a

paragraph-break. The newly-formed paragraph is assigned its own paragraph-level timestamp

(104b) based on word-level timestamps. The word-level timestamps remain invisible to the user.

In Fig. 1(c), as the audio continues playing, the user makes another edit (106a), e.g., the

relabeling of the speaker of the second paragraph. Fig. 1(d) shows the paragraph with the

relabeled speaker (106b). Even as the audio continues playing and the user simultaneously makes

edits, the currently played word is highlighted karaoke-style (108, pink).

Fig. 2: A word in a raw transcript is annotated with start and end timestamps

As shown in Fig. 2, speech-to-text software generally annotates each word in a transcript

with timestamps that indicate the start and the end of word-utterances. Although word-level

timestamps do not appear in the user interface, these are critical to many transcript-editing

functions, e.g., karaoke-style highlighting; word, sentence, or paragraph insertion, replacement,

or deletion; paragraph (or other) break insertion; paragraph (or other) merging; playing or

sharing selected audio segments (sentences, paragraphs, etc.) of the transcript; etc.
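
As a rough illustration of how word-level timestamps can drive karaoke-style highlighting, the sketch below stores each word with its start and end time and looks up the word being uttered at a given playback position. The `Word` class, the `word_at` function, and the sample timings are hypothetical names and values chosen for this example; they do not come from any particular speech-to-text API.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording


def word_at(words, playback_time):
    """Return the index of the word uttered at playback_time, or None.

    Assumes words are sorted by start time and do not overlap.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, playback_time) - 1
    if i >= 0 and playback_time < words[i].end:
        return i
    return None  # playback is before the first word or in a pause


# Hypothetical timings, for illustration only.
transcript = [Word("mottled", 0.0, 0.4), Word("dawn", 0.5, 0.9)]
```

A UI could call `word_at` on every playback tick and highlight the returned word; a `None` result corresponds to silence between words.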

A transcript that is edited by the user may include insertions or deletions. For transcript-

editing features to continue to work, it is important that the edited transcript bears word-level

timestamps that are faithful to the raw (unedited) transcript. Specifically, the timestamps of the

new (inserted) words must lie on a continuum between original words at the ends of an inserted

string, the deletion of words must not introduce timestamp inaccuracy, and timestamps must

monotonically increase with the text in the edited transcript.
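
The monotonicity requirement can be checked mechanically. The following is a minimal sketch, assuming each word is represented as a `(text, start, end)` tuple and that equal timestamps between adjacent words are acceptable (the disclosure does not say whether the increase must be strict); `timestamps_are_monotonic` is a hypothetical helper name.

```python
def timestamps_are_monotonic(words):
    """Check that timestamps never go backwards through the transcript.

    words: list of (text, start, end) tuples in transcript order.
    Assumption: ties are allowed, so this tests for non-decreasing times.
    """
    previous_end = float("-inf")
    for _text, start, end in words:
        if start < previous_end or end < start:
            return False
        previous_end = end
    return True
```

Such a check could be run after every recalculation as a sanity test on the edited transcript.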

Although there are transcript-editors that enable users to modify timestamps, such editors

do not automatically recalculate word timestamps, or even display the timestamps on the UI. As

a result, the user has to resort to manually associating sentences or chunks of text to parts of the

recording by looking at and replaying chunks of the audio. It is important to reduce or eliminate

the burden of assigning timestamps to edited words that falls on the user who edits the transcript.

DESCRIPTION

This disclosure describes techniques that can be applied in real time to automatically and

accurately recalculate timestamps of words in a transcript even as a user makes edits to the

transcript. Users can simply edit transcripts as if editing a plain text document (similar to editing

in a text editor) without manually reassigning word timestamps. Since word timestamps are

recalculated as the user makes edits, users can play an audio recording and edit the

corresponding transcript while simultaneously enjoying transcript-editing utilities like karaoke-

style highlighting; word, sentence, or paragraph insertion, replacement, or deletion; paragraph

(or other) break insertion; paragraph (or other) merging; playing or sharing selected audio

segments (sentences, paragraphs, etc.) of the transcript; etc.

Fig. 3: Real-time timestamp-recalculation in audio transcripts

As illustrated in Fig. 3, raw and edited transcripts are compared to determine the longest

common subsequences (LCS) between them (302). LCS determination is cast as a dynamic

programming problem whose complexity is reduced from exponential to polynomial using

windowing, memoization (304), and pruning (306). Once LCS is determined between raw and

edited transcripts, timestamps are calculated for the words in the edited transcripts (308). For

example, the timestamps for words inserted between two original (or anchor) words are

calculated generally as a linear interpolation between the timestamps of the anchor words. The

techniques are described in greater detail below.

Determine the LCS between raw and edited transcripts (dynamic programming formulation)

The length of the longest common subsequence between two sequences of words $s_1$ and $s_2$ is denoted by $C(s_1, s_2)$. For example,

C(‘and yet this mottled dawn’, ‘mottled dawn is unlike our dream’) = 2,

since the longest common subsequence between the strings ‘and yet this mottled dawn’ and

‘mottled dawn is unlike our dream’ is ‘mottled dawn’ which is of length 2. In computing

common subsequence length, the subsequence need not be contiguous. For example,

C(‘to be or not to’, ‘to or be not is to’) = 4

since the longest common subsequence between the input strings is ‘to be not to’ (of length 4), obtained by skipping ‘or’ in the first argument and by skipping ‘or’ and ‘is’ in the second argument.

Given an unedited string $s_1$ of length n + 1 words and an edited string $s_2$ of length m + 1 words, the following recursion holds.

\[
C(s_1[0 \ldots n],\, s_2[0 \ldots m]) =
\begin{cases}
C(s_1[0 \ldots n-1],\, s_2[0 \ldots m-1]) + 1, & \text{if } s_1[n] = s_2[m] \\
\max\bigl( C(s_1[0 \ldots n-1],\, s_2[0 \ldots m]),\; C(s_1[0 \ldots n],\, s_2[0 \ldots m-1]) \bigr), & \text{if } s_1[n] \neq s_2[m]
\end{cases}
\]
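
The recursion can be transcribed almost literally into code. The sketch below is a naive, unmemoized evaluation, so it illustrates the exponential "raw form" rather than a practical implementation. The base case (an empty sequence has LCS length 0) is an assumption added here, since the recursion as stated does not spell it out; the function names `lcs_length` and `C` are chosen for this example.

```python
def lcs_length(s1, s2):
    """Naive evaluation of the LCS-length recursion on two word lists.

    Exponential time in the worst case; for illustration only.
    """
    if not s1 or not s2:
        return 0  # assumed base case: nothing in common with an empty sequence
    if s1[-1] == s2[-1]:
        # Last words match: extend the LCS of the two shorter prefixes.
        return lcs_length(s1[:-1], s2[:-1]) + 1
    # Last words differ: drop the last word of one sequence or the other.
    return max(lcs_length(s1[:-1], s2), lcs_length(s1, s2[:-1]))


def C(a, b):
    """Convenience wrapper that splits two strings into words."""
    return lcs_length(a.split(), b.split())
```

Applied to the two examples given earlier, `C` returns 2 for the ‘mottled dawn’ pair and 4 for the ‘to be or not to’ pair.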

The above recursion sets up the determination of the LCS between strings $s_1$ and $s_2$ as a dynamic programming problem, which, in its raw form, however, is exponential in complexity.

Fig. 4: Windowed LCS

To reduce the complexity of LCS computation, the sequences $s_1$ and $s_2$ being compared are windowed, as illustrated in Fig. 4. The windowed LCS recursion is centered on the user’s current cursor location, and looks back and looks ahead a finite number (n) of words, where n can be, for example, forty. Windowing reduces the number of states in the dynamic programming model such that operations can be completed without exceeding memory constraints.
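
One way to picture the windowing step is as a slice of the word list around the cursor. The sketch below is a minimal illustration, assuming the cursor position is expressed as the number of words preceding it; `window_around_cursor` is a hypothetical helper, and how the window is aligned across the raw and edited transcripts is not specified in the description above.

```python
def window_around_cursor(words, cursor, n=40):
    """Return (offset, window) covering up to n words on each side of the cursor.

    cursor: number of words before the cursor position (an assumption of
    this sketch). offset is the index in words of the first window word,
    so window indices can be mapped back to the full transcript.
    """
    start = max(0, cursor - n)
    stop = min(len(words), cursor + n)
    return start, words[start:stop]
```

With n = 40, each windowed sequence holds at most 80 words, so the state matrix has at most 81 × 81 entries regardless of the transcript length.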

Memoization

Fig. 5: Illustrating memoization (Source: [2])

As illustrated in Fig. 5, memoization can be carried out by caching the state space in a two-dimensional matrix. In this context, a state refers to $C(s_1[0 \ldots i], s_2[0 \ldots j])$ for any $i \in [0, n]$, $j \in [0, m]$. The row-header of the matrix is the unedited transcript, e.g., ABCDA, and the column-header is the edited transcript, e.g., ACBDEA (where A, B, …, E represent words). An entry represents an LCS solution for some truncated $s_1$ and $s_2$. The final LCS between $s_1$ and $s_2$ can be found by backtracking along the pink pathway: the LCS sequence is represented by the dashed red boxes.

For example, the lengths of the longest common subsequences at various entries of the matrix are as follows.

C(A, A) = 1

C(AB, A) = 1

C(ABC, AC) = 2

C(ABC, ACB) = 2

…

C(ABCDA, ACBDEA) = 4
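
The cached matrix and the backtracking step can be sketched as follows. This version fills the cache bottom-up, which is one common way to realize the memoization described above; the function names `lcs_table` and `lcs_pairs` are chosen for this example, and the extra row and column of zeros (for empty prefixes) is an implementation convenience assumed here.

```python
def lcs_table(s1, s2):
    """Build the state matrix: table[i][j] is the LCS length of s1[:i] and s2[:j].

    Rows follow the unedited sequence s1, columns the edited sequence s2.
    """
    n, m = len(s1), len(s2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_pairs(s1, s2):
    """Backtrack from the last entry; return matched (index in s1, index in s2) pairs."""
    table = lcs_table(s1, s2)
    i, j = len(s1), len(s2)
    pairs = []
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            pairs.append((i - 1, j - 1))  # this element belongs to the LCS
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

For the sequences ABCDA and ACBDEA, the last entry of the table is 4, and `lcs_pairs` returns four index pairs. When several subsequences share the maximum length, which one is returned depends on the tie-breaking rule in the backtracking loop.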

Pruning

Fig. 6: Illustrating pruning

Pruning builds on the observation that at every step the recursion increases the length of

the longest common subsequence by at most one. Thus, the maximum value in row j can be at

most 1 greater than that of row j - 1. As a result, the upper-bound of the maximum value of the

next row in the state matrix can be predicted and calculation in that row can be stopped once the

upper-bound value is reached. In light of this, as illustrated in Fig. 6, many entries in the upper

diagonal of the state-space matrix need not be computed, thus further reducing complexity.
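
The sketch below shows one possible reading of this pruning rule; Fig. 6 is not reproduced here, so the exact set of skipped entries in the original may differ. It relies on two facts: entries never decrease along a row, and no entry of a row can exceed the previous row's maximum plus one. Once an entry reaches that bound, the remaining entries of the row are known without further word comparisons. The name `lcs_length_pruned` is hypothetical.

```python
def lcs_length_pruned(s1, s2):
    """LCS length with early termination of each row at its upper bound."""
    m = len(s2)
    prev = [0] * (m + 1)
    for i in range(1, len(s1) + 1):
        # prev[m] is the maximum of the previous row, so this row's
        # maximum can be at most prev[m] + 1.
        bound = prev[m] + 1
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
            if curr[j] == bound:
                # The rest of the row must equal the bound: fill it in
                # directly and skip the remaining comparisons.
                curr[j + 1:] = [bound] * (m - j)
                break
        prev = curr
    return prev[m]
```

The result is identical to that of the unpruned table; only the number of word comparisons per row changes.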

Computing timestamps of words in the edited transcript

Fig. 7: Computing timestamps of words in the edited transcript

As illustrated in Fig. 7, the longest common subsequences between the raw and the edited

transcripts determine the common words at either end of an edited string. These common words

are known as the starting anchor word and the ending anchor word. In the example of Fig. 7, the

string “catfishing them lawyers” in the raw transcript has been edited by the user to “catching

fish and lobsters.” The starting anchor word, i.e., the common word just prior to the edits, is

“here,” and the ending anchor word, i.e., the common word just after the edits, is “tonight.” As

explained earlier, the raw transcript has word-level timestamps (see Fig. 2). These word-level

timestamps can be used to determine the starting and ending timestamp of the edited phrase. In

the example of Fig. 7, the start-time and the end-time of the edited phrase are respectively 0

seconds and 1 second.
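
The anchor-word search can be sketched as follows: matched LCS positions act as anchors, and every run of edited words that falls between two consecutive anchors is an edited phrase needing new timestamps. The helper names `lcs_pairs` and `edited_spans` are hypothetical, and the word lists in the usage note contain only the words quoted from Fig. 7 (punctuation omitted), not the full sentence shown in the figure.

```python
def lcs_pairs(raw, edited):
    """Return matched (raw index, edited index) pairs of one LCS, in order."""
    n, m = len(raw), len(edited)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if raw[i - 1] == edited[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if raw[i - 1] == edited[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def edited_spans(raw, edited):
    """List (start_anchor, end_anchor, new_words) for each edited phrase.

    Anchors are (raw index, edited index) pairs; an anchor is None when
    the phrase touches the beginning or the end of the sequence.
    """
    bounded = [None] + lcs_pairs(raw, edited) + [None]
    spans = []
    for before, after in zip(bounded, bounded[1:]):
        lo = before[1] + 1 if before else 0
        hi = after[1] if after else len(edited)
        if lo < hi:  # there are unmatched edited words between the anchors
            spans.append((before, after, edited[lo:hi]))
    return spans
```

For `raw = ["here", "catfishing", "them", "lawyers", "tonight"]` and `edited = ["here", "catching", "fish", "and", "lobsters", "tonight"]`, the single span has “here” (raw index 0) as its starting anchor, “tonight” (raw index 4) as its ending anchor, and the four new words in between. The raw indices of the anchors give access to the raw word-level timestamps that bound the edited phrase.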

Word-level timestamps can be found for the edited transcript in various ways. For example, as shown in Fig. 7 (red font), the word-level timestamps can be obtained by linear interpolation between the start- and end-times of the edited phrase (accounting for the length of the phrase in letters, 28 in this example, whitespace included). As a special case, if just one word is deleted and replaced, then the replacement word has the same word-level timestamp as the original word. If no anchor words are found, linear interpolation is carried out within the current context (the time boundary of the edited phrase is determined solely by neighboring words in the edited transcript).
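
The character-weighted interpolation can be sketched as below. The exact convention used in Fig. 7 is not recoverable from the text, so this is an assumption-laden illustration: it counts one space between adjacent words and no surrounding whitespace (26 characters for “catching fish and lobsters”, whereas the count of 28 quoted above presumably includes whitespace on both sides). The function name `interpolate_timestamps` is hypothetical.

```python
def interpolate_timestamps(new_words, start_time, end_time):
    """Spread new_words across [start_time, end_time] in proportion to length.

    Returns a list of (word, word_start, word_end) tuples. Assumption of
    this sketch: one space is counted between adjacent words, and the
    time covered by a space falls between the two neighboring words.
    """
    if not new_words:
        return []
    total_chars = sum(len(w) for w in new_words) + len(new_words) - 1
    duration = end_time - start_time
    stamped = []
    offset = 0  # characters consumed so far
    for word in new_words:
        word_start = start_time + duration * offset / total_chars
        offset += len(word)
        word_end = start_time + duration * offset / total_chars
        stamped.append((word, word_start, word_end))
        offset += 1  # the separating space
    return stamped
```

With a single replacement word, the function returns the phrase boundaries unchanged, which matches the special case in which a replaced word inherits the timestamp of the original word. Longer words receive proportionally longer time slices, and the resulting timestamps increase monotonically by construction.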

The described techniques can be implemented as part of standalone transcript editing

software, or transcript editing as implemented in other software, to provide instantaneous updates

in the user interface as a user edits transcripts of any recording, e.g., a recording of an

online/offline meeting.

CONCLUSION

This disclosure describes techniques that can be applied in real time to automatically and

accurately recalculate timestamps of words in a transcript even as a user makes edits to the

transcript. With immediate, real-time timestamp recalculation as described herein, users can play

a recording and edit the corresponding transcript while simultaneously enjoying transcript-

editing utilities like karaoke-style highlighting; word, sentence, or paragraph insertion,

replacement, or deletion; paragraph break insertion; paragraph merging; playing or sharing

selected audio segments (sentences, paragraphs, etc.) of the transcript; etc.

REFERENCES

1. Speech-to-Text: Automatic Speech Recognition. https://cloud.google.com/speech-to-text

2. Dynamic Programming - Longest Common Subsequence | TutorialHorizon. https://algorithms.tutorialhorizon.com/dynamic-programming-longest-common-subsequence/

3. Trint: Audio Transcription Software | Speech to Text to Magic. https://trint.com/

4. Descript | Create podcasts, videos, and transcripts. https://www.descript.com/
