Hi, everyone! My name is Mike Speriosu and I was the main developer of a set of TV search improvements in SnapStream 6.1 that involve language. I recently finished my doctorate at the University of Texas at Austin where I worked on problems at the intersection of human language and computer science, and was lucky enough to join the SnapStream development team a few months ago.
Products like SnapStream are part of an ongoing trend in technology to help people interact with each other and with computers using their most advanced and expressive mode of communication: language. After working on a variety of linguistic and software problems in academia, I was very eager to apply my skills to a real-life product that could benefit from more linguistic intelligence.
One of the most compelling features of SnapStream's TV media monitoring technology has always been its ability to search recordings and send alerts based on what’s said on TV, and now recording and searching closed caption data work better than ever.
Closed Captioning Correction
Capitalization
A lot of the closed caption (CC) text that initially comes through with the video and audio signal is messy in various ways. Many CCs are in all caps or just capitalized oddly, so at least some degree of re-capitalization is necessary before we show and save the text, so that it is easy to read.
Previously, we essentially lowercased everything except the first letter of every sentence. We now use a combination of algorithms and dictionaries to be smarter about what should be capitalized. Names of people, places, companies, and more are now often properly capitalized.
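To make the idea concrete, here is a minimal sketch of dictionary-based re-capitalization. It is not SnapStream's actual implementation; the PROPER_NOUNS table and the recapitalize function are hypothetical names made up for illustration, and a real system would use far larger dictionaries plus statistical cues.

```python
import re

# Hypothetical dictionary of known names, keyed by their lowercase form.
PROPER_NOUNS = {"austin": "Austin", "texas": "Texas", "snapstream": "SnapStream"}


def recapitalize(caption: str) -> str:
    """Lowercase a caption, then restore known names and sentence starts."""
    text = caption.lower()
    # Swap in the dictionary's capitalization for every known word.
    text = re.sub(
        r"[a-z]+",
        lambda m: PROPER_NOUNS.get(m.group(0), m.group(0)),
        text,
    )
    # Capitalize the first letter of the text and of each new sentence.
    return re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(recapitalize("WELCOME TO AUSTIN, TEXAS. SNAPSTREAM IS HERE."))
```

The old behavior described above corresponds to the second substitution alone; the dictionary lookup is what lets names keep their capitals in the middle of a sentence.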
Misspellings
Another problem with CC text is that it sometimes contains misspellings. We now use a dictionary and statistical model to detect when we think a word is misspelled, and automatically correct the spelling when we have high confidence that our fix is correct. We took special care not to make this feature too aggressive, because we know how annoying automatic corrections can be when they’re wrong. Making a valid correction, however, could be the difference between getting a relevant alert and not getting it.
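One way to picture "correct only when confident" is the sketch below. It is an illustration under stated assumptions, not the model SnapStream ships: WORD_COUNTS is a tiny made-up frequency table, candidates are limited to words one edit away, and the 0.9 confidence threshold is an arbitrary choice.

```python
from collections import Counter

# Hypothetical word counts standing in for a real dictionary and statistical model.
WORD_COUNTS = Counter({"president": 900, "resident": 400, "the": 5000, "speech": 300})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, swap, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct(word, min_confidence=0.9):
    """Fix a word only if one known candidate clearly dominates the others."""
    if word in WORD_COUNTS:
        return word  # Already a known word: leave it alone.
    candidates = {w for w in edits1(word) if w in WORD_COUNTS}
    if not candidates:
        return word  # Nothing plausible nearby: leave it alone.
    total = sum(WORD_COUNTS[w] for w in candidates)
    best = max(candidates, key=lambda w: WORD_COUNTS[w])
    # Share of the candidate counts held by the best candidate.
    confidence = WORD_COUNTS[best] / total
    return best if confidence >= min_confidence else word


print(correct("presidnet"))  # only one candidate, so it is corrected
print(correct("pesident"))   # two plausible candidates, so it is left as-is
```

The second call shows the conservative side of the design: "pesident" could be "president" or "resident", so the sketch declines to guess.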
To give SnapStream customers more control over these CC-altering features, we've included a custom dictionary where administrators can specify their own words and phrases that they want capitalized a certain way. We’ll never attempt to change the spelling of words in this custom dictionary.
We’re trying out these advanced CC-altering features on English text only, with plans to expand to other languages in the future. In the meantime, we have an algorithm that ensures incoming CCs are English before applying the features, so that your favorite Spanish or French shows don’t have their spelling corrected as if they’re English.
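A very rough way to gate features on language is to check how many words in a caption are common English function words. The sketch below does only that; the word list, the 0.2 threshold, and the looks_english name are all assumptions for illustration, and the algorithm used in the product may work quite differently.

```python
import re

# Small hypothetical sample of very common English function words.
ENGLISH_STOPWORDS = {
    "the", "and", "is", "of", "to", "in", "that", "it",
    "for", "was", "on", "with", "you", "this",
}


def looks_english(caption: str, threshold: float = 0.2) -> bool:
    """Guess whether a caption is English from its share of common English words."""
    # Match runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", caption.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold


print(looks_english("The president is in the city for a speech"))
print(looks_english("El presidente está en la ciudad para un discurso"))
```

Corrections would then be applied only to captions for which the check returns True.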
Did You Mean? Search Suggestions
When you searched in previous versions of SnapStream, you would sometimes get suggestions in the form of Did you mean ____? but the quality of these suggestions left a lot to be desired. Our suggestion engine now takes into account the statistical properties of your entire library of recordings, resulting in much more useful suggestions. We also now have the ability to give suggestions when searching the program guide, something that was absent from past releases.
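As a sketch of how library statistics can drive suggestions, the example below builds word counts from a few made-up transcripts and suggests the most frequent close match for an unknown search term. The function names and the 0.8 similarity cutoff are assumptions; difflib is Python's standard-library fuzzy matcher, used here as a stand-in for whatever similarity measure the real engine uses.

```python
import difflib
from collections import Counter


def build_vocabulary(transcripts):
    """Count how often each word appears across the whole library."""
    counts = Counter()
    for text in transcripts:
        counts.update(text.lower().split())
    return counts


def did_you_mean(query, vocabulary, cutoff=0.8):
    """Suggest the most frequent library word that closely resembles the query."""
    term = query.lower()
    if term in vocabulary:
        return None  # The term already occurs in the library.
    close = difflib.get_close_matches(term, list(vocabulary), n=5, cutoff=cutoff)
    if not close:
        return None
    # Among similar words, prefer the one the library mentions most often.
    return max(close, key=lambda w: vocabulary[w])


library = [
    "the senate passed the budget",
    "senate leaders met today",
    "the senator spoke",
]
vocabulary = build_vocabulary(library)
print(did_you_mean("sennate", vocabulary))
```

Because the candidates come from your own recordings, a suggestion always points at something that actually appears in the library.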
Suffixes
Another improvement we’ve made to search makes it so that suffixes like -s, -ing, -ed, and so on do not affect whether a word counts as a match for your search term. For example, searching for campaign will now match recordings that mentioned campaign, campaigns, campaigned, and campaigning, giving you more relevant results with less effort.
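Suffix-insensitive matching is commonly done by reducing both the search term and the caption words to a shared stem before comparing them. The sketch below uses a deliberately naive suffix stripper; the suffix list and the stem and matches functions are hypothetical, and production search engines typically rely on a proper stemming algorithm instead.

```python
# Suffixes to strip, checked in order. A deliberately small, hypothetical list.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip one common suffix, keeping at least three letters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, caption: str) -> bool:
    """True if any caption word shares a stem with the query."""
    target = stem(query)
    return any(stem(word) == target for word in caption.split())


for form in ("campaign", "campaigns", "campaigned", "campaigning"):
    print(form, "->", stem(form))
```

All four forms reduce to the same stem, so a search for any one of them matches captions containing the others.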
Synonyms
We’ve also added some lists of synonyms to our search engine, so that searching for big will also match large, and similarly for many other words. We made an effort only to make such connections between words that really do mean the same thing the vast majority of the time.
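Synonym matching can be pictured as query expansion: the search term is replaced by the set of words considered equivalent to it. In this sketch only the big/large pair comes from the text above; the second group and the expand_query function are made up for illustration.

```python
# Groups of words treated as interchangeable. Only big/large is from the post;
# the second group is a hypothetical example.
SYNONYM_GROUPS = [
    {"big", "large"},
    {"car", "automobile"},
]

# Map each word to the full group it belongs to.
SYNONYMS = {word: group for group in SYNONYM_GROUPS for word in group}


def expand_query(term: str) -> set:
    """Return the term together with its synonyms, or just the term if it has none."""
    term = term.lower()
    return set(SYNONYMS.get(term, {term}))


print(sorted(expand_query("big")))
print(sorted(expand_query("campaign")))
```

Keeping the groups small and hand-checked reflects the caution described above: a word is only expanded when its synonyms mean the same thing nearly all the time.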
Pronoun Matching
When you search for a name, e.g. Lebron, SnapStream now also returns results for pronouns like he, she, and I that likely refer to that person, in addition to exact matches on the name itself.
Pronouns are allowed to match names for up to about one minute after the name is mentioned. We hope this feature makes it even easier to find relevant information in your recordings without spending a long time refining your search query.
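The time-window rule can be sketched as follows, assuming captions arrive as (seconds, word) pairs in time order. The data layout, the find_hits function, and the exact 60-second cutoff are assumptions for illustration; the real feature presumably uses more context to decide which pronouns refer to the named person.

```python
# Pronouns named in the post, and an assumed window of one minute.
PRONOUNS = {"he", "she", "i"}
WINDOW_SECONDS = 60.0


def find_hits(captions, name):
    """Return name matches plus pronouns within the window after a name mention.

    captions is a list of (seconds, word) pairs in time order.
    """
    name = name.lower()
    hits = []
    last_mention = None  # Time of the most recent exact name match.
    for seconds, word in captions:
        token = word.lower()
        if token == name:
            last_mention = seconds
            hits.append((seconds, word))
        elif (
            token in PRONOUNS
            and last_mention is not None
            and seconds - last_mention <= WINDOW_SECONDS
        ):
            hits.append((seconds, word))
    return hits


captions = [(0.0, "Lebron"), (5.0, "scored"), (12.0, "he"), (90.0, "he")]
print(find_hits(captions, "lebron"))
```

The pronoun at 12 seconds counts as a hit, while the one at 90 seconds falls outside the window and is ignored.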
Accent Ignoring
Finally, we've made it so SnapStream treats accented characters and their unaccented counterparts as equivalent, so you don't have to worry about exact spelling in your searches to return all possible matches. If you search for entree, you'll get hits for both entree and entrée. And if you search for entrée, you'll get hits for both entrée and entree.
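A common way to get this behavior is to fold accents on both the search term and the indexed text before comparing them. The sketch below uses Python's standard unicodedata module; fold_accents is a hypothetical helper, and this is one possible approach rather than a description of SnapStream's internals.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with accent marks removed and case differences ignored."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold_accents("entrée"))
print(fold_accents("entrée") == fold_accents("entree"))
```

Because both sides are folded the same way, the match works in either direction, as in the entree and entrée example above.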
The End
We added these features with the intent of making recording and searching just work better, even if you don’t always notice the new feature that kicked in and just made your life easier. We appreciate software that simply works, and hope we’ve achieved that goal with this update!
About Mike Speriosu
Mike Speriosu received Bachelor's degrees in Computer Science and Linguistics from Stanford University and a Master's and PhD in Linguistics from the University of Texas at Austin. He has published work in computational linguistics and done consulting in software development for companies looking to beef up the linguistic intelligence of their products. He is now on the development team at SnapStream.