xml - Advanced boolean search of JSON files containing speech-to-text data?
I have hundreds of automatic machine transcripts of video and audio files. I have every transcript in 5 formats: JSON, XML, SRT, VTT, TXT. (Click here to see example files.) The JSON and XML files contain comprehensive data, including speaker ID, confidence level, and timecodes.
I am looking for a way to mine or search this data to find words and phrases. I need to be able to submit a boolean search query, click on a result, and play the video/audio file at the timecode of the text result. The necessary boolean operators are NOT, AND, OR (just like an online search engine). Example search: ("baseball bat" AND park) OR soccer
I'm thinking of a simple interface.
Basic options:
- search box
- minimum confidence level slider
Ideas for advanced options:
- speaker: "bob,joe,bill" (that is, the speaker must be one of these)
- maximum time allowed between words in an AND search: x.x seconds
- maximum time allowed between words in an exact phrase search: x.x seconds
- words in an exact phrase search must have the same speaker: on/off
- words between AND must have the same speaker: on/off
- words between OR must have the same speaker: on/off
- words between AND must be found in chronological order: on/off
- ignore punctuation: on/off
Simply put, I need an agent to ransack the timecodes and, if possible, the miscellaneous options. I know this is a specific and complex request. :) Can anyone give me leads on this idea? I don't want to reinvent the wheel. What software/command-line program/engine comes closest to being able to do this? Perhaps I can adapt from there.
Thanks!
You can implement such a system on top of Solr/Lucene (http://lucene.apache.org/solr); however, it takes some experience to implement the required features.
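As a rough illustration, the example query from the question is already valid in Solr's standard query syntax, so it could be sent to a `/select` handler more or less unchanged. The sketch below only builds the request URL. The core name `transcripts` and the field names `text`, `confidence`, `file`, `start`, and `speaker` are assumptions for illustration, and it presumes each transcript segment was indexed as its own document.

```python
from urllib.parse import urlencode

# Hypothetical local Solr core holding one document per transcript segment.
SOLR_SELECT = "http://localhost:8983/solr/transcripts/select"

def build_solr_url(query, min_conf=0.0):
    """Build a Solr select URL for a boolean query with a confidence floor."""
    params = {
        "q": query,                              # e.g. ("baseball bat" AND park) OR soccer
        "df": "text",                            # default field to search (assumed name)
        "fq": f"confidence:[{min_conf} TO *]",   # filter out low-confidence segments
        "fl": "file,start,speaker,text",         # fields to return (assumed names)
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = build_solr_url('("baseball bat" AND park) OR soccer', min_conf=0.8)
print(url)
```

The returned `file` and `start` values would then be enough for a front end to open the media file at the right timecode.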
For an open source implementation of speech archival and indexing, you can check Matterhorn.
You can find details on Matterhorn speech indexing in this presentation.
However, this is not the only way to implement such functionality; you can proceed with the language of your choice and simple tools. Ruby/PHP or Node.js will work here.
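To give an idea of how little is needed for the "simple tools" route, here is a minimal sketch in Python (any of the languages above would do equally well). It assumes each transcript has already been loaded from its JSON file into a list of word records with `word`, `start`, `end`, and `confidence` keys; those key names are hypothetical and would need to be mapped to whatever the real files use. It covers AND/OR/NOT with parentheses, exact phrases, the minimum confidence slider, the maximum gap between phrase words, and ignoring punctuation. The speaker options are left out.

```python
import re
import string

def normalize(token):
    """Lowercase and strip surrounding punctuation ("ignore punctuation")."""
    return token.lower().strip(string.punctuation)

def find_phrase(words, phrase, min_conf=0.0, max_gap=2.0):
    """Return start times where the phrase's words occur consecutively."""
    terms = [normalize(t) for t in phrase.split()]
    if not terms:
        return []
    # Words below the confidence threshold are dropped before matching.
    kept = [w for w in words if w["confidence"] >= min_conf]
    hits = []
    for i in range(len(kept) - len(terms) + 1):
        window = kept[i:i + len(terms)]
        if [normalize(w["word"]) for w in window] != terms:
            continue
        # Each word must start within max_gap seconds of the previous word's end.
        if all(b["start"] - a["end"] <= max_gap for a, b in zip(window, window[1:])):
            hits.append(window[0]["start"])
    return hits

def search(words, query, min_conf=0.0, max_gap=2.0):
    """Evaluate a boolean query; return (matched, sorted hit timecodes)."""
    tokens = re.findall(r'"[^"]*"|[()]|[^\s()"]+', query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        matched, hits = parse_and()
        while peek() == "OR":
            advance()
            m2, h2 = parse_and()
            matched, hits = matched or m2, hits + h2
        return matched, hits

    def parse_and():
        matched, hits = parse_not()
        while peek() == "AND":
            advance()
            m2, h2 = parse_not()
            matched = matched and m2
            hits = hits + h2 if matched else []
        return matched, hits

    def parse_not():  # NOT, parentheses, or a single term/phrase
        if peek() == "NOT":
            advance()
            matched, _ = parse_not()
            return not matched, []
        if peek() == "(":
            advance()
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        if peek() in (None, ")", "AND", "OR"):
            raise ValueError("expected a search term")
        hits = find_phrase(words, advance().strip('"'), min_conf, max_gap)
        return bool(hits), hits

    matched, hits = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return matched, sorted(set(hits))

# Tiny made-up transcript: (word, start, end, confidence)
rows = [("The", 0.0, 0.2, 0.98), ("baseball", 0.3, 0.8, 0.95),
        ("bat", 0.9, 1.1, 0.91), ("was", 1.2, 1.3, 0.99),
        ("in", 1.4, 1.5, 0.97), ("the", 1.6, 1.7, 0.99),
        ("park.", 1.8, 2.2, 0.88)]
words = [dict(zip(("word", "start", "end", "confidence"), r)) for r in rows]

print(search(words, '("baseball bat" AND park) OR soccer'))
# (True, [0.3, 1.8])
```

The hit list holds the start time of every matching term or phrase, which is what a player would seek to when a result is clicked. One design choice to be aware of: because low-confidence words are filtered out before phrase matching, a phrase can match across a dropped word as long as the remaining words stay within `max_gap`.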