

NIST Automatic Meeting Transcription Project

    Tremendous effort has been devoted to mining information in newswire, news broadcasts, and conversational speech, and to developing interfaces to the metadata extracted in these domains. However, little has been done to address such applications in the more challenging and equally important meeting domain. The meeting domain includes a large number of subdomains, including judicial proceedings, legislative proceedings, lectures, seminars, board meetings, and a variety of less formal group meeting types. All of these meeting types could benefit immensely from the development of automatic recognition, understanding, and information extraction technologies that could be linked with a variety of online information systems. Currently, no such technologies exist.

    The development of smart meeting room technologies that can recognize information about meeting participants, speech, interactions, relationships, decision-making activities, and information dissemination will provide an invaluable resource for a variety of business, academic, and governmental applications. Such core metadata can provide a valuable basis for the development of second-tier meeting applications including automatic minute/transcription generation, summarization/abstraction, translation, retrospective search and retrieval, and semantic analysis. Third-tier applications will provide a context-aware collaborative interface between live meeting participants, remote participants, meeting archives, and vast online resources. Given that the core meeting recognition technologies are in a fledgling or nonexistent state, it is essential that these first-tier recognition technologies be developed before the higher-tier applications can be made useful.

    NIST has begun a project to address development of the core recognition applications and is collaborating with several other government agencies and research institutions that are developing such technologies in the meeting domain.

    Introduction

    NIST has created a project to support the development of audio and video recognition technologies in the context of human-human meetings. The project includes periodic technology evaluations and workshops as well as an extensive data collection effort. In addition to collecting its own multi-media meeting corpora, NIST is collaborating with several other data collection sites to provide a broad base of corpora for research, development, and evaluation. The recognition technologies currently addressed by the effort include speech-to-text transcription (STT), speaker segmentation, and video information extraction.

    NIST conducted a first exploratory evaluation of speech recognition in meetings in Spring 2002 (see Rich Transcription). The evaluation addressed STT and speaker segmentation tasks. It showed that speech recognition performance in this domain is significantly worse than in other, more heavily researched domains (conversational telephone speech and news broadcasts). This is not merely an issue of research funding: the meeting domain poses several acoustic and language challenges not found in those domains. NIST is also collaborating with other agencies in the area of video analysis and content extraction to support evaluation of several video extraction technologies in the context of meetings, including person/object identification/tracking, text extraction, gesture recognition, and other video event-based tasks.
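
    STT performance in these evaluations is scored against reference transcripts using word error rate (WER): the number of word substitutions, deletions, and insertions divided by the number of reference words. The sketch below is illustrative only; the official evaluations use NIST's SCTK scoring tools, which handle alignment and normalization details this standalone version ignores.

        # Minimal word error rate (WER) sketch: word-level edit distance.
        # Illustrative only; NIST evaluations use the SCTK scoring toolkit.
        def wer(reference, hypothesis):
            ref = reference.split()
            hyp = hypothesis.split()
            # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i          # deletions
            for j in range(len(hyp) + 1):
                d[0][j] = j          # insertions
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,        # deletion
                                  d[i][j - 1] + 1,        # insertion
                                  d[i - 1][j - 1] + sub)  # substitution/match
            return d[len(ref)][len(hyp)] / max(len(ref), 1)

        print(wer("the meeting starts at noon",
                  "uh the meeting starts noon"))  # 2 edits / 5 words = 0.4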

    Unique Technical Challenges

    The meeting domain has several important properties that are not found in other domains and are not currently being addressed in other research programs:

    Multiple Forums and Vocabularies
    Meeting forums can range from very informal group meetings to highly structured judicial and legislative proceedings. Likewise, meeting vocabularies can vary widely depending on both the meeting topic and the degree of shared context among the participants. The number and organizational hierarchy of the participants also greatly influence the types of verbal and non-verbal interactions among them.

    Highly-Interactive/Simultaneous Speech
    The speech found in certain forms of meetings is spontaneous and highly interactive across multiple participants. Further, meeting speech contains frequent interruptions and overlapping speech. These attributes pose great challenges to speech recognition technologies, which currently typically assume a single speaker and a single speech stream.
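
    To make the overlap problem concrete, the sketch below measures how much of a meeting contains two or more simultaneous speakers, given per-speaker segment boundaries of the kind produced by human annotation or speaker segmentation. The segment times are hypothetical.

        # Sketch: total time during which two or more speakers talk at once.
        # Speaker names and segment times are hypothetical examples.
        def overlapped_time(speakers):
            """speakers maps speaker -> list of (start, end) times in seconds."""
            events = []
            for segments in speakers.values():
                for start, end in segments:
                    events.append((start, +1))
                    events.append((end, -1))
            events.sort()  # ends sort before starts at equal times (-1 < +1)
            active, last_t, overlap = 0, 0.0, 0.0
            for t, delta in events:
                if active >= 2:
                    overlap += t - last_t
                active += delta
                last_t = t
            return overlap

        meeting = {
            "spk1": [(0.0, 4.0), (6.0, 9.0)],
            "spk2": [(3.0, 7.0)],
        }
        print(overlapped_time(meeting))  # 2.0 seconds: 3-4 and 6-7 overlap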

    Multiple Distant Microphones
    Meetings are generally recorded with multiple distant microphones. Speech recognition techniques currently work quite poorly with distant microphones, and techniques that efficiently integrate input from multiple microphones have yet to be developed. Optimal techniques will exploit the positioning of multiple distant microphones to capture and recognize more speech than a single such microphone could, while taking advantage of the signal redundancy across microphones to improve accuracy.
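
    One classical baseline for exploiting that redundancy (offered here as an illustration, not as a method the project prescribes) is delay-and-sum beamforming: estimate each channel's delay relative to a reference microphone, time-align the channels, and average them so that the speech adds coherently while uncorrelated noise partially cancels. A minimal NumPy sketch under those assumptions; real systems track delays per segment and weight channels adaptively:

        import numpy as np

        # Sketch: delay-and-sum beamforming over distant microphone channels.
        # Delays are estimated against channel 0 by cross-correlation.
        def delay_and_sum(channels):
            """channels: (n_mics, n_samples) array of synchronized recordings."""
            ref = channels[0].astype(float)
            aligned = [ref]
            for ch in channels[1:]:
                ch = ch.astype(float)
                # Lag at which this channel best matches the reference.
                corr = np.correlate(ch, ref, mode="full")
                lag = int(np.argmax(corr)) - (len(ref) - 1)
                # Circular shift for simplicity: undo the estimated delay.
                aligned.append(np.roll(ch, -lag))
            # Averaging aligned channels reinforces the common speech signal
            # while partially cancelling the uncorrelated channel noise.
            return np.mean(aligned, axis=0)

        # Toy example: the same waveform reaches mic 1 five samples late.
        rng = np.random.default_rng(0)
        speech = np.sin(0.05 * np.arange(1000))
        mic0 = speech + 0.3 * rng.standard_normal(1000)
        mic1 = np.roll(speech, 5) + 0.3 * rng.standard_normal(1000)
        enhanced = delay_and_sum(np.stack([mic0, mic1]))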

    Multiple Camera Views
    Unlike commercial broadcasts, in which only a single edited video stream is produced, meetings can be recorded and processed using multiple simultaneous camera views. Much like the multi-microphone challenge above, this both permits and challenges the technology to integrate data from multiple video inputs, enhancing the metadata that can be extracted from the meeting and improving overall recognition accuracy.
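
    As one simple illustration of such integration (the camera names, identities, and scores are hypothetical, and this describes no particular fielded system), per-camera recognition scores for the same tracked person can be pooled so that an unobstructed view outweighs an occluded one:

        from collections import defaultdict

        # Sketch: sum-rule late fusion of per-camera identity hypotheses.
        # Cameras, identities, and confidences are hypothetical examples.
        def fuse_views(per_camera):
            """per_camera maps camera name -> {identity: confidence in [0, 1]}."""
            pooled = defaultdict(float)
            for scores in per_camera.values():
                for identity, conf in scores.items():
                    pooled[identity] += conf
            return max(pooled, key=pooled.get)

        views = {
            "cam_front": {"participant_A": 0.9, "participant_B": 0.1},
            "cam_side":  {"participant_A": 0.4, "participant_B": 0.5},  # occluded
        }
        print(fuse_views(views))  # participant_A wins on pooled evidence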

    Multi-Media Information Integration
    It is impossible to develop a complete understanding of meetings without analyzing a number of different signal types simultaneously: audio (speech, speaker ID, and other sounds); video (faces, gestures, emotions, positions, physical interactions with other people and objects, and non-verbal communication such as group dynamics and agreement); and information sources (the information devices and resources participants interact with).
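
    A first step toward such integration is simply aligning the separate event streams on a shared clock. The sketch below (with hypothetical event labels and a hypothetical one-second window) pairs audio and video events whose timestamps fall close together, the kind of multimodal co-occurrence that higher-tier analysis would build on:

        # Sketch: pair audio and video events that occur close together in
        # time. Event labels and the matching window are hypothetical.
        def co_occurring(audio_events, video_events, window=1.0):
            """Each event list holds (timestamp_seconds, label) tuples."""
            pairs = []
            for a_time, a_label in audio_events:
                for v_time, v_label in video_events:
                    if abs(a_time - v_time) <= window:
                        pairs.append((a_label, v_label))
            return pairs

        audio = [(12.4, "spk2 starts talking"), (30.1, "laughter")]
        video = [(12.6, "spk2 stands up"), (45.0, "slide change")]
        print(co_occurring(audio, video))
        # [('spk2 starts talking', 'spk2 stands up')]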

    Page Created: August 23, 2007
    Last Updated: February 14, 2008
