Cross Domain Automatic Transcription on the TC-STAR EPPS Corpus

Christian Gollan, Maximilian Bisani, Stephan Kanthak, Ralf Schlüter, Hermann Ney
Lehrstuhl für Informatik VI – Computer Science Department
RWTH Aachen University, D-52056 Aachen, Germany
{gollan,bisani,kanthak,schluter,ney}

ABSTRACT

This paper describes the ongoing development of the British English European Parliament Plenary Session corpus. This corpus will be part of the speech-to-speech translation evaluation infrastructure of the European TC-STAR project. Furthermore, we present first recognition results on the English speech recordings. The transcription system has been derived from an older speech recognition system built for the North American broadcast news task. We report on the measures taken for rapid cross-domain porting and present encouraging results.

1. INTRODUCTION

A speech-to-speech translation (SST) system consists of three components: automatic speech recognition (ASR), machine translation and speech synthesis. The development of new approaches for SST demands corpora from a single domain to improve and evaluate all of these components in interaction.

The optimization of a statistical ASR system requires large task-representative databases for acoustic and language model training. Today’s ASR systems can still be improved by the use of 1,000 hours of speech data in the acoustic model training [1]. But the manual transcription of speech data is very time consuming and costly. Depending on the difficulty of the transcription task and on the level of consistency and correctness one wants to achieve, the effort ranges between 10 and 100 man hours for one hour of recorded speech. However, it was reported that the impact of transcription errors on the performance of ASR systems trained on such data is not as bad as one might expect. Sundaram and Picone [2] report that 16% falsely labeled training data leads to a performance loss of 8.5% relative to the baseline on the Switchboard corpus.
It can be assumed that the use of additional cheap but inaccurately transcribed speech data can offset this performance loss.

This work was partially funded by the European Commission under the Human Language Technologies project TC-STAR (FP6-506738).

Unsupervised training relies on methods for the automatic transcription of speech data. Such methods are based on an initial ASR system which is then iteratively improved by its auto-generated transcripts. The transcription errors can be reduced by filtering out the data most likely to contain errors. Such filter methods can be based on confidence measures [3] or, if available, on aligned closed captions [1]. The initial automatic transcription system can be trained on a small manually transcribed subset of the audio data [3]. An attractive alternative is to use an existing ASR system which was developed for a similar task. Due to mismatched conditions (domain, dialect, acoustic environments, transmission channel, etc.) the accuracy of such an initial automatic transcript may be poor, but some steps can be taken to alleviate this problem. This paper describes such a cross-domain porting of an existing ASR system for initial automatic transcription.

2. TC-STAR

The TC-STAR project (Technology and Corpora for Speech to Speech Translation) is envisioned as a long term effort focused on advanced research in all core technologies for SST [4]. The project will target a selection of unconstrained conversational speech domains, i.e. broadcast news, political speeches, and discussion forums, and a few languages relevant for Europe’s economy and society: European English, European Spanish and Chinese. The technical challenges and objectives of the project will focus on the development of new algorithms and methods, integrating relevant human knowledge which is available at translation time into a data-driven framework.
Examples of such new approaches are the extension of statistical machine translation models to handle multiple sentence hypotheses produced by the speech recognizer, the integration of linguistic knowledge in the statistical approach of spoken language translation, the statistical modeling of pronunciation of unconstrained conversational speech in automatic speech recognition, and new acoustic and prosodic models for generating expressive speech in speech synthesis.

0-7803-8874-7/05/$20.00 ©2005 IEEE. ICASSP 2005.

[Figure 1. Labels: audio data (original speaker, interpreters; channels OR, EN, ES, FR), satellite live broadcast, internet service, text data (Rainbow Text Edition, Final Text Editions).]
Fig. 1. Overview of the available EPPS resources

One of the project goals is the implementation of an evaluation infrastructure based on competitive evaluation, in order to achieve the desired breakthroughs in SST. The European Parliament Plenary Session (EPPS) is an attractive domain for the development of such a new evaluation infrastructure. The following section describes this domain and our ongoing data collection effort. At the end of the project, the resulting language resources will be made available to the public through ELDA/ELRA [5].

3. THE EPPS CORPUS

The European Parliament (EP) holds plenary sessions usually six days each month. The major part of the sessions takes place in Strasbourg (France) while the remaining sessions are held in Brussels (Belgium). Today the European Parliament consists of members from 25 countries, and 20 official languages are spoken. The sessions are chaired by the President of the European Parliament. Typically when the president hands over to a member of the parliament, the speaker’s microphone is activated. Interjections from the Parliament are therefore softened in the recording. Simultaneous translations of the original speech are provided by interpreters in all official languages of the EU.
Figure 1 gives an overview of the structure of existing EPPS data.

It is possible to categorize speakers in two ways: Firstly, there are native speakers as well as non-native speakers who have a more or less pronounced accent. Secondly, there are original speakers and interpreters. Although most of the speeches are planned, almost all speakers exhibit the usual effects known from spontaneous speech (hesitations, false starts, articulatory noises). The interpreters’ speaking style is somewhat choppy: dense speech intervals (“bursts”) alternate with pauses when the interpreter is listening to the original speech.

Rainbow Text Edition             Verbatim Transcription
It is for our Parliament, as     It is for our Parliament, as
we have already marked           we have already marked
in a symbolic ceremony           in a symbolic ceremony
outside, a special and           outdoor, a special and
extraordinary moment.            extraordinary moment.
In Dublin last Saturday,         It was described in Dublin
                                 last Saturday captured in
                                 the words of
Ireland’s Nobel literature       Ireland’s Nobel literature
laureate Seamus Heaney           laureate Seamus Heaney,
captured this special event      he talked about and I quote
with the words
...                              ...

Fig. 2. Excerpt of a Rainbow Text Edition and the corresponding transcript from the EPPS early test data

The European Union’s TV news agency, Europe by Satellite (EbS) [6], provides Europe related information via internet and satellite. EbS broadcasts the EP Plenary Sessions live in the original language and the simultaneous translations via satellite on different audio channels: one channel for each official language of the EU and an extra channel for the original untranslated speeches. These channels are additionally available as 30 minute long internet streams for one week after the session. The audio transmissions are monaural. The internet audio streams have a sample rate of 16 kHz and are encoded with the RealAudio Sipro codec at a bit rate of 16 kbit/s.
The satellite audio streams have a sample rate of 48 kHz and are encoded with the MPEG-1 Layer II codec at a bit rate of 64 kbit/s.

In May 2004 we started recording the EPPS broadcasts in five languages (English, Spanish, German, French, and Italian). These EPPS recordings from May to July were made from the EbS transmissions. While in May the recordings were made from internet audio streams, both sources (internet and satellite) were recorded in July. In this period we have recorded a total of 25 hours for each language. The internet streams from May and the satellite streams from July have been selected for transcription. We are currently in the process of manually segmenting and transcribing the English audio streams. As of this writing an early non-validated transcription of the EPPS recording of May 3rd has been produced. This labeled subset has a duration of one hour and will be referred to as the EPPS early test data in the remainder of the paper.

The compilation of texts of the speeches given by members of the European Parliament in plenary sessions is known as the Rainbow Text Edition (RTE). Every speech in these reports appears in the language used by the speaker, who is allowed to make corrections to the text afterwards. The reports are published on the EUROPARL web site [7] on the day after the EPPS. The Final Text Edition (FTE) in all official languages of the EU is accessible about two months later. The web site also provides all previous reports since April 1996. We currently work with the available reports to build an English-Spanish parallel text corpus for the TC-STAR project. The RTE and FTE aim for high readability, and therefore do not provide a strict word-by-word transcript. Notable deviations from the original speech include removal of hesitations, false starts and word interruptions. Furthermore, transposition, substitution, deletion and insertion of words can be observed in the reports. An example is given in Figure 2.

Table 1. Statistics of speech corpora

                    Hub-4 training   Hub-4 test   EPPS early test
acoustic data       96.5h            2.9h         1.0h
silence portion     14%              12%          8%
# speakers          ≈3,157           ≈116         ≈22
# utterances        26,136           728          442
# running words     1,053,050        32,834       8,782

4. SYSTEM DESCRIPTION

We have conducted initial recognition experiments with two objectives in mind: using automatic transcriptions to assist human transcribers, and evaluating the potential of applying unsupervised training methods.

The experiments reported in this paper were performed with our single-pass across-word trigram recognizer. The recognition vocabulary comprises 65k words. More details about the system are given in [8]. The baseline system was set up for the 1997 Hub-4 data from the DARPA benchmark evaluation. This corpus consists of transcribed American English broadcast news recordings. Table 1 gives an overview of the corpus statistics.

The acoustic vectors are computed by applying a linear discriminant analysis to several adjacent vectors consisting of 16 mel-frequency cepstral coefficients without derivatives. The gender-dependent acoustic models consist of triphones which are represented by 6-state HMMs with skip, forward and loop transitions. Gaussian mixtures with a globally pooled diagonal covariance are used for modeling the HMM states, which are tied using a decision tree. Silence is modeled using a single-state HMM which is separated from the states of the other HMMs and not included in state tying. During training, maximum approximation is applied.

Our baseline system achieves a word error rate (WER) of 19.3% on the Hub-4 1997 evaluation test corpus after NIST scoring. Best published results are around 14% WER; however, these systems use multiple passes, speaker adaptation and 4-gram language models [9] [10].

5. LANGUAGE MODEL

The first step in improving performance on the EPPS data was building a new in-domain language model (LM). For this purpose we used the data of the preprocessed EPPS reports from our English-Spanish EPPS parallel text corpus, which was further normalized: e.g. abbreviations and numbers were written out in full. The English text contains 30 million running words and does not contain any EPPS reports covering the time period of the recordings. (These are not available yet.) From this data a trigram LM was built using the SRI Language Modeling Toolkit [11] applying absolute discounting with interpolation (modified Kneser-Ney smoothing) [12].

6. VOCABULARY PORTING

Clearly the out-of-vocabulary (OOV) rate of the EPPS data with the Hub-4 language model is rather high. To alleviate this problem we have added the most frequent 7,000 words from the EPPS data that were missing in the recognition vocabulary. To provide the phonetic transcriptions for the new words we have used the data-driven grapheme-to-phoneme conversion approach described in [13]: a grapheme-to-phoneme conversion model was trained on the existing Hub-4 lexicon, and then used to generate pronunciations for the newly added words. This approach is domain and language independent and requires no human expertise in phonetics or English pronunciation.

A remaining mismatch lies in the pronunciation lexicon: the Hub-4 pronunciation lexicon and acoustic model are designed for North American English. The EPPS data, however, include native British speakers as well as non-English speakers who approximate British or American English pronunciation at various levels of competence. Since acoustic models are strongly tied to the pronunciation dictionary they were trained with, it is not possible to simply change the latter to better match the observed pronunciations.

7. EXPERIMENTS

As a first test, we ran the Hub-4 ASR system as it was on the EPPS recordings.
Considering the large mismatch in domain and speaking style, the recognition results were surprisingly good. Table 2 gives an overview of the experimental results on the one hour long labeled EPPS early test data. The word error rate (WER) of the system was measured with respect to the recently made manual transcripts. The unmodified Hub-4 baseline system achieves a WER of 39.1% on this test data. As expected, the perplexity of the out-of-domain Hub-4 LM on the EPPS transcripts is quite high.

Table 2. Experimental results with the Hub-4 system on the EPPS early test data

acoustic model   lexicon   language model   WER     perpl.   OOV
Hub-4            Hub-4     Hub-4            39.1%   207.1    1.7%
Hub-4            Hub-4     EPPS             34.4%   168.8    1.7%
Hub-4            EPPS      EPPS             33.9%   167.4    1.0%

To improve the performance we built a new LM from the English FTE documents, restricting the vocabulary to that of the existing Hub-4 system. The perplexity of this LM is significantly lower than that of the Hub-4 LM, leading to a 12% relative reduction in WER.

In the next experiment we enlarged the Hub-4 lexicon with the most frequent 7,000 missing words from the vocabulary of the English FTE documents. Pronunciations for these words were generated with grapheme-to-phoneme conversion. This enlarged vocabulary was used for building a new LM. The Hub-4 system modified in this way achieves 33.9% WER on the EPPS early test data.

8. CONCLUSION

We have reported on the EPPS corpus currently being developed in the TC-STAR project. One project goal in developing this corpus is the creation of an evaluation infrastructure for SST.

We generate manual transcripts of the acoustic data. To alleviate this time consuming task we consider the use of automatic transcription systems to aid human transcribers as well as the use of unsupervised training methods. This article describes the first steps in this direction. We presented the rapid and cheap cross-domain porting of an existing Hub-4 system, which has improved the recognition performance by 13% WER relative (from 39% to 34% WER).

9. REFERENCES

[1] L. Nguyen and B. Xiang, “Light Supervision in Acoustic Model Training,” in 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, March 2004, pp. 185–188.
[2] R. Sundaram and J. Picone, “Effects on Transcription Errors on Supervised Learning in Speech Recognition,” in 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, March 2004, pp. 169–172.
[3] F. Wessel and H. Ney, “Unsupervised Training for Broadcast News Speech Recognition,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, December 2001.
[4] “TC-STAR: Technology and Corpora for Speech to Speech Translation,”
[5] Evaluations and Language resources Distribution Agency, “ELDA,”
[6] European Union’s TV news agency, “Europe by satellite,”
[7] The Secretariat of the European Parliament, “EUROPARL: Plenary Session reports,”
[8] A. Sixtus, S. Molau, S. Kanthak, R. Schlüter, and H. Ney, “Recent Improvements of the RWTH Large Vocabulary Speech Recognition System on Spontaneous Speech,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, June 2000, pp. 1671–1674.
[9] Jean-Luc Gauvain, Lori Lamel, Gilles Adda, and Michèle Jardino, “Recent advances in transcribing television and radio broadcasts,” in Proc. European Conf. on Speech Communication and Technology, Budapest, Hungary, Sept. 1999, vol. 2, pp. 655–658.
[10] Long Nguyen, Spyros Matsoukas, Jason Davenport, Daben Liu, Jay Billa, Francis Kubala, and John Makhoul, “Further advances in transcription of broadcast news,” in Proc. European Conf. on Speech Communication and Technology, Budapest, Hungary, Sept. 1999, vol. 2, pp. 667–670.
[11] A. Stolcke, “SRILM - An Extensible Language Modeling Toolkit,” in Proc. Intl. Conf. Spoken Language Processing, September 2002.
[12] Stanley F. Chen and Joshua Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–394, Oct. 1999.
[13] Maximilian Bisani and Hermann Ney, “Multigram-based grapheme-to-phoneme conversion for LVCSR,” in Proc. European Conf. on Speech Communication and Technology, Geneva, Switzerland, Sept. 2003, vol. 2, pp. 933–936.
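
As a cross-check of the relative improvements quoted in Sections 7 and 8, the following sketch recomputes them from the WER values in Table 2. The function name `relative_reduction` and the dictionary layout are illustrative assumptions and are not part of the system described in the paper.

```python
# Illustrative check of the relative WER reductions; values copied from Table 2.

def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction of `improved` relative to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# WER in percent on the EPPS early test data (acoustic model is Hub-4 throughout).
WER = {
    "Hub-4 lexicon, Hub-4 LM": 39.1,
    "Hub-4 lexicon, EPPS LM": 34.4,
    "EPPS lexicon, EPPS LM": 33.9,
}

if __name__ == "__main__":
    base = WER["Hub-4 lexicon, Hub-4 LM"]
    for name, wer in WER.items():
        rel = relative_reduction(base, wer)
        print(f"{name}: WER {wer:.1f}%, relative reduction {rel:.1f}%")
```

With these inputs the in-domain LM alone gives a relative reduction of about 12%, and the in-domain LM together with the enlarged lexicon gives about 13%, which agrees with the figures stated in the text.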