Integrating Speech Recognition and Machine Translation: Where Do We Stand?

Evgeny Matusov, Stephan Kanthak, Hermann Ney
Lehrstuhl für Informatik VI - Computer Science Department
RWTH Aachen University, Aachen, Germany
{matusov,kanthak,ney}

ABSTRACT

This paper describes improvements to the interface between speech recognition and machine translation. We modify two different machine translation systems to effectively process dense speech recognition lattices. In addition, we describe how to fully integrate speech recognition with machine translation based on weighted finite-state transducers. With a thorough set of experiments, we show that both the acoustic model scores and the source language model positively and significantly affect the translation quality. We have found consistent improvements on three different corpora compared with translations of single best recognized results.

1. INTRODUCTION

Over the last decade it has been demonstrated by many publications and research projects that automatic speech recognition (ASR) and machine translation (MT) can be coupled in order to directly translate spoken utterances into another language. Whereas the simplest speech translation systems translate the single best recognizer output, a few attempt to benefit from considering multiple recognition hypotheses for an utterance. Such attempts can be classified by the type of input that the systems use. A simple extension to translating only the single best ASR output is translation of the N-best ASR hypotheses. Recently, moderate improvements with this approach were reported by e.g. [3] and [4]. A tighter coupling of ASR and MT is reached when word lattices are translated; the lattices can also be converted to confusion networks. In the past, some improvement of translation quality was achieved by using lattices with small densities [12]. Finally, a fully integrated approach can be pursued, in which the whole search space of ASR and MT is combined.
In the past, this approach was successful only on very small tasks [13].

When coupling speech recognition and machine translation, the recognition model scores and the translation model scores can be combined to improve translation performance. A theoretical basis for the score combination was given in [9]. One can differentiate between joint probability speech translation systems and conditional probability systems. In both types of systems, the ASR acoustic and language model scores can be combined with the translation features. The recognition features can either be included directly in the search, or in a post-processing step by rescoring word lattices or N-best lists.

This paper is organized as follows. Based on the presentation of [9], Section 2 reviews the Bayes' decision rule for speech translation. Starting from there, in Section 3 we show how ASR word lattices can be translated and review the basics of our two speech translation systems: the joint-probability system and the phrase-based system that employs log-linear modeling. Section 4 explains the functionality of the fully integrated speech translation system. In Section 5 we present significant improvements in translation quality when we utilize recognition features in translation and optimize the model scaling factors.

2. BAYES' DECISION RULE FOR SPEECH TRANSLATION

In speech translation, we try to find the target language sentence e_1^I which is the translation of a speech utterance represented by acoustic vectors x_1^T. In order to minimize the number of sentence errors, we maximize the posterior probability of the target language translation given the speech signal (see [9]).
The source words f_1^J are introduced as a hidden variable:

  ê_1^Î = argmax_{I, e_1^I} Pr(e_1^I | x_1^T)
        = argmax_{I, e_1^I} { Pr(e_1^I) · Pr(x_1^T | e_1^I) }
        = argmax_{I, e_1^I} { Pr(e_1^I) · Σ_{f_1^J} Pr(f_1^J | e_1^I) · Pr(x_1^T | f_1^J, e_1^I) }
        ≅ argmax_{I, e_1^I} { Pr(e_1^I) · max_{f_1^J} { Pr(f_1^J | e_1^I) · Pr(x_1^T | f_1^J) } }

Note that we made the natural assumption that the speech signal does not depend on the target sentence and approximated the sum over all possible source language transcriptions by the maximum. Pr(x_1^T | f_1^J) may be a standard acoustic model, and Pr(e_1^I) is the target language model.

As already stated in [9], the conditional probability terms Pr(f_1^J | e_1^I) and Pr(e_1^I) can be rewritten when using a joint probability translation model:

  ê_1^Î ≅ argmax_{I, e_1^I} { max_{f_1^J} { Pr(f_1^J, e_1^I) · Pr(x_1^T | f_1^J) } }

This simplifies coupling the systems since the joint probability translation model can be used instead of the usual language model in speech recognition (see Section 4).

It should be noted that speech translation based on word lattices uses the additional approximations that word boundary times are fixed and that many word sequences may never be contained in the word lattice due to the word-pair or word-triple approximation.

3. SPEECH TRANSLATION SYSTEMS AT RWTH

3.1. WFST-Based Joint Probability System

The joint probability MT system (referred to as FSA; for a more detailed description see also [6]) is implemented with WFSTs. First, the training corpus is transformed as shown in Figure 1, based on a word alignment. Then a statistical m-gram model is trained on the bilingual corpus. This language model is represented as a finite-state transducer Tr which is the final translation model.
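As a concrete illustration of the corpus transformation that precedes the m-gram training, the sketch below turns one aligned sentence pair into bilingual tokens. The helper name, the example alignment, and the convention of joining multi-word targets with "_" are our assumptions for this sketch, not necessarily the system's exact conventions.

```python
def to_bilingual_tokens(src_words, tgt_words, alignment):
    """Turn an aligned sentence pair into bilingual tokens 'f|e'.

    `alignment` maps each source position to a (possibly empty) list of
    target positions; unaligned source words pair with the empty word.
    Multi-word targets are joined with '_' (an assumption of this sketch).
    """
    tokens = []
    for j, f in enumerate(src_words):
        aligned = alignment.get(j, [])
        e = "_".join(tgt_words[i] for i in aligned) if aligned else "<eps>"
        tokens.append(f + "|" + e)
    return tokens

# The sentence pair of Fig. 1 with a plausible word alignment:
src = ["vorrei", "del", "gelato", "per", "favore"]
tgt = ["i'd", "like", "some", "ice", "cream", "please"]
ali = {0: [0, 1], 1: [2], 2: [3, 4], 3: [], 4: [5]}
tokens = to_bilingual_tokens(src, tgt, ali)
print(" ".join(tokens))
# vorrei|i'd_like del|some gelato|ice_cream per|<eps> favore|please
```

A standard m-gram language model trained on such token sequences then assigns the joint probabilities Pr(f_1^J, e_1^I) used in search.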
Searching for the best target sentence is done in the composition of the input represented as a WFST and the translation transducer Tr.

Coupling the FSA system with ASR is fairly simple since the output of the recognizer, represented as a WFST, can be used directly as input to the machine translation search. For the FSA-based speech translation system the only features used are the acoustic probability from the input word lattice and the translation model probability. The source language model scores are not included, since the joint m-gram translation probability contains dependencies on the predecessor source words and thus serves as a source language model.

Fig. 1. Example of a transformed sentence pair:
vorrei|i'd like  del|some  gelato|ice cream  per|ε  favore|please

3.2. Phrase-Based System

The phrase-based translation system (referred to as PBT) follows a direct modeling approach. Probability distributions are represented as features in a log-linear model. In particular, the translation model probability Pr(f_1^J | e_1^I) is decomposed into several probabilities. The main feature is the phrasal translation lexicon. It is supplemented by single-word-based lexicon probabilities. Lexica from both translation directions are also used. In addition, we include the target language model, as well as word and phrase penalty features to avoid too short or too long translations.

Each feature is scaled by a separate exponent. The scaling factors are optimized iteratively in a minimum error training framework [10] with the Downhill Simplex algorithm, by performing 100 to 200 translations of a development set. The criterion for optimization is an objective machine translation error measure like word error rate or BLEU score.

For speech translation we additionally include the acoustic model probabilities Pr(x_1^T | f_1^J) of the hypotheses in the ASR word lattices and probabilities of the source language model as features. Details are given in [7].
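The log-linear combination of translation and recognition features can be sketched as follows. The feature names, log-probability values, and weights below are purely illustrative; the real features are the models listed above, and the weights come from minimum error rate training.

```python
def loglinear_score(features, weights):
    """Score one (recognition, translation) hypothesis as sum_k w_k * log p_k.

    For speech translation, the features include the acoustic score of the
    lattice path and a source LM score alongside the translation features.
    """
    return sum(weights[name] * logp for name, logp in features.items())

# Illustrative log-probabilities for a single hypothesis (invented numbers):
features = {
    "acoustic":     -120.5,  # from the ASR word lattice
    "source_lm":     -18.2,  # trigram LM over the recognized source words
    "phrase_table":   -9.7,  # phrasal translation lexicon
    "target_lm":     -14.1,
    "word_penalty":   -6.0,  # discourages over-short output
}
weights = {"acoustic": 0.05, "source_lm": 0.4, "phrase_table": 1.0,
           "target_lm": 0.6, "word_penalty": 0.2}
score = loglinear_score(features, weights)
print(score)  # the search keeps the hypothesis with the highest score
```

Minimum error rate training then tunes `weights` to minimize WER (or maximize BLEU) on a development set, e.g. with the Downhill Simplex algorithm.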
When searching for the best translation, the system has to optimize over alternative recognition word sequences f_1^J (as given by the input word lattice), over all possible monotone segmentations of a given recognized sequence into source language phrases, and over all possible translations of these phrases.

The utilization of multiple features and the direct optimization for an objective error measure is the main advantage of this system in comparison to the FSA system. However, the price is a less efficient search, which makes heavy pruning unavoidable.

3.3. Reordering

Appropriate reordering of words and phrases in translation is very important for good performance of MT systems, since there are significant differences in typical word order between most languages (see also [6]). In the case of ASR word lattice input, reordering in search is a complex problem. Here, we present two basic solutions.

In the FSA system, the search is monotone. However, we reorder the words in each sentence of the target training corpus based on the initial word alignment, such that the resulting alignment becomes monotone. Consequently, the resulting translations will have the word order of the source sentence. To fix the wrong word order, we use an idea similar to that described in [2]. Given the best translation sentence, we first permute the words and then compose the resulting permutation automaton with an n-gram target language model in order to select the word order with the highest probability. The computational complexity can be reduced by using constrained permutation automata.

In a recent modification of the PBT system, limited word reordering is possible. While traversing the input lattice, a matched source phrase can be skipped and processed later. This type of reordering helped to improve translation quality, see Section 5.
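The window constraint behind the constrained permutation automata can be sketched as a plain enumeration. This is our own sketch of the constraint, not the automaton construction itself; the system realizes the same idea as a permutation automaton composed with the target n-gram LM.

```python
def constrained_permutations(words, window=3):
    """Yield all permutations under an IBM-style constraint: at every step
    the next output word is chosen among the first `window` not-yet-used
    words of the original order."""
    def rec(remaining, prefix):
        if not remaining:
            yield prefix
            return
        for k in range(min(window, len(remaining))):
            yield from rec(remaining[:k] + remaining[k + 1:],
                           prefix + [remaining[k]])
    yield from rec(list(words), [])

perms = list(constrained_permutations(["a", "b", "c", "d"], window=2))
print(len(perms))  # 8, instead of 4! = 24 unconstrained permutations
```

Each candidate word order is then scored with the target n-gram language model and the highest-probability permutation is kept.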
4. FULL INTEGRATION OF ASR AND MT

As the PBT system is more complicated to integrate with speech recognition search, we only use the FSA system for fully integrated speech translation. We start by representing the static ASR search network as a composition of multiple WFSTs (see also [8]), namely H for the HMM topology, C for the context dependency (CART), L for the lexicon, and G for the language model. As the transducer cascade H ∘ C ∘ L already represents the conditional probability term Pr(s_1^T | f_1^J) for a given HMM state sequence s_1^T, we only need to replace the source language model G by the translation model Tr to get the final optimized speech translation search network ST:

  ST = det(H ∘ det(C^-1)^-1 ∘ det(L ∘ Tr))

The problems faced in the optimized composition are:

- Tr is ambiguous on the input side. This can be solved by adding disambiguation symbols to the input side of Tr, as described in [8] for the lexicon.
- Unknown words, i.e. source words contained in the lexicon but not in the input language of Tr, must be passed to the output language of the translation model Tr. This is performed by preprocessing Tr.

5. EXPERIMENTS

5.1. Corpus Statistics

The speech translation experiments were carried out on three different tasks. Experiments for all tasks were based on bilingual sentence-aligned corpora. Corpus statistics for these tasks are given in Table 1.

The Italian-English Basic Travel Expression Corpus (BTEC) task contains tourism-related sentences usually found in phrase books for tourists going abroad. We were kindly provided with this corpus by ITC-IRST. Speech translation experiments were also performed on a smaller Chinese-English BTEC corpus [1] in the framework of the IWSLT 2005 evaluation campaign [14].
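Both lattice translation (composing the input lattice with Tr, Section 3.1) and the integrated network of Section 4 rest on weighted transducer composition. The toy sketch below shows the bare operation in the tropical (min, +) semiring; it assumes epsilon-free transducers and invented example data, and leaves out the determinization and disambiguation symbols that the real construction needs.

```python
# A transducer here maps state -> list of (in_label, out_label, weight, next_state).

def compose(A, B, start_a, start_b, finals_a, finals_b):
    """Compose A and B: keep paths where A's output labels match B's input
    labels; weights (negative log probabilities) add along the way."""
    arcs, finals = {}, set()
    stack, seen = [(start_a, start_b)], {(start_a, start_b)}
    while stack:
        qa, qb = stack.pop()
        if qa in finals_a and qb in finals_b:
            finals.add((qa, qb))
        for (i1, o1, w1, na) in A.get(qa, []):
            for (i2, o2, w2, nb) in B.get(qb, []):
                if o1 == i2:
                    arcs.setdefault((qa, qb), []).append((i1, o2, w1 + w2, (na, nb)))
                    if (na, nb) not in seen:
                        seen.add((na, nb))
                        stack.append((na, nb))
    return arcs, finals

# A: a one-word "lattice" with an acoustic cost; B: a word translation transducer.
A = {0: [("vorrei", "vorrei", 1.5, 1)]}
B = {0: [("vorrei", "i'd_like", 0.5, 1)]}
arcs, finals = compose(A, B, 0, 0, {1}, {1})
print(arcs[(0, 0)], finals)  # costs add along the matched path: 1.5 + 0.5 = 2.0
```

The best translation then corresponds to the cheapest path from the start pair-state to a final pair-state of the composed machine.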
16 reference translations of the correct transcriptions were available for the BTEC test corpora.

The Italian-English Eutrans II FUB task contains sentences from the domain of hotel help-desk requests. It is significantly smaller than the BTEC task and has evolved from one of the first European-funded speech translation projects. Only a single reference translation is available for the test corpus of this task.

Table 1. Corpus statistics of the speech translation tasks BTEC and Eutrans II.

                                   BTEC                        Eutrans II FUB
                        Italian  English   Chinese  English   Italian  English
Train  Sentences             66107              20000               3257
       Running Words    410275   427402    176199   189927    47681    57663
       Vocabulary        15983    10971      8687     6870     2453     1695
       Singletons         6386     3974      4006     2888      975      519
Test   Sentences               253                506                300
       Running Words      1459     1510      3918     3909     5305     6419
       OOV rate [%]        2.5      0.9       2.3      1.8      2.3      1.3
       ASR WER [%]        21.4        -      42.0        -     23.7        -

5.2. Evaluation Criteria

For the automatic evaluation, we used word error rate (WER), position-independent word error rate (PER), and the BLEU score [11]. The BLEU score measures accuracy, i.e. larger scores are better. On all tasks, training and evaluation were performed using the corpus and references in lowercase and without punctuation marks.

5.3. Translation of Word Lattices

We compare the performance of the transducer-based joint probability system and of the phrase-based system on the BTEC Italian-English task. We consider three translation conditions: translating the single best recognition output, translating ASR word lattices without the acoustic model scores, and including the acoustic model scores in the ASR word lattice in the global decision process.

For the FSA system, a 4-gram translation model was estimated on the bilingual representation of the training corpus for this task. A 4-gram target language model was used in search for the PBT system, as well as to score permutations of the final hypotheses from the FSA system.
In order to include the source language model feature in the PBT system, we extended each word lattice with the scores of a trigram language model.

The objective error measures for the two systems on the BTEC Italian-English task are summarized in Table 2. We observe that exploring the word lattice topology in translation already results in some improvement in translation quality. However, the improvements are more significant when we combine recognition model features with the translation model features. In the case of the FSA system, as mentioned in Section 3, we interpolate the acoustic model score and the translation model score. It is important to optimize the scaling factor for the translation model score. On this task, the scaling factor is 45 and is higher than the usual LM scaling factor in speech recognition.

When using the PBT system, we include both the acoustic model and the source language model score. The language model score is used to model the context dependency for the source language which is captured only within the source phrases of the phrasal lexicon. The scaling factors for the recognition features only, or for translation and recognition features simultaneously, were optimized in the log-linear model on a development set for the word error rate. Table 2 shows the improvements in translation quality on the test set when using optimized scaling factors.

Table 2 also shows that the PBT system not only performs better in terms of absolute error measures, but is also able to achieve a larger relative improvement (8% vs. 5.4% in WER) with the integrated approach of word lattice translation based on log-linear modeling.

Table 2. Translation results [%] on the BTEC Italian-English task. Comparison of the log-linear model approach (PBT) with the WFST-based joint probability approach (FSA).

System  Input:                 WER   PER   BLEU
PBT     single best            32.4  27.2  55.4
        word lattice           31.9  28.0  54.7
        ac. + LM scores        30.6  26.6  56.2
        opt. all factors       29.8  25.8  57.7
FSA     single best            33.4  29.1  52.7
        lattice + ac. scores   31.6  27.6  54.3
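The WER and PER measures used throughout this section can be computed as sketched below. These are the standard definitions, normalized by the reference length; with multiple references one would take the minimum over the references, and the PER formula shown is one common bag-of-words formulation.

```python
from collections import Counter

def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference length."""
    d = list(range(len(hyp) + 1))  # row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # delete r
                                   d[j - 1] + 1,      # insert h
                                   prev + (r != h))   # substitute / match
    return d[len(hyp)] / len(ref)

def per(ref, hyp):
    """Position-independent error rate: errors counted on the bags of words."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)

ref = "i'd like some ice cream please".split()
hyp = "i'd like ice cream some please".split()
print(wer(ref, hyp), per(ref, hyp))  # word-order errors raise WER but not PER
```

The example shows why PER complements WER: a hypothesis with the right words in the wrong order is penalized by WER only.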
5.4. Importance of Reordering

As mentioned in Section 3.3, both of the described speech translation systems can be improved by allowing limited reordering. In the case of the FSA system, target sentences were reordered in training, but the lattice was processed monotonically. After translation, the resulting single best hypotheses were permuted under the IBM reordering constraints with a window size of 3 and scored with a target language model. This further reduced the number of translation errors on the BTEC Italian-English task, as shown in Table 3.

Postponing the translation of a matched phrase and thus allowing limited reordering in the search also helps to improve the performance of the PBT system. However, this improvement is significant only when translating from a language with a word order largely different from English, e.g. Chinese. Local reorderings which are typical for Italian-English translations are already captured in the bilingual phrasal lexicon of the system. Table 4 presents the translation results on the Chinese-English BTEC task. Performing the limited reordering clearly results in better translation quality for both ASR single best output and word lattice translation.

Table 3. Effect of target reordering in training and after translation for word lattice translation on the BTEC Italian-English task (FSA system, results in [%]).

Reordering:  WER   PER   BLEU
none         31.6  27.6  54.3
target       30.6  26.0  55.4

Table 4. Effect of phrase reordering in search on the BTEC Chinese-English task (PBT system, results in [%]).

Reordering:  Translation of:  WER   PER   BLEU
none         single best      62.1  52.7  31.1
             lattice          58.3  48.1  34.1
skip         single best      61.3  51.7  33.1
             lattice          57.7  47.2  35.1

Table 5. Translation results [%] on the Eutrans II FUB Italian-English task.
The last line contains results when directly coupling the speech recognition and machine translation systems by using a single optimized finite-state network.

Input:        WER   PER   BLEU
correct text  29.1  22.1  58.8
single best   37.4  29.1  51.3
word lattice  38.2  29.5  50.2
+ ac. scores  36.6  28.1  52.4
integrated    36.3  28.0  52.6

Table 6. Comparison of speech recognition and speech translation search characteristics for the Eutrans II FUB Italian-English task (AMD Athlon64 2.0 GHz; RTF: real-time factor).

System  # active states  RTF
ASR      1,872           0.35
ST      14,379           1.26

5.5. Fully Integrated Speech Translation

The experiments for fully integrated speech translation were performed on the Eutrans II FUB corpus. For better comparison we generated lattices with different densities. The lattice error rate, i.e. the minimum WER among all paths through the lattice, was as low as 9.1% on average for the largest lattice density of 2098. We optimized the system with respect to both the lattice density and the translation model scaling factor λ simultaneously. In contrast to the results presented in [12], the WER consistently drops with larger lattices and shows a clear minimum for λ = 90 (for comparison: the optimal language model scaling factor for the ASR system is 16). Results of all error measures for the optimal settings are given in Table 5. The target word reordering in training and after translation was performed as described in Section 3.3. Different from [6], we consistently use a trigram language model to generate lattices and a trigram translation model here. The last line of the table shows that the fully integrated system performs better than the system using large lattices, which is further evidence that the error rate does not rise with larger lattice densities.
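The lattice error rate quoted above is the oracle WER: the minimum edit distance between the reference and any path through the lattice. For an acyclic word lattice it can be computed by dynamic programming, as in the simplified sketch below (our own sketch: no epsilon arcs, a single final state, and states assumed to be numbered in topological order).

```python
def lattice_oracle_errors(arcs, start, final, ref):
    """Minimum number of word errors between `ref` and any path through an
    acyclic word lattice (the basis of the lattice error rate).

    `arcs` is a list of (src_state, word, dst_state).
    D[s][i] = cheapest way to reach state s having consumed ref[:i].
    """
    INF = float("inf")
    n = len(ref)
    D = [[INF] * (n + 1) for _ in range(final + 1)]
    D[start] = list(range(n + 1))      # deleting leading reference words
    for (u, word, v) in sorted(arcs):  # topological order via the src state
        row = D[u]
        new = [row[0] + 1]             # inserting `word` before any ref word
        for i in range(1, n + 1):
            new.append(min(row[i] + 1,                          # insert `word`
                           new[i - 1] + 1,                      # delete ref[i-1]
                           row[i - 1] + (word != ref[i - 1])))  # (mis)match
        D[v] = [min(a, b) for a, b in zip(D[v], new)]
    return D[final][n]

# Lattice with two paths, "a b c" and "a x c", against reference "a b c":
arcs = [(0, "a", 1), (1, "b", 2), (1, "x", 2), (2, "c", 3)]
print(lattice_oracle_errors(arcs, 0, 3, ["a", "b", "c"]))  # 0 (oracle path a b c)
```

Dividing the returned error count by the reference length (summed over the test set) gives the lattice error rate.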
Note that although the speech recognition system has a slightly worse WER on this task compared to [5], we obtain a much better speech translation WER.

Table 6 compares the search space of the network described in Section 4, using either the usual language model G or the translation model Tr. In both cases, pruning thresholds were adjusted to be minimal and to not produce search errors. Both static pre-compiled search networks were about the same size, with the speech translation network being slightly bigger. Speech translation had about 7.7 times more active state hypotheses and was slower than speech recognition by a factor of 3.6. This can partly be attributed to the high ambiguity of the translation model, as the same input sentence may have many different translations. On the Eutrans II FUB task, we observed an average of 2.9 target phrases per source word in the bi-language.

6. CONCLUSIONS

In this paper, we gave a short overview of the current research on coupling speech recognition and machine translation. We presented two state-of-the-art speech translation systems which consistently perform better when translating ASR word lattices with acoustic and/or language model scores, or even in a fully integrated speech translation architecture. These improvements are significant and were achieved on several tasks. However, on (large vocabulary) tasks with good ASR performance, the MT performance has yet to be generally improved to avoid translating word sequences which contain recognition errors. Also, the key to the success of speech translation is a closer cooperation of the ASR and MT researchers, who have to agree on common standards for e.g. the lattice structure, definition of vocabulary, segmentation, and other practical interface issues.

7. REFERENCES

[1] Y. Akiba, M. Federico, N. Kando, H. Nakaiwa, M. Paul, and J. Tsujii, "Overview of the IWSLT04 Evaluation Campaign", Proc. IWSLT, pp. 1-12, Kyoto, Japan, September 2004.
[2] S. Bangalore and G. Riccardi, "Finite-State Models for Lexical Reordering in Spoken Language Translation", Proc. Int. Conf. on Spoken Language Processing, vol. 4, pp. 422-425, Beijing, China, 2000.
[3] A. Bozarov, Y. Sagisaka, R. Zhang, and G. Kikui, "Improved Speech Recognition Word Lattice Translation by Confidence Measure", Proc. Interspeech 2005, pp. 3197-3200, Lisbon, Portugal, 2005.
[4] N. Bertoldi, "Statistical Models and Search Algorithms for Machine Translation", PhD thesis, Università degli Studi di Trento, Italy, February 2005.
[5] F. Casacuberta, D. Llorens, C. Martínez, S. Molau, F. Nevado, H. Ney, M. Pastor, D. Picó, A. Sanchis, E. Vidal, and J. M. Vilar, "Speech-to-Speech Translation Based on Finite-State Transducers", Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 613-616, Salt Lake City, UT, 2001.
[6] E. Matusov, S. Kanthak, and H. Ney, "On the Integration of Speech Recognition and Statistical Machine Translation", Proc. Interspeech 2005, pp. 3177-3180, Lisbon, Portugal, 2005.
[7] E. Matusov and H. Ney, "Phrase-based Translation of Speech Recognizer Word Lattices Using Loglinear Model Combination", to appear in Proc. Int. Workshop on Automatic Speech Recognition and Understanding, Cancun, Mexico, 2005.
[8] M. Mohri, F. C. N. Pereira, and M. Riley, "Weighted Finite-State Transducers in Speech Recognition", Proc. ISCA Workshop ASR2000, Paris, France, 2000.
[9] H. Ney, "Speech Translation: Coupling of Recognition and Translation", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1149-1152, Phoenix, AZ, 1999.
[10] F. J. Och, "Minimum Error Rate Training in Statistical Machine Translation", Proc. 41st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 160-167, Sapporo, Japan, July 2003.
[11] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation", Proc. 40th Annual Meeting of the ACL, pp. 311-318, Philadelphia, PA, 2002.
[12] S. Saleem, S.-C. Jou, S. Vogel, and T. Schultz, "Using Word Lattice Information for a Tighter Coupling in Speech Translation Systems", Proc. Int. Conf. on Spoken Language Processing, pp. 41-44, Jeju Island, Korea, 2004.
[13] E. Vidal, "Finite-State Speech-to-Speech Translation", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 111-114, Munich, Germany, 1997.
[14] R. Zens, O. Bender, S. Hasan, S. Khadivi, E. Matusov, J. Xu, Y. Zhang, and H. Ney, "The RWTH Phrase-based Statistical Machine Translation System", to appear in Proc. Int. Workshop on Spoken Language Translation (IWSLT), Pittsburgh, PA, 2005.