Efficient Natural Language Response Suggestion for Smart Reply

of 15

Please download to get full document.

View again

All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
PDF
15 pages
0 downs
2 views
Share
Description
Efficient Natural Language Response Suggestion for Smart Reply MATTHEW HENDERSON, RAMI AL-RFOU, BRIAN STROPE, YUN-HSUAN SUNG, LÁSZLÓ LUKÁCS, RUIQI GUO, SANJIV KUMAR, BALINT MIKLOS, and RAY KURZWEIL, Google
Transcript
Efficient Natural Language Response Suggestion for Smart Reply MATTHEW HENDERSON, RAMI AL-RFOU, BRIAN STROPE, YUN-HSUAN SUNG, LÁSZLÓ LUKÁCS, RUIQI GUO, SANJIV KUMAR, BALINT MIKLOS, and RAY KURZWEIL, Google This paper presents a computationally efficient machine-learned method for natural language response suggestion. Feed-forward neural networks using n-gram embedding features encode messages into vectors which are optimized to give message-response pairs a high dot-product value. An optimized search finds response suggestions. The method is evaluated in a large-scale commercial application, Inbox by Gmail. Compared to a sequence-to-sequence approach, the new system achieves the same quality at a small fraction of the computational requirements and latency. Additional Key Words and Phrases: Natural Language Understanding; Deep Learning; Semantics; 1 INTRODUCTION Applications of natural language understanding (NLU) are becoming increasingly interesting with scalable machine learning, web-scale training datasets, and applications that enable fast and nuanced quality evaluations with large numbers of user interactions. Early NLU systems parsed natural language with hand-crafted rules to explicit semantic representations, and used manually written state machines to generate specific responses from the output of parsing [18]. Such systems are generally limited to the situations imagined by the designer, and much of the development work involves writing more rules to improve the robustness of semantic parsing and the coverage of the state machines. These systems are brittle, and progress is slow [31]. Eventually adding more parsing rules and response strategies becomes too complicated for a single designer to manage, and dependencies between the rules become challenging to coordinate across larger teams. Often the best solution is to keep the domains decidedly narrow. Statistical systems can offer a more forgiving path by learning implicit trade-offs, generalizations, and robust behaviors from data. For example, neural network models have been used to learn more robust parsers [14, 24, 29]. In recent work, the components of task-oriented dialog systems have been implemented as neural networks, enabling joint learning of robust models [7, 26, 27]. However these methods all rely on either an explicit semantic representation or an explicit representation of the task, always hand-crafted. End-to-end systems avoid using hand-crafted explicit representations, by learning to map to and from natural language via implicit internal vector representations [19, 25]. Such systems avoid the unnecessary contraints and bottlenecks inevitably imposed by the system designer. In that context, natural language understanding might be evaluated less in terms of an explicit semantic representation, and more by the utility of the system itself. The system shows evidence of understanding when it offers useful responses. Such end-to-end tasks are difficult: systems not only need to learn language but also must learn to do something useful with it. This paper addresses the task of suggesting responses in human-tohuman conversations. There are further challenges that arise when building an end-to-end dialog Corresponding authors: {matthen, rmyeid, 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. system, i.e. a computer agent that interacts directly with a human user. Dialog systems must learn effective and robust interaction strategies, and goal-oriented systems may need to interact with discrete external databases. Dialog systems must also learn to be consistent throughout the course of a dialog, maintaining some kind of memory from turn to turn. Machine learning requires huge amounts of data, and lots of helpful users to guide development through live interactions, but we also need to make some architectural decisions, in particular how to represent natural language text. Neural natural language understanding models typically represent words, and possibly phrases, sentences, and documents as implicit vectors. Vector representations of words, or word embeddings, have been widely adopted, particularly since the introduction of efficient computational learning algorithms that can derive meaningful embeddings from unlabeled text [15, 17, 20]. Though a simple representation of a sequence of words can be obtained by summing the individual word embeddings, this discards information about the word ordering. The sequence-to-sequence (Seq2Seq) framework uses recurrent neural networks (RNNs), typically long short-term memory (LSTM) networks, to encode sequences of word embeddings into representations that depend on the order, and uses a decoder RNN to generate output sequences word by word. This framework provides a direct path for end-to-end learning [23]. With attention mechanisms and more layers, these systems are revolutionizing the field of machine translation [28]. A similar system was initially used to deployed Google s Smart Reply system for Inbox by Gmail [11]. While Seq2Seq models provide a generalized solution, it is not obvious that they are maximally efficient, and training these systems can be slow and complicated. Also they are derived as a generative model, and so using them to rank a fixed set of responses (as in the context of Smart Reply) requires extra normalization to bias the system away from common responses. In a broader context, Kurzweil s work outlines a path to create a simulation of the human neocortex (the outer layer of the brain where we do much of our thinking) by building a hierarchy of similarly structured components that encode increasingly abstract ideas as sequences [12]. Kurzweil provides evidence that the neocortex is a self-organizing hierarchy of modules, each of which can learn, remember, recognize and/or generate a sequence, in which each sequence consists of a sequential pattern from lower-level modules. Longer relationships (between elements that are far away in time or spatial distance) are modeled by the hierarchy itself. In this work we adopt such a hierarchical structure, representing each sequential model as a feed-forward vector computation (with underlying sequences implicitly represented using n-grams). Whereas a long short-term memory (LSTM) network could also model such sequences, we don t need an LSTM s ability to directly encode long-term relationships (since the hierarchy does that) and LSTMs are much slower than feed-forward networks for training and inference since the computation scales with the length of the sequence. Similarly, the work on paragraph vectors shows that word embeddings can be back-propagated to arbitrary levels in a contextual hierarchy [13]. Machines can optimize sentence vectors, paragraph vectors, chapter vectors, book vectors, author vectors, and so on, with simple back-propagation and computationally efficient feed-forward networks. Putting a few of these ideas together, we wondered if we could predict a sentence using only the sum of its n-gram embeddings. Without the ordering of the words, can we use the limited sequence information from the n-grams, and the redundancy of language, to recreate the original word sequence? With a simple RNN as a decoder, our preliminary experiments showed perplexities of around 1.2 over a vocabulary of hundreds of thousands of words. A lot of the sequence information remains in this simple n-gram sentence representation. As a corollary, a hierarchy built on top of n-gram representations could indeed adequately represent the increasingly abstract sequences underlying natural language. Networks built on n-gram embeddings such as those presented in this paper (see section 4) are computationally inexpensive relative to RNN and convolutional network [6, 30] encoders. To make sure there is enough data and the necessary live feedback from users, we train on the anonymized Gmail data that was used in Kannan et al. [11], and use our models to give Smart Reply response suggestions to users of Inbox by Gmail (see figure 1). Smart Reply provides a real world application in which we can measure the quality of our response suggestion models. Just as in Kannan et al. [11], we consider natural language response suggestion from a fixed set of candidates. For efficiency, we frame this as a search problem. Inputs are combined with potential responses using final dot products to enable precomputation of the response side of the system. Adding deep layers and delaying combination between input and responses encourages the network to derive implicit semantic representations of the input and responses if we assume that the best way to predict their relationships is to understand them. We precompute a minimal hierarchy of deep feed-forward networks for all potential responses, and at runtime propagate only the input through the hierarchical network. We use an efficient nearest-neighbor search of the hierarchical embeddings of the responses to find the best suggestions. 2 PROBLEM DEFINITION The Smart Reply system gives short response suggestions to help users respond quickly to s. s are processed by the system according to the pipeline detailed in figure 2. The decision of whether to give suggestions is made by a deep neural network classifier, called the triggering model. This model takes various features of the received , including a word n-gram representation, and is trained to estimate the probability that the user would type a short reply to the input , see Kannan et al. [11]. If the output of the triggering model is above a threshold, then Smart Reply will give m (typically 3) short response suggestions for the . Otherwise no suggestions are given. As a result, suggestions are not shown for s where a response is not likely (e.g. spam, news letters, and promotional s), reducing clutter in the user interface and saving unnecessary computation. The system is restricted to a fixed set of response suggestions, R, selected from millions of common messages. The response selection step involves searching for the top N (typically around 100) scoring responses in R according to a response selection model P (y x). The output of response selection is a list of suggestions (y 1, y 2,..., y N ) with y i R ordered by their probability. Kannan et al. [11] used a sequence-to-sequence model for P (y x) and used a beam search over the Fig. 1. Our natural language understanding models are trained on data, and evaluated in the context of the Smart Reply feature of Inbox by Gmail, pictured here. new x Trigger suggestions? yes Response selection (y 1,..., y k ) Diversification (y i1,..., y im ) no No Smart Reply suggestions Response set R and clustering Fig. 2. The Smart Reply pipeline. A received is run through the triggering model that decides whether suggestions should be given. Response selection searches the response set for good suggestions. Finally, diversification ensures diversity in the final set shown to the user. This paper focuses on the response selection step. Smart Reply suggestions are shown prefices in R (see section 3). This paper presents a feed-forward neural network model for P (y x), including a factorized dot-product model where selection can be performed using a highly efficient and accurate approximate search over a precomputed set of vectors, see section 4. Finally the diversification stage ensures diversity in the final m response suggestions. A clustering algorithm is used to omit redundant suggestions, and a labeling of R is used to ensure a negative suggestion is given if the other two are affirmative and vice-versa. Full details are given in Kannan et al. [11]. 3 BASELINE SEQUENCE-TO-SEQUENCE SCORING The response selection model presented in Kannan et al. [11] is a long short-term memory (LSTM) recurrent neural network [8] an application of the sequence-to-sequence learning framework (Seq2Seq) [23]. The input x is tokenized into a word sequence (x 1,..., x m ) and the LSTM computes the conditional probability over a response sequence y = (y 1,..., y n ) as: P (y x) = P (y 1,..., y n x 1,..., x m ) = n i=1 P LSTM(y i x 1,..., x m, y 1,..., y i 1 ) where P LSTM is the output of the word-level LSTM. The LSTM is trained to maximize the logprobability according to P (y x) of the training data (a large collection of s and responses, see section 5.1). At inference time, likely responses from the candidate set R are found using a beam search that is restricted to the prefix trie of R. The time complexity of this search is O( x + b y ) where b is the beam width and should be scaled appropriately with R. This search dominates the computation of the original Smart Reply system. 4 FEEDFORWARD APPROACH Rather than learning a generative model, we investigate learning a feedforward network to score potential responses. Recall the goal of response selection is to model P (y x), which is used to rank possible responses y given an input x. This probability distribution can be written as: P (x, y) P (y x) = k P (x, y (1) k) The joint probability of P (x, y) is estimated using a learned neural network scoring function, S such that: P (x, y) e S(x,y) (2) Note that the calculation of equation 1 requires summing over the neural network outputs for all possible responses y k. (This is only an issue for training, and not inference since the denominator is a constant for any given x and so does not affect the arg max over y). This is prohibitively expensive to calculate, so we will approximate P (x) by sampling K responses including y uniformly from our corpus during training: P approx (y x) = P (x, y) K k=1 P (x, y k) (3) Combining equations 2 and 3 gives the approximate probability of the training data used to train the neural networks: e S(x,y) P approx (y x) = K (4) k=1 es(x,y k) The following subsections show several scoring models; how to extend the models to multiple features; how to overcome bias introduced by the sampling procedure; and an efficient search algorithm for response selection. 4.1 N-gram Representation To represent input s x and responses y as fixed-dimensional input features, we extract n- gram features from each. During training, we learn a d-dimensional embedding for each n-gram jointly with the other neural network parameters. To represent sequences of words, we combine n-gram embeddings by summing their values. We will denote this bag of n-grams representation as Ψ(x) R d. This representation is quick to compute and captures basic semantic and word ordering information. 4.2 Joint Scoring Model Figure 3a shows the joint scoring neural network model that takes the bag of n-gram representations of the input x and the response y, and produces a scalar score S(x, y). This deep neural network can model complex joint interactions between input and responses in its computation of the score. 4.3 Dot-Product Scoring Model Figure 3b shows the structure of the dot-product scoring model, where S(x, y) is factorized as a dot-product between a vector h x that depends only on x and a vector h y that depends only on y. This is similar to Deep Structured Semantic Models, which use feedforward networks to project queries and documents into a common space where the relevance of a document given a query is computed as the cosine distance between them [9]. While the interaction between features is not as direct as the joint scoring model (see section 4.2), this factorization allows us to calculate the representation of the input x and possible responses y S(x, y) = Wh S(x, y) = h x T h y h h x h y Ψ(x) Ψ(y) Ψ(x) Ψ(y) (a) A neural network that calculates a score between s and their responses. Rectified Linear Unit (ReLU) layers are used to reduce the (2d)-dimensional concatenation of the bag of n-gram representations to a scalar S(x, y). (b) Dot-product architecture, where a tower of tanh activation hidden layers encodes x to h x and a separate tower encodes y to h y such that the score S(x, y) is the dot-product h x T h y. Fig. 3. Feedforward scoring models that take the n-gram representation of an body and a response, and compute a score. independently. In particular, the representations of the response set R can be precomputed. Then searching for response suggestions reduces to encoding a new x in a simple feed-forward step to the vector h x, and then searching for high dot-product scoring responses in the precomputed set (see section 4.7). It is also efficient to compute the scores S(x i, y j ) for all pairs of inputs and responses in a training batch of n examples, as that requires only an additional matrix multiplication after computing the h xi and h yi vectors. This leads to vastly more efficient training with multiple negatives (see section 4.4) than is possible with the joint scoring model. 4.4 Multiple Negatives Recall from section 4 that a set of K possible responses is used to approximate P (y x) one correct response and K 1 random negatives. For efficiency and simplicity we use the responses of other examples in a training batch of stochastic gradient descent as negative responses. For a batch of size K, there will be K input s x = (x 1,..., x K ) and their corresponding responses y = (y 1,..., y K ). Every reply y j is effectively treated as a negative candidate for x i if i j. The K 1 negative examples for each x are different at each pass through the data due to shuffling in stochastic gradient descent. The goal of training is to minimize the approximated mean negative log probability of the data. For a single batch this is: J (x, y, θ) = 1 K = 1 K K log P approx (y i x i ) i=1 K K S(x i, y i ) log i=1 j=1 e S(xi,yj) (5) using equation 4, where θ represents the word embeddings and neural network parameters used to calculate S. Note that this loss function is invariant to adding any function f(x) to S(x, y), so S(x, y) = Wh h M i=1 hi h x M 1/M i=1 hi x S(x, y) = h T x h y h y M 1/M i=1 hi y S(x i, y) = W i h i h i h i x S(x i, y) = h i xt h i y h i y Ψ(x i ) Ψ(y) (a) Joint scoring model using multiple features of the input x i. A subnetwork scores the response using each feature alone, before the top-level hidden representations h i are concatenated ( M i=1 hi ) and then used to compute the final score. This is an application of the multi-loss architecture from Al-Rfou et al. [2]. i Ψ(x i ) Ψ(y) (b) Dot-product scoring model with multiple input features x i. This is a novel setup of the multi-loss architecture, whereby the featurelevel scores S(x i, y) and the final score S(x, y) are computed as a dot-product between the parallel input and response sides. i Fig. 4. Scoring models that use multiple features of the input . S(x, y) is learned up to an additive term that does not affect the arg max over y performed in the inference time search. 4.5 Incorporating Multiple Features There is structure in s that can be used to improve the accuracy of scoring models. We follow the multi-loss architecture of Al-Rfou et al. [2] to incorporate additional features beyond the message body, for example the subject line. Figure 4 shows the multi-loss architecture applied to both the joint and dot-product scoring models. The multi-loss networks have a sub-network for each feature of the , which are trained to independently score candidate responses using that feature alone. The highest level hidden layer of the sub-network is used in a final sub-network that is trained to combine the information from all the features and give a final score. This hierarchical structure results in models that learn how to use each feature faster than a network that sees all the features at once, and also allows for learning deeper networks than is otherwise possible [2]. Formally, denote the M features of an input x as x 1,..., x M. Then for each i, a subnetwork produces a hidden vector representation h i, and a score of the response y using only x i, S(x i, y). Denoting (x i 1,..., x i K ) as xi, a loss function J (x i, y, θ) encoura
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks