4: New formulations

The dialog understanding module plays a crucial role in enabling chatbots to understand user inputs and trigger the appropriate skills or functionalities. A skill represents a specific capability or task that the chatbot is designed to perform.

Dialog understanding can be broken down into two main components: intent detection and slot filling. Intent detection is responsible for identifying the user’s intended skill based on their input. Slot filling, on the other hand, extracts relevant values from the user’s input to populate the input parameters required for the skill.

For example, if a user inputs “I want to buy two tickets for the 8:00pm showing of Star Wars”, the dialog understanding module would classify this as a request to trigger the “buy_movie_ticket” skill. The slot filling component would then extract the relevant information, such as the quantity (2), the movie (“Star Wars”), and the time (8:00pm), and bind this information to the skill’s input parameters.
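As a rough illustration, the output of dialog understanding for this utterance can be pictured as a skill name plus a mapping from slot names to extracted values. The `SkillCall` class below is a hypothetical container invented for this sketch, not part of any particular framework, and it keeps the raw text span ("two"), on the assumption that normalizing it to the number 2 is a separate step.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SkillCall:
    """Hypothetical container for the output of dialog understanding."""
    skill: str                                           # result of intent detection
    slots: Dict[str, str] = field(default_factory=dict)  # result of slot filling


# Target output for:
# "I want to buy two tickets for the 8:00pm showing of Star Wars"
call = SkillCall(
    skill="buy_movie_ticket",
    slots={"quantity": "two", "showtime": "8:00pm", "movie": "Star Wars"},
)
print(call)
```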

Which formulations we pick to solve these two problems is heavily influenced by the technology that natural language understanding (NLU) research has made available at the time. Given the recent progress in large language models, it is time for us to survey the research and see whether better formulations are available. But first, let’s review what we use today.

Discriminant Text Classification and Sequence Tagging

Before multi-stage training with large language models (LLMs) became popular, the dominant learning paradigm was the discriminant method. Discriminant models are more sample efficient on a per-task basis. The convex optimization problems that result from these models admit a globally optimal solution, which means that, in theory, we can train the model once and arrive at a high-quality solution. Finally, discriminant models are usually shallow in structure, so they are easy on the computation budget. Naturally, discriminant tools in the form of text classification and sequence tagging were used to solve these two dialog understanding (DU) tasks.

For intent classification, we model the probability of the intent given the user’s utterance, P(intent | utterance). For slot filling, we assume that the fillers do not overlap, and we model the probability of an output label sequence given the input sequence, which is simply the tokenized user utterance: P([label…] | [token…]). Each label describes whether a token is outside of any entity (O), or at the beginning or inside of an entity of a given type (B-type or I-type). For example, the label sequence for the above utterance is [O, O, O, O, B-quantity, O, O, O, B-showtime, O, O, B-movie, I-movie], for the token sequence [I, want, to, buy, two, tickets, for, the, 8:00pm, showing, of, Star, Wars].
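To make the B/I/O scheme concrete, here is a minimal sketch of how a predicted label sequence can be turned back into slot values. The function name `decode_bio` is made up for this illustration, and the sketch assumes whitespace tokenization, exactly one label per token, and at most one filler per slot type.

```python
from typing import Dict, List, Optional


def decode_bio(tokens: List[str], labels: List[str]) -> Dict[str, str]:
    """Collect slot values from a BIO-labeled token sequence."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    slots: Dict[str, str] = {}
    current_type: Optional[str] = None  # slot type of the span being built
    current_tokens: List[str] = []

    def flush() -> None:
        """Store the span built so far (if any) and reset the buffer."""
        nonlocal current_type, current_tokens
        if current_type is not None:
            slots[current_type] = " ".join(current_tokens)
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            flush()
    flush()
    return slots


tokens = "I want to buy two tickets for the 8:00pm showing of Star Wars".split()
labels = ["O", "O", "O", "O", "B-quantity", "O", "O", "O",
          "B-showtime", "O", "O", "B-movie", "I-movie"]
print(decode_bio(tokens, labels))
# {'quantity': 'two', 'showtime': '8:00pm', 'movie': 'Star Wars'}
```

Note that the set of slot types is baked into the label vocabulary, so adding a new slot changes the output space of the tagger.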

These discriminant models are rigid: both the model type (such as classification or sequence tagging) and the targets (such as buy_movie_ticket, or showtime, quantity and movie) are frozen during training, so any change after that can only be handled by restarting training from scratch. Problems in industry are not as cleanly defined, and they change all the time: there will be new intents, new slots, and classification errors to fix. Since discriminant models are not as production friendly as we would like, it is time for us to switch to more flexible tools, such as textual entailment and question answering.

Textual Entailment for Intent Detection

Textual entailment is a relatively new task in natural language processing (NLP) that aims to determine whether a statement (referred to as the “hypothesis”) can be logically inferred or implied from another statement (known as the “premise”). Unlike traditional text equivalence models, which focus on bi-directional text similarity, textual entailment models capture a one-way relationship between texts: whether the hypothesis is implied by the premise.

When the Stanford Natural Language Inference (SNLI) Corpus, a popular textual entailment dataset, was first introduced, the baseline accuracy was only around 50%. Since large language models are pretrained on large quantities of longer texts, they are exposed to instances of entailment of all shapes and sizes found on the web, so it is no surprise that LLM-based methods perform well on this task. For example, “Entailment as Few-Shot Learner” from the Facebook AI team reports 93% accuracy even with just a few examples. Granted, the entailment logic presented in SNLI is simple, but the fact that we can reach that level of accuracy without an elaborate labeling effort means that it is time for us to find practical applications for this research task.

Traditional intent classification can only understand the user in a literal sense. But human communication relies heavily on our ability to compute the implied meaning. For example, as a florist, you want to suggest get-well flowers based on the user utterance “My mom is sick”, instead of ignoring the user’s hint. Textual entailment can be used directly to determine which service the user’s input implies: simply treat the user utterance as the premise and each service description as a hypothesis, then pick the service that gets the highest score from the model. In the above example, with a good model, the user utterance “My mom is sick” as premise should result in a high enough score for the hypothesis “The user wants to buy flowers for their mom”. Clearly, textual entailment can simplify downstream modules such as dialog management, and help them provide a natural and effective conversational experience to users.
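The premise/hypothesis recipe above can be sketched in a few lines. Everything here is illustrative: `detect_intent`, the `threshold` parameter, and the skill names are assumptions made for this sketch, and `toy_scorer` is a hand-written lookup table standing in for a real entailment model, which would be plugged in as any function that maps (premise, hypothesis) to a score.

```python
from typing import Callable, Dict, Optional, Tuple

# (premise, hypothesis) -> entailment score in [0, 1]
EntailmentScorer = Callable[[str, str], float]


def detect_intent(
    utterance: str,
    skill_descriptions: Dict[str, str],
    score: EntailmentScorer,
    threshold: float = 0.5,  # assumed cut-off below which no skill is triggered
) -> Tuple[Optional[str], float]:
    """Return the skill whose description is best entailed by the utterance."""
    best_skill, best_score = None, 0.0
    for skill, hypothesis in skill_descriptions.items():
        s = score(utterance, hypothesis)  # utterance is the premise
        if s > best_score:
            best_skill, best_score = skill, s
    if best_score < threshold:
        return None, best_score
    return best_skill, best_score


def toy_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for a real entailment model: hand-written scores only."""
    table = {
        ("My mom is sick", "The user wants to buy flowers for their mom"): 0.9,
        ("My mom is sick", "The user wants to buy a movie ticket"): 0.05,
    }
    return table.get((premise, hypothesis), 0.0)


skills = {
    "buy_flowers": "The user wants to buy flowers for their mom",
    "buy_movie_ticket": "The user wants to buy a movie ticket",
}
print(detect_intent("My mom is sick", skills, toy_scorer))
# ('buy_flowers', 0.9)
```

In this formulation, supporting a new skill amounts to adding one more description to `skills`; the scorer itself does not need to be retrained for the new target.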

Question Answering for Slot Filling

Another relevant NLP task is extractive question answering. Given a passage and a question, the goal is to identify a span in the passage that can answer the given question.

SQuAD (Stanford Question Answering Dataset) is a large-scale dataset that consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding article. LLM-based methods quickly reached human-level performance on SQuAD 1.0. To make the task more practical, SQuAD 2.0 was developed to include questions that do not necessarily have an answer in the provided passage. But the extra requirement to detect the cases where the passage contains no answer did not pose much of a challenge for LLM-based methods: compared to human-level performance at 87% accuracy, state-of-the-art single-model performance is around 90%.

The methods developed for extractive question answering can be directly used to address the slot-filling subtask, where we need to find the user’s choices on the dimensions along which a service can be customized. This can be achieved by using the user’s utterance as the passage and creating a question for each slot to extract the value for that slot. For example, using the user utterance “I like to have two tickets for 8:00pm Star Wars” as the passage, and questions like “what time?”, “which movie?” and “how many tickets?”, we can easily extract the corresponding values for showtime, movie title and quantity with any SQuAD solution. Of course, the result can be sensitive to how each question (the prompt) is worded, so in practice we want something more robust.
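The one-question-per-slot idea can be sketched as follows. The names `fill_slots` and `toy_extractor` are hypothetical; `toy_extractor` uses a hard-coded regular expression per question purely as a stand-in for a real extractive QA model, and it returns `None` when the passage holds no answer, mirroring the unanswerable case in SQuAD 2.0.

```python
import re
from typing import Callable, Dict, Optional

# (question, passage) -> answer span copied from the passage, or None
Extractor = Callable[[str, str], Optional[str]]


def fill_slots(
    utterance: str,
    slot_questions: Dict[str, str],
    answer: Extractor,
) -> Dict[str, str]:
    """Ask one question per slot, using the utterance as the passage."""
    slots: Dict[str, str] = {}
    for slot, question in slot_questions.items():
        span = answer(question, utterance)
        if span is not None:  # slot not mentioned: leave it unfilled
            slots[slot] = span
    return slots


def toy_extractor(question: str, passage: str) -> Optional[str]:
    """Stand-in for a real QA model: one hand-written regex per question."""
    patterns = {
        "what time?": r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b",
        "how many tickets?": r"\b(?:one|two|three|four|\d+)\b(?= tickets)",
        "which movie?": r"Star Wars",
    }
    pattern = patterns.get(question)
    match = re.search(pattern, passage) if pattern else None
    return match.group(0) if match else None


slot_questions = {
    "showtime": "what time?",
    "movie": "which movie?",
    "quantity": "how many tickets?",
}
utterance = "I like to have two tickets for 8:00pm Star Wars"
print(fill_slots(utterance, slot_questions, toy_extractor))
# {'showtime': '8:00pm', 'movie': 'Star Wars', 'quantity': 'two'}
```

Here a new slot is introduced by adding one more entry to `slot_questions`, rather than by changing a label vocabulary.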

Parting Words

For a better developer experience, dialog understanding should be able to handle other tasks as well. For example, when users are asked to confirm their choice, sometimes they will not respond with a simple yes or no, but instead give more elaborate feedback. For instance, when a bot asks a user, “Are you sure you want your burger to be extremely spicy?” the user might reply, “I eat jalapeños raw” instead of saying “yes”. The bot should be able to understand this reply as a confirmation in order to provide an efficient conversational user interface.

Building a separate understanding model from scratch requires labeling a large enough dataset. Doing this for each new skill seriously hinders the widespread adoption of chatbots. Composite models based on LLMs have demonstrated their ability to beat human performance on many tasks in a few-shot setting. Therefore, instead of staying stuck with the old, rigid tools at the expense of developer productivity, it is time for us to look at the research on various NLP tasks and figure out how we can use it to build better conversational user interfaces more affordably.

There are different ways in which LLMs can be utilized, including fine-tuning, prompt engineering, and prompt tuning. What is the right approach for building dialog understanding? Stay tuned for the next installment of this series, where we will explain what we used in OpenCUI and why.


  1. The Stanford Natural Language Inference (SNLI) Corpus
  2. SQuAD2.0, The Stanford Question Answering Dataset
  3. Dialog Understanding Theory
  4. Towards Zero Shot