Posted Oct 7, 2020




Whether you have just dipped your toes into the AI space with your first virtual assistant (chatbot) or have a full team of data scientists managing multiple integrated solutions in production, evaluating your solution can mean the difference between a helpful virtual customer care agent and a bot that is frustrating to use.

Virtual assistants are iteratively developed conversational AI solutions that improve over time as you continuously train the model and increase its vocabulary. This requires both data governance practices and commitment from your team to review the logs, see where there are gaps in responses, and take a good look at the questions your users are actually asking and how they are asking them.

One of the questions we are asked most often is “what metrics can we use to make sure our bot is performing well?” There are several angles to take when evaluating the performance of a chatbot, and some of our top recommendations are outlined below.

1. Usage

From a high level, these metrics show how loved your bot is by your users. Here we can look at the total number of conversations and average conversation length to see how long users are conversing with the solution.

We recommend looking at the average number of messages per conversation rather than the length of the conversation in minutes, as users may browse links and other pages while conversing with the solution. These metrics can be viewed as monthly totals or as a time series to see how they change over time.

This metric is important for understanding how your users are interacting. Are they having long-winded conversations, or are they just popping in, asking for your hours, and carrying on with their day? This can help you shape future conversations and flows. Compared against your website visits or call-centre volumes, these numbers let you estimate the time saved by your customer care agents and derive cost savings from there.
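As a rough illustration, the usage metrics above can be computed from exported message logs along these lines. This is a minimal sketch: the log schema (conversation id, month, sender) and the sample rows are assumptions made up for the example, not the format of any particular chatbot platform.

```python
from collections import defaultdict

# Hypothetical log records: (conversation_id, month, sender).
# Real chatbot logs will use a different schema.
LOG = [
    ("c1", "2020-08", "user"),
    ("c1", "2020-08", "bot"),
    ("c1", "2020-08", "user"),
    ("c1", "2020-08", "bot"),
    ("c2", "2020-08", "user"),
    ("c2", "2020-08", "bot"),
    ("c3", "2020-09", "user"),
    ("c3", "2020-09", "bot"),
    ("c3", "2020-09", "user"),
    ("c3", "2020-09", "bot"),
    ("c3", "2020-09", "user"),
    ("c3", "2020-09", "bot"),
]


def usage_by_month(log):
    """Return {month: (conversation_count, avg_messages_per_conversation)}."""
    # Count messages per conversation, grouped by month.
    counts = defaultdict(lambda: defaultdict(int))
    for conversation_id, month, _sender in log:
        counts[month][conversation_id] += 1

    summary = {}
    for month, per_conversation in sorted(counts.items()):
        total_messages = sum(per_conversation.values())
        n_conversations = len(per_conversation)
        summary[month] = (n_conversations, total_messages / n_conversations)
    return summary


usage = usage_by_month(LOG)
for month, (n_conversations, avg_messages) in usage.items():
    print(month, n_conversations, avg_messages)
```

Keeping the result keyed by month gives both views mentioned above: each entry is a monthly total, and the sorted sequence of entries is the time series.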

2. Confidence

Stated in simple terms, this is a measure of how confident the bot is in the answer it is giving to your users. When we extract conversation logs from our solutions, we also extract the confidence level for each utterance (or question from our users). This lets us see which questions the bot was asked that we weren’t necessarily prepared for or need to improve the training around.

Utterances with low confidence can be added to the training of an already-existing intent or we can create a new intent to answer that question if it is unique.

Looking at the average confidence across all conversations for a period of time gives us a good idea of whether our bot is answering the questions we have trained it to answer, or whether our users are asking novel questions that we may want to include answers for in our solution.
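A minimal sketch of this review, assuming each logged utterance carries the predicted intent and a confidence between 0 and 1. The sample rows, intent names, and the 0.5 cut-off are hypothetical; the right threshold depends on your own assistant.

```python
from statistics import mean

# Hypothetical log rows: (utterance, predicted_intent, confidence in [0, 1]).
UTTERANCES = [
    ("what are your hours", "store_hours", 0.96),
    ("when do you open", "store_hours", 0.88),
    ("can I bring my dog", "store_hours", 0.31),
    ("reset my password", "account_help", 0.92),
    ("do you price match", "account_help", 0.24),
]

LOW_CONFIDENCE_THRESHOLD = 0.5  # assumed cut-off; tune for your own assistant


def confidence_report(rows, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Return (average confidence, utterances below the threshold)."""
    average = mean(confidence for _, _, confidence in rows)
    low = [text for text, _, confidence in rows if confidence < threshold]
    return average, low


average_confidence, review_queue = confidence_report(UTTERANCES)
print(round(average_confidence, 3))
print(review_queue)
```

The low-confidence list is the review queue: each utterance in it is a candidate for an existing intent's training data or for a new intent.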

3. User Satisfaction

Direct user feedback is one of the most important metrics to look at, in my opinion. Even if your usage is good and your confidence levels are acceptable, your users could be having a negative experience with your solution.

Giving your users a way to leave feedback is critical to a successful solution, and there are several ways to do it. You can randomly prompt the user for feedback during the conversation, ask them to rate their satisfaction on a Likert scale at the end of the conversation, provide a comment box, or even email a survey after the interaction for more detailed feedback.

By collecting feedback, you can focus your efforts on the conversations where users had a negative experience and work to understand why they felt that way by reviewing their conversation logs and making improvements as needed.
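One way to turn end-of-conversation ratings into a review list is sketched below. It assumes a 1 to 5 Likert scale and treats scores of 1 or 2 as negative; the conversation ids, ratings, and cut-off are illustrative assumptions only.

```python
from statistics import mean

# Hypothetical end-of-conversation ratings on a 1-5 Likert scale,
# keyed by conversation id.
RATINGS = {"c1": 5, "c2": 2, "c3": 4, "c4": 1, "c5": 4}

NEGATIVE_MAX = 2  # assumed: a score of 1 or 2 counts as a negative experience


def satisfaction_summary(ratings, negative_max=NEGATIVE_MAX):
    """Return (average rating, ids of conversations rated negatively)."""
    average = mean(ratings.values())
    negative = sorted(
        conversation_id
        for conversation_id, score in ratings.items()
        if score <= negative_max
    )
    return average, negative


average_rating, to_review = satisfaction_summary(RATINGS)
print(average_rating)
print(to_review)
```

The returned ids point to the conversation logs worth reading first.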

4. Fallback Rate

Even with a strong focus on making your bot the best it can be, it is highly likely that something unexpected or completely random will be asked at some point. This metric captures how often the bot receives an utterance it cannot understand and marks it as unknown.

This number gives you an idea of how often the bot is getting stuck, and the utterances that were marked as unknown can be used to retrain the bot. Each one can be added to the training of an existing intent, used to create a new intent, or left as unknown because it is out of scope for your solution.
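The fallback rate itself is a simple ratio, as in this minimal sketch. It assumes the logs record `None` as the predicted intent whenever the bot fell back; that convention and the sample rows are hypothetical.

```python
# Hypothetical log rows: (utterance, predicted_intent).
# None as the intent means the bot could not understand the utterance.
ROWS = [
    ("what are your hours", "store_hours"),
    ("asdf qwerty", None),
    ("reset my password", "account_help"),
    ("do you sell gift cards", None),
    ("when do you open", "store_hours"),
]


def fallback_report(rows):
    """Return (fallback rate, list of utterances marked as unknown)."""
    unknown = [text for text, intent in rows if intent is None]
    rate = len(unknown) / len(rows) if rows else 0.0
    return rate, unknown


fallback_rate, unknown_utterances = fallback_report(ROWS)
print(f"{fallback_rate:.0%}")
print(unknown_utterances)
```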

5. K-fold Test

A k-fold cross-validation test is another way to find where the bot is getting confused to improve the training. This metric measures the ground truth consistency (or how much confusion there is in your ground truth).

We use this metric for identifying opportunities to improve consistency in the ground truth and recommend starting with around 5 folds.

In practice, with 5 folds the test identifies potential intent confusions in the training utterances by repeatedly holding out one-fifth of them and treating that fifth as a blind test against a model trained on the rest. For each fold, we can look at the precision of the bot and the percentage of questions answered.
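The procedure can be sketched in plain Python as follows. The word-overlap classifier here is a toy stand-in, not the model a real assistant uses, and the training utterances and intent names are invented for illustration. Folds are taken by striding through the list to keep the example deterministic; a real run would normally shuffle first.

```python
from collections import Counter, defaultdict

# Hypothetical ground truth: (utterance, intent).
TRAINING = [
    ("what are your hours", "hours"),
    ("when do you open", "hours"),
    ("what time do you close", "hours"),
    ("are you open on sunday", "hours"),
    ("what are your opening hours today", "hours"),
    ("reset my password", "account"),
    ("i forgot my password", "account"),
    ("how do i change my password", "account"),
    ("i cannot log in to my account", "account"),
    ("unlock my account", "account"),
]


def train(examples):
    """Toy stand-in model: word counts per intent."""
    model = defaultdict(Counter)
    for text, intent in examples:
        model[intent].update(text.split())
    return model


def predict(model, text):
    """Return the intent sharing the most words, or None if nothing overlaps."""
    scores = {
        intent: sum(counts[word] for word in text.split())
        for intent, counts in model.items()
    }
    if not scores:
        return None
    best = max(sorted(scores), key=scores.get)  # sorted() makes ties deterministic
    return best if scores[best] > 0 else None


def k_fold(examples, k=5):
    """Return one (precision, fraction answered) pair per fold."""
    results = []
    for fold in range(k):
        held_out = examples[fold::k]
        training = [ex for i, ex in enumerate(examples) if i % k != fold]
        model = train(training)
        answered = correct = 0
        for text, gold in held_out:
            guess = predict(model, text)
            if guess is not None:
                answered += 1
                if guess == gold:
                    correct += 1
        precision = correct / answered if answered else 0.0
        results.append((precision, answered / len(held_out)))
    return results


fold_results = k_fold(TRAINING, k=5)
for number, (precision, answered) in enumerate(fold_results, start=1):
    print(f"fold {number}: precision={precision:.2f} answered={answered:.0%}")
```

Folds with low precision point to held-out utterances that the rest of the ground truth pulls toward a different intent, which is where to look for inconsistent labels.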

6. Blind Test

The blind test is another great way to assess the assistant’s performance. Unlike the k-fold test, no separate folds are generated. For this test, a user supplies a file of utterances and their correct mapping to one of the intents (the golden intent). The utterances are run against the existing model to measure whether the predicted intent matches the golden intent.

This makes a great regression test, and it lets you compare the accuracy of your chatbot against its confidence. Intents with high confidence but low accuracy indicate that some utterances are mistrained, and updating those utterances would increase your virtual agent’s accuracy.

As the model changes and you add to your solution, you can rerun this test to make sure it is still performing well and the model isn’t dipping or drifting in accuracy over time.
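A minimal sketch of the accuracy-versus-confidence comparison, assuming the blind test has already been run and each row holds the golden intent, the predicted intent, and the model's confidence. The rows, the grouping by golden intent, and both thresholds are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical blind-test output: (golden_intent, predicted_intent, confidence).
RESULTS = [
    ("hours", "hours", 0.95),
    ("hours", "hours", 0.90),
    ("billing", "billing", 0.85),
    ("billing", "refund", 0.90),
    ("billing", "refund", 0.88),
    ("billing", "billing", 0.81),
    ("refund", "refund", 0.45),
    ("refund", "billing", 0.35),
]

HIGH_CONFIDENCE = 0.8  # assumed threshold
LOW_ACCURACY = 0.7     # assumed threshold


def blind_test_report(results):
    """Return {golden_intent: (accuracy, average confidence)}."""
    by_intent = defaultdict(list)
    for golden, predicted, confidence in results:
        by_intent[golden].append((predicted == golden, confidence))
    return {
        intent: (
            mean(1.0 if hit else 0.0 for hit, _ in rows),
            mean(confidence for _, confidence in rows),
        )
        for intent, rows in by_intent.items()
    }


report = blind_test_report(RESULTS)
# Confident but often wrong: candidates for reviewing the training utterances.
flagged = sorted(
    intent
    for intent, (accuracy, confidence) in report.items()
    if confidence >= HIGH_CONFIDENCE and accuracy < LOW_ACCURACY
)
print(flagged)
```

Saving the report from each run and comparing it with the previous one is one simple way to spot a drop in accuracy after the model changes.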


Our team of data scientists has a wealth of knowledge when it comes to designing, building, deploying, and maintaining conversational AI solutions over time.

These first few metrics just scratch the surface of the analyses you can complete to ensure the robustness and maintainability of your solution over time. We’d love to get in touch to chat about your analytics needs, so feel free to reach out at any time.

Written By:


Mareena is a Data Scientist Consultant at Newcomp Analytics and an Adjunct Lecturer at the University of Toronto where she develops and implements AI-technologies and strives to fuel her students’ passions for technology advancement in healthcare.