Summary of system
Duolingo's notification system is complex and highly personalized. It draws on a pool of pre-written practice-reminder notifications and personalizes them based on factors such as the language being studied and the user's behavior. A custom AI selects the best notification for each user each day, with the goal of keeping learners motivated and on track with their language learning. Under the hood, selection is driven by a set of heuristics built on modified multi-armed bandits, a form of reinforcement learning, tuned to optimize user engagement while also promoting learning. In other words, the system is designed not just to remind users to practice daily but to encourage long-term engagement and learning. Additionally, the app leverages social influence, competition, and teamwork to encourage user engagement.
Basis of Notifications
Duolingo uses notifications to engage users on a daily cadence around the language practice they opted into (and, more recently, mathematics). These notifications are selected by a modified multi-armed bandit (MAB) algorithm that pulls from a set of hand-written templates, which may contain slots for user-specific, personalized information. In their bandit formulation, each round corresponds to a notification sent to a user, and each arm corresponds to a template that can be chosen. Using a MAB to select the right template based on user behavior is common in recommender systems like this one, but Duolingo found success in modifying the algorithm to accommodate two observed properties of their system (a sketch of the template/arm structure follows the list):
- Fresh templates, i.e. templates the user hasn't seen recently, have a higher engagement rate (otherwise the MAB would converge on the single best-performing template in a vacuum).
- Some templates are only appropriate when users are in specific states, with certain conditions met, such as risking the end of their streak (the gamification construct) or having indicated they're using Duolingo to prep for a trip to a country.
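To make the setup concrete, here is a minimal sketch of what a template arm with personalization slots and an eligibility condition might look like. This is not Duolingo's actual code; the names, fields, and example template are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TemplateArm:
    """One bandit arm: a hand-written notification template."""
    arm_id: str
    # Template text with slots for user-specific values.
    text: str
    # Predicate over user state; the arm is "asleep" (ineligible) when this is False.
    is_eligible: Callable[[dict], bool] = lambda user: True

def render(arm: TemplateArm, user: dict) -> str:
    """Fill the template's slots with this user's values."""
    return arm.text.format(**user)

# Hypothetical example: a streak-at-risk template that only applies
# to users who currently have an active streak.
streak_arm = TemplateArm(
    arm_id="streak_risk",
    text="{name}, your {streak}-day streak is at risk!",
    is_eligible=lambda user: user.get("streak", 0) > 0,
)
```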
Duolingo’s modifications
Duolingo modifies the pure multi-armed bandit approach by drawing on three related but distinct problem formulations: the recovering bandits problem, the rested bandits problem, and the sleeping bandits problem. In the recovering bandits problem, an arm's expected reward depends on how many rounds have passed since that arm was last chosen. In the rested bandits problem, an arm's expected reward depends on how many times the arm has been chosen previously. While the authors feel both formulations are relevant, they weight recency (recovering bandits) as more important than cumulative count (rested bandits), since a Duolingo user's history extends over months and years. The sleeping bandits problem accounts for the second observation above: some arms may be ineligible during some rounds, meaning that some arms are conditional.
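A hedged sketch of how these modifications might combine at selection time, building on the TemplateArm sketch above: filter out sleeping arms for this user, discount each eligible arm's base value by how recently it was shown, then sample via softmax. The exponential recovery curve and the parameter values are assumptions standing in for whatever recency function the paper actually fits; the key property is that a just-shown arm scores near zero and recovers toward its base value over subsequent rounds.

```python
import math
import random

def select_template(arms, user, base_value, last_shown_round, current_round,
                    recovery_rate=0.1, temperature=0.2):
    """Pick a template arm: sleeping-bandit filter, recovering-bandit
    discount, then softmax sampling.

    base_value[arm_id]: estimated engagement rate for the arm.
    last_shown_round[arm_id]: round the arm was last chosen for this user.
    recovery_rate and temperature are illustrative knobs, not paper values.
    Assumes at least one arm is eligible for this user.
    """
    # Sleeping bandits: drop arms whose eligibility condition fails right now.
    eligible = [a for a in arms if a.is_eligible(user)]

    scores = []
    for arm in eligible:
        last = last_shown_round.get(arm.arm_id)
        if last is None:
            freshness = 1.0  # never shown: fully fresh
        else:
            # Recovering bandits: reward recovers with rounds since last shown.
            freshness = 1.0 - math.exp(-recovery_rate * (current_round - last))
        scores.append(base_value[arm.arm_id] * freshness)

    # Softmax over scores, so non-maximal arms still get explored.
    weights = [math.exp(s / temperature) for s in scores]
    total = sum(weights)
    return random.choices(eligible, weights=[w / total for w in weights])[0]
```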
An interesting difference between Forward's engagement and Duolingo's: many of Duolingo's daily lesson completions, which are the whole purpose of the notification system, happen organically. (It's impossible to know the true causality, or to fully distinguish organic engagement from notification-driven engagement, since someone might see the notification but not click through; this introduces noise.) Forward's engagement, by contrast, is majority inorganic: we push someone and they engage with the app to accomplish the action.
Some of their learnings around experimentation and sample size leading to odd/degenerate behavior feel very relevant, though. For large test groups, the authors generally saw 90% of arms perform within +/-3.4% relative difference, normally distributed. For smaller-sample arms, only 50% of arms performed in that range, with 38% showing relative differences of +/-5% or more, which led to strange behavior in their bandit algorithm. Essentially, since our test groups are going to be small, we should be cautious of false-positive/false-negative reads on a specific type of notification or template. The authors used Bayesian estimation, shrinking each arm's estimate toward a prior, to accommodate this.
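One standard way to implement that kind of shrinkage is a Beta-Binomial posterior mean, which pulls small-sample arm estimates toward a shared prior. This is a minimal sketch under that assumption; the paper's exact prior and model may differ, and the parameter values below are illustrative:

```python
def shrunk_engagement_rate(clicks, sends, prior_rate=0.05, prior_strength=200):
    """Posterior-mean estimate of an arm's engagement rate under a Beta prior.

    The prior acts like prior_strength pseudo-sends at the pooled rate
    prior_rate, so arms with few sends stay close to the prior while
    well-sampled arms converge on their own observed data.
    """
    alpha = prior_rate * prior_strength + clicks
    beta = (1 - prior_rate) * prior_strength + (sends - clicks)
    return alpha / (alpha + beta)

# A 3-for-30 arm (raw rate 10%) is pulled most of the way back toward the
# 5% prior, while a 300-for-3000 arm barely moves.
print(shrunk_engagement_rate(3, 30))      # ~0.057
print(shrunk_engagement_rate(300, 3000))  # ~0.097
```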
Takeaways:
- Freshness and variance of content matter quite a bit when pursuing a goal, especially a homogeneous, repetitive one. If we need a user to log weight or do something similarly repetitive, varying how we phrase that prompt could increase success.
- Personalization based on user behavior will increase success. For example, if someone habitually engages at specific times of day over others, tailoring our system to anchor to that behavior will increase engagement.
- Pooling templates for engagement centrally and injecting customized content into them is a successful model for engaging users across different use cases, leaning on psychological principles rather than just the content itself as a reason to engage.
- Adopting a holdout set where an arm is chosen at random, combined with a softmax selection policy, accommodates adding new templates with no penalty to the system at large (see the sketch below).
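As referenced in the last takeaway, a hedged sketch of that policy, reusing the TemplateArm arms from earlier. The holdout probability and temperature are illustrative, not values from the paper: with small probability pick a uniformly random arm, otherwise sample from the softmax, so a brand-new template with no history still receives some traffic.

```python
import math
import random

def select_with_holdout(arms, scores, holdout_prob=0.05, temperature=0.2):
    """Random-holdout over a softmax policy.

    With probability holdout_prob, choose uniformly at random so new
    templates (and a clean baseline for measurement) always get traffic;
    otherwise sample proportionally to exp(score / temperature).
    """
    if random.random() < holdout_prob:
        return random.choice(arms)  # uniform holdout: no penalty for new arms
    weights = [math.exp(scores[a.arm_id] / temperature) for a in arms]
    return random.choices(arms, weights=weights)[0]
```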