Amazon Prime Video - Using Bloom filters to reduce duplicates on the Prime Video homepage.

Came across this blog post about a senior Principal Engineer who was made fun of by his wife about duplicate titles on the Prime Video homepage. It explains how, as engineers, we see so many issues with the products we build that we become blind to them. Sometimes that external stimulus, a customer anecdote, forces us to reevaluate our baselines and build the solutions we always should have built.

This is a great example of how optimising each widget individually comes at a cost: popular titles can enter multiple widgets that aggregate information via different machine-learning algorithms. Someone needed a high-level view to consider the overall experience from the customer's point of view. These issues can be hard to find with the limited data available at launch, and they only manifest as features evolve over time. We can only hope to anticipate such issues proactively; it is better to be pragmatic and build the best solution when we see them.

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered usernames. The price we pay for this efficiency is that the filter is probabilistic, which means there may be some false positives. A false positive means the filter might tell you that a given username is already taken when it actually is not.
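
To make this concrete, here is a minimal sketch of a Bloom filter in Python. This is my own illustration, not code from the blog post; in particular, salting a SHA-256 digest with the hash-function index is just one simple way to derive the k bit positions:

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: m bits, k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m                           # number of bits in the filter
        self.k = k                           # number of hash functions
        self.bits = bytearray((m + 7) // 8)  # bit array, packed into bytes

    def _positions(self, item: str):
        # Derive k bit positions by salting a SHA-256 digest with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# The username-availability check from the example above:
registered = BloomFilter(m=1024, k=3)
registered.add("geeks")
print("geeks" in registered)  # True -- it was definitely added
print("nerd" in registered)   # False, barring a rare false positive
```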

Interesting Properties of Bloom Filters

  • Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements.
  • Adding an element never fails. However, the false-positive rate rises steadily as elements are added, until every bit in the filter is set to 1, at which point all queries return a positive result (the rate is quantified in the sketch after this list).
  • Bloom filters never generate false negatives, i.e., they will never tell you that a username doesn't exist when it actually does.
  • Deleting elements from the filter is not possible, because clearing the bits at the k indices of one element may also clear bits shared with other elements. For example, if we delete "geeks" by clearing bits 1, 4 and 7, and "nerd" happens to share bit 4, the filter will afterwards claim that "nerd" is not present.
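
That growth of the false-positive rate can be quantified: with m bits, k hash functions and n inserted elements, the standard approximation is p ≈ (1 - e^(-kn/m))^k, and the k that minimises p is (m/n)·ln 2. A quick sketch (the sizes are illustrative, not from the blog):

```python
import math

def false_positive_rate(m: int, k: int, n: int) -> float:
    # Standard approximation: p ~ (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> int:
    # The k that minimises p for a given m and n: k = (m/n) * ln 2
    return max(1, round((m / n) * math.log(2)))

m, n = 1024, 100                         # 1 Kib of bits, 100 titles
k = optimal_k(m, n)                      # ~7 hash functions
print(k, false_positive_rate(m, k, n))   # under 1% false positives
```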

Evaluating the Result

A Bloom filter is a good structure here because we are fine with occasional false positives. The use case also bounds how much data enters the filter (the number of titles across N pages, in this case), which keeps the false-positive rate under control and lets us perform filtering with a fixed-size token passed between pages (see the sketch after the list below). Talking to Hani made me realise the following:

  • Existing solutions in Amazon retail were not applicable, because Prime Video was trying to solve the problem across page loads.
  • Page loads per customer is a hard metric to instrument, but going beyond 4 pages is something customers rarely do.
  • The impact of increasing the packet sizes we transmit is huge: once we cross the boundary of a single packet, delays and error rates all suffer.
  • Solving a problem for 100% of customers is impossible at Amazon scale, so if we can solve it for 99% without impacting the others, that is better than not solving it for anyone.
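
Reusing the BloomFilter sketch from above, here is roughly how a fixed-size filter token could flow across page loads. The function name, the base64 token format, and the sizes are my own invention; the blog does not describe Prime Video's actual implementation:

```python
import base64

def render_page(candidate_titles, token=None, m=1024, k=7):
    # Rebuild the filter from the token carried over from the previous page.
    seen = BloomFilter(m, k)
    if token:
        seen.bits = bytearray(base64.b64decode(token))

    shown = []
    for title in candidate_titles:
        if title in seen:   # probably shown on an earlier page: skip it
            continue
        seen.add(title)
        shown.append(title)

    # The updated filter travels to the next page load as a fixed-size
    # token: 1024 bits is 128 bytes (~172 base64 characters), no matter
    # how many titles have been added.
    next_token = base64.b64encode(bytes(seen.bits)).decode()
    return shown, next_token

page1, token = render_page(["The Boys", "Reacher"])
page2, token = render_page(["Reacher", "Fallout"], token=token)
print(page2)  # ['Fallout'] -- "Reacher" was filtered as a likely duplicate
```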

Optimising for a better experience without impacting all other customers is a hard problem, and it requires coordination between multiple teams running game days. Lots to learn and evaluate from this example. On to the next one tomorrow.