At MoPub, we serve billions of ad requests every day, so our engineering team regularly encounters interesting challenges in their work. These challenges have pushed us to experiment with new technologies, optimize our stack, and in some cases design entirely new systems. We are excited to share our engineers’ journey with our readers. Today, we are open-sourcing a project called MoMonitor. It has completely transformed how we do production monitoring at MoPub.
MoMonitor is a Django app that allows for the creation of generic health checks. It also provides a simple WebUI for managing, creating, and alerting on checks in your production environment. It was originally created to replace MoPub’s Nagios setup.
How is MoMonitor different?
– Create powerful checks/alerts from a simple WebUI. Previous monitoring solutions that we have used at MoPub made modifying checks too much effort. For example, updating a check in Nagios requires editing Nagios configuration files inside of a Chef cookbook, and then restarting the Nagios processes. This is not conducive to an ephemeral, cloud-based, scaling infrastructure. With MoMonitor, anyone can update checks right from the dashboard.
– Easily create new types of checks/alerts. MoMonitor provides developers with a generic interface for implementing new types of checks. To add a new type of check, all you need to do is implement the ServiceCheck Django model (provided in the code). At MoPub, we have already implemented six types of checks, each of which provides a different monitoring perspective on our infrastructure. To demonstrate how to create a check, we’ve put together a short video.
– Centralize visualization and alerts. MoMonitor acts as a centralized aggregation system for visualizing and alerting on metrics. As simple as this sounds, features like these are not easily available in our other monitoring systems. MoMonitor will show you (1) the history of each check, (2) when the check was last run, (3) whether it succeeded or failed, and (4) why. When checks fail, MoMonitor will alert you via either PagerDuty or email. In addition, various options exist for tuning your alert configuration to help you minimize the number of false-positive alerts.
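To give a flavor of what a generic check interface can look like, here is a minimal sketch in plain Python. The names used here (ServiceCheck, update_status, CompareCheck) are illustrative only and are not taken from MoMonitor’s actual code, which implements checks as Django models.

```python
# Hypothetical sketch of a generic check interface -- names are
# illustrative, not MoMonitor's actual API. Plain classes are used
# instead of Django models to keep the example self-contained.

STATUS_GOOD, STATUS_BAD = "good", "bad"

class ServiceCheck:
    """Base class: subclasses implement _run() and return (status, reason)."""
    def __init__(self, name):
        self.name = name
        self.last_status = None
        self.last_reason = None

    def _run(self):
        raise NotImplementedError

    def update_status(self):
        """Run the check and record the result for the dashboard/alerting."""
        self.last_status, self.last_reason = self._run()
        return self.last_status

class CompareCheck(ServiceCheck):
    """Example check type: fails when an observed value exceeds a threshold."""
    def __init__(self, name, fetch_value, threshold):
        super().__init__(name)
        self.fetch_value = fetch_value  # callable that returns the metric value
        self.threshold = threshold

    def _run(self):
        value = self.fetch_value()
        if value > self.threshold:
            return STATUS_BAD, "value %s above threshold %s" % (value, self.threshold)
        return STATUS_GOOD, "ok"

check = CompareCheck("queue depth", fetch_value=lambda: 42, threshold=100)
print(check.update_status())  # -> good
```

The point of the pattern is that each new check type only needs to supply its own `_run` logic; history, scheduling, and alerting can live in the base class.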
Conception of MoMonitor
We take uptime and system health very seriously, so a couple of months ago we decided to invest in revising our monitoring system. Our goal was to put enough instrumentation in place that we could quickly detect issues in our production environment, while putting the onus on the engineers responsible for each service to keep checks current and coverage complete. At the time, production monitoring was exclusively handled by our operations team and was implemented using Nagios.
This wasn’t optimal for several reasons:
Configuring checks was a hassle. To update a check in Nagios, we needed someone to update our Nagios Chef cookbook, run chef-client on one or more machines, and then verify that the check was working correctly. On top of that, there are multiple configuration files to update. We found this incredibly inefficient.
Nagios didn’t give us the transparency we required for our checks. While Nagios does include a dashboard, we found it did a poor job of conveying why checks were failing and how often.
We needed a solution that would give engineers ownership over their systems’ production checks. We found it unscalable to have our Operations team understand every aspect of the infrastructure, including application logic. We needed a solution that would address the issues above; however, it also needed to be generic enough to support several different types of checks.
“Boat loads of metrics”
We already had a substantial amount of metrics to work with. We run collectd and diamond for system level metrics and statsd for application level metrics; these metrics are piped into Graphite for real-time graphing and data accumulation. We also use Sensu for running local checks on all of our machines. While we had more than enough data to report on, we found it difficult to visualize, organize, and make sense of it all.
To work with Graphite data, we created a MoMonitor check that would integrate with the Umpire API. This API allowed us to specify health thresholds for all of our Graphite metrics. At first, we explicitly defined these thresholds, but this didn’t work well because traffic at peak times was so much higher than during quiet hours. To solve this, we implemented dynamic thresholding, which attempts to learn what the threshold should be over time. We found this to work very well.
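To sketch what such an integration might look like: Umpire exposes an HTTP check endpoint that answers with a 200 status when a Graphite metric is within the given bounds. The host name, exact parameter names, and the dynamic-threshold formula below are assumptions for illustration, not MoMonitor’s actual implementation.

```python
# Hypothetical sketch of querying Umpire for a Graphite metric threshold.
# The host, parameter names, and threshold formula are assumptions.
from urllib.parse import urlencode

UMPIRE_HOST = "http://umpire.example.com"  # hypothetical Umpire deployment

def umpire_check_url(metric, max_value, range_seconds=300):
    """Build an Umpire /check URL asking whether `metric` stayed below
    `max_value` over the last `range_seconds` seconds."""
    params = urlencode({"metric": metric, "max": max_value, "range": range_seconds})
    return "%s/check?%s" % (UMPIRE_HOST, params)

def interpret_response(status_code):
    """Umpire answers 200 when the metric is healthy; treat anything else
    as a failing check."""
    return "good" if status_code == 200 else "bad"

def dynamic_threshold(recent_values, tolerance=1.25):
    """One simple way to 'learn' a threshold over time: allow some headroom
    (here 25%) above the recent average instead of a fixed number."""
    return tolerance * sum(recent_values) / len(recent_values)
```

With a dynamic threshold, the `max` parameter is recomputed from recent history before each check run, so the same check stays meaningful at both peak and quiet hours.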
We use Sensu for machine-level checks to replace similar functionality in Nagios. To integrate Sensu, we used the new Aggregates API that was released in version 0.9.9. This API allows you to determine which machines are failing certain Sensu checks, aggregated across an entire cluster. For example, we have a Sensu check to make sure all of our production machines in a particular cluster are running the same version of code; the Aggregates API will tell us which machines are out of sync.
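As a rough illustration of consuming an aggregate result: the JSON payload below is hand-written to resemble a per-client aggregate for a code-version check, not real Sensu API output, and the helper function is ours, not part of Sensu.

```python
# Sketch of picking failing machines out of a Sensu aggregate result.
# The payload is a hand-written example, not real Sensu API output.
import json

sample_payload = json.dumps({
    "ok": 2,
    "critical": 1,
    "total": 3,
    "results": [
        {"client": "web-01", "status": 0, "output": "sha abc123"},
        {"client": "web-02", "status": 0, "output": "sha abc123"},
        {"client": "web-03", "status": 2, "output": "sha def456"},
    ],
})

def failing_clients(payload):
    """Return the clients whose check result is non-zero (not OK)."""
    aggregate = json.loads(payload)
    return [r["client"] for r in aggregate.get("results", []) if r["status"] != 0]

print(failing_clients(sample_payload))  # -> ['web-03']
```

A check built on this only needs to alert when the failing list is non-empty, and it can include the list in the alert body so you immediately know which machines are out of sync.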
Use MoMonitor yourself!
We believe MoMonitor will help the community because it both provides an intuitive WebUI for managing your checks and a generic check type interface for implementing new types of checks.
To say the least, we are extremely happy with our results. MoMonitor has become an everyday tool at MoPub that all engineers depend on for production monitoring. When we encounter undetected issues in our production environment, we now have a tool that lets us easily create new checks so that the same issue never goes undetected again.
We are actively working on MoMonitor to further improve it and add new features. We’d love any feedback you have, and we’re happy to help you understand how it works. Check out the source code and the resources we’ve put together to help you get started: