I looked at the Boundary dashboard for one of our production application servers and was shocked to see a staging database in the list of nodes receiving traffic! For the past several weeks we had been finalizing our new data intake pipeline, with the most recent effort focused on monitoring and alerting. We had installed several different tools to look at system stats and custom application usage, one of which was Boundary. The system we were upgrading broke the direct dependency on our main read/write path by placing a high-throughput proxy buffer between our product data store and our intake systems.
Like most modern services, the install was trivial: we just added a recipe to our Chef server and, poof, our nodes began to show up immediately. The UI is decent, good enough for some tough situations so far, but the real-time data, lagging only one second behind, is incredible.
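For anyone unfamiliar with Chef, the install really is about this small: the agent recipe gets appended to the run list of a role that the relevant nodes already carry. A minimal sketch, assuming a hypothetical boundary-meter cookbook and an app_server role (both names are mine for illustration, not necessarily what we used):

```json
{
  "name": "app_server",
  "json_class": "Chef::Role",
  "chef_type": "role",
  "description": "Production application servers",
  "run_list": [
    "recipe[base]",
    "recipe[boundary-meter]"
  ]
}
```

On the next chef-client run, every node holding the role converges the new recipe and the meter starts reporting.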
However, when we went to deploy our pipeline into production, Boundary was invaluable, turning what could have been a 3-4 day process of troubleshooting some gnarly configuration issues into something that took only a few hours.
Here are two bugs we solved with Boundary:
1. Before we finalized the pipeline, we had deployed a proof-of-concept / alpha version into the production environment to see how it would hold up. Our plan was to build the new pipeline, divert the traffic, and drain the old pipeline; once empty, it would be shut down. However, when we turned on the new feed, only some of the messages were coming into the pipeline. Our engineers immediately hunkered down to see where we were dropping messages. While they were digging through logs and configs, I flipped open Boundary and looked at the farthest component upstream in our system, an HAProxy load balancer.
I saw that traffic was being sent to both the old and new pipelines because Chef had pulled the new servers into the existing application pool: we had given them the same Chef role name! In case you are not familiar with Chef, suffice it to say this would have taken a long time to determine, rather than the 2 minutes it took me to detect (see the sketch after this list). Removing the old nodes from HAProxy then completed the task and we were again on our way.
2. Later, we had all the data flowing through our intake system, but somehow the data was not reaching our customer-facing product data store. We began the arduous process of looking for where in the system we were dropping messages. In a distributed system, combing through logs, interpreting errors, and checking configs can be quite time consuming.
Again, I checked Boundary and this time noticed that our production system was sending traffic to our staging database. Errggggh, that’s not good. One recently completed process (which had never run in the test pipeline) had been deployed with the staging DB configuration (sketched below). The fix was quick, five minutes, but the find was quicker. I can only imagine how long this would have taken to figure out, since there were no errors being generated and our application monitoring stats were incrementing as expected!
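To make the first bug concrete: our HAProxy pool membership came from Chef, so new nodes carrying the old role name were silently enrolled in the old backend. Roughly what the generated config looked like, with hypothetical role, host, and address values of my own invention:

```
# haproxy.cfg backend, populated from the nodes matching a Chef
# search such as: search(:node, "role:intake_app")
backend intake_app
    balance roundrobin
    server intake-old-01 10.0.1.11:8080 check  # old pipeline
    server intake-old-02 10.0.1.12:8080 check  # old pipeline
    server intake-new-01 10.0.2.21:8080 check  # new pipeline node,
                                               # pulled in by the shared role name
```

Giving the new nodes their own role (and pulling the old nodes out of the pool) restored a clean split of traffic.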
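And the second bug in a nutshell: a freshly deployed process shipped pointing at the staging database, which fails silently because the writes succeed; they just land in the wrong place. A hypothetical before/after of the offending setting (hostnames invented for illustration):

```json
{
  "database": {
    "host": "db-staging.internal",
    "port": 5432
  }
}
```

versus what production should have carried:

```json
{
  "database": {
    "host": "db-prod.internal",
    "port": 5432
  }
}
```

No errors, no alerts; only the traffic map gives it away.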
Ok, enough cheerleading. Boundary is a great platform, but it is only one part of the tool suite we use to understand what our production systems are doing. Now, if I could only get NewRelic to reach that one-second frequency, we’d really be cooking…