  • Green cards and metrics

    Posted on August 20th, 2014

    Quick story… last November I got married, went on honeymoon, and when returning to the US (via Puerto Rico), realized that I had left my green card at home (guess I had other things on my mind). Several hours and multiple hundreds of dollars later, I was allowed to leave immigration at PR and return home – having been told that they had to cancel my green card because they could not let me into the country without it.

    Now, I won’t begin to discuss the preposterous nature of this situation. You have all my records on file – my iris scan, my fingerprints, my passport, my countless trips out of the country over the last 13 years (and of course, many other things that I don’t even know about). I was even a member of the Global Entry Trusted Traveller program (after going through many background checks). But no, because I did not have that little piece of plastic the whole system broke down.

    I won’t bore you with the ensuing details apart from to say that it is now 9 months later, I have been to the immigration office many times and STILL don’t have my green card. Which means that every time I travel, I have to allow an extra hour or two to come through immigration, because they always refer me to secondary, then eventually someone looks me up on a computer (imagine that) and says “OK, you’re free to go”. My family now comes through immigration separately from me because they are fed up with waiting.

    Long story but I mention this purely because I was planning to stay in SF tonight with my wife/daughter but realized that I had an appointment with immigration early tomorrow morning in San Jose so instead I am at my house on my own, writing this post.

    It may of course all be a plan by our investors because when I am home alone, I tend to work and then work some more.

    Tonight it is metrics. I happen to love numbers – my kids think I am strange because of that, and the rest of my family often refers to me with strange nicknames. But… and here is my question…

    I would like to procure a simple-to-use SaaS solution for metrics collection and reporting, with very easy-to-implement connectors to the common tools that we use for our business… salesforce.com, Totango, Pardot, QuickBooks, Recurly, Desk.com etc.

    I don’t want a downloadable Windows package (SiSense), I don’t want to spend a small fortune (Domo), and I don’t want something where I build my own connectors (GoodData) – please can I just get something that works, and wake me from my Excel nightmare!

    Add a comment or email me… I’ll personally send a bottle of wine to anyone who recommends something that I end up actually using.


  • Customer success is all inclusive

    Posted on August 19th, 2014

    I will use this blog to chat about life as the CEO of a startup. It will give insights into what really goes on, which is often (always?) very different from the marketing rhetoric.

    It might not always be enlightening, the posts might not always be the best written, and you might think it a complete waste of words, but it will be real and honest.

    One of my areas of focus right now is to ensure that everyone at Boundary is constantly thinking about how to improve our customer experience.

    When I discuss this with others on our team, they tell me that we must create defined programs and actions for individuals to take. I know that they are correct, and I’m fortunate to be working with people that can implement my ranting, but what I really want is for everybody that works at Boundary to have this as an underlying philosophy in everything they do. I don’t want us to wait to be asked; I want everyone to be proactive… see something that can be improved? Then take action.

    We want our customers to never need to speak to us; of course we love communicating with our customers and we are constantly seeking feedback (how else do you learn?) but we want our products to be completely intuitive and always provide the answers to questions that the customer needs.

    Our customer success team works from the principle that we should never be asked the same question more than once. Either the product experience should be improved to ensure the question doesn’t need to be asked or docs should be updated. “How to” questions from customers are a huge opportunity to improve.

    But the other area, one that I think might come as a surprise to some, is that this is an all-inclusive philosophy. It doesn’t matter whether you work in engineering, marketing, sales, customer success, product management, finance, operations, HR or anywhere else: every single person at Boundary can impact how our customers perceive us, and therefore we each have a responsibility to play our part.

    A great product followed by incorrect invoicing can leave a bad taste. Misleading content on our web site can get the relationship off on the wrong foot. A support rep that commits to “get back to you tomorrow” and then takes off for the weekend is frustrating and annoying. A customer success rep that doesn’t return your email quickly makes you feel like a low priority.

    A customer said to me once “I know you must actually be really busy, but never once have you made us feel that you have anything else to do that is more important than we are”.

    That’s how I want our customers to feel.


  • Free Monitoring <3

    Posted on August 7th, 2014

    When we talk to our customers, there are a few things we’ve heard above all else over the last several months: folks wanted a free offering, they wanted host-level and generic metrics, and they wanted it all to be dead simple to set up and use. We listened – which is why we’re excited by this week’s release of free server monitoring. And early feedback has been fantastic.

    And it’s only going to get better from here. Sign up today and get your 10 free servers.

    10 Servers Free = Monitoring <3


  • Erlang MapReduce Queries, MultiFetch and Network Latency with Riak

    Posted on June 25th, 2014

    I know, you’re looking at your calendar; let me be the first to assure you it’s not 2011. I recently had the need to write some Erlang MapReduce queries for Riak and it was a bit of an adventure. The Riak MapReduce documentation is good but generally focused on JavaScript. If you’re using Riak it’s quite possible you’ve never had the need to use its MapReduce capabilities. We hadn’t really used it at Boundary before I dug into some performance problems, and it’s probably not regarded as one of Riak’s strengths. With that said, though, it’s a nice feature and was worth some investigation.


    Slow code

    To provide a bit of context let me first describe the performance problem I was investigating. Boundary customers were experiencing poor response time from a service that is responsible for managing metadata for Boundary meters. The service is called Metermgr and it’s a webmachine/OTP application that relies on Riak for persistence and exposes meter metadata with a REST interface.

    I noticed that as the set of meters for an organization grew, there appeared to be a simple regression in a certain query’s response time. For queries with as few as 200 keys, response time was between 2 and 4 seconds. After taking a look at the code I was able to pinpoint the cause of the slowdown to a function called multiget_meters. Unfortunately this function didn’t multiget anything; rather, it iteratively fetched the meters one by one, oof.
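    The original snippet isn’t shown in this copy of the post, but a minimal sketch of the pattern described – one blocking GET per key using the official Erlang client, with illustrative names – would look something like:

    ```erlang
    %% Illustrative sketch only -- not the actual multiget_meters source.
    %% One blocking Riak GET per key, issued serially, so total latency
    %% grows with the number of keys times the round-trip time.
    multiget_meters(Pid, Bucket, Keys) ->
        [begin
             {ok, Obj} = riakc_pb_socket:get(Pid, Bucket, Key),
             riakc_obj:get_value(Obj)
         end || Key <- Keys].
    ```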



    Anyway, my initial thought was, “I’ll just use MultiFetch.”

    Does Riak support MultiFetch/MultiGet?

    If you’re familiar with the more popular Riak clients or search around the internet for “riak multiget”, you might get the impression that Riak supports retrieving multiple values in a single HTTP or Protocol Buffers request, sometimes referred to as “multiget” or “multifetch”.

    Unfortunately that’s not the case; take a look at the source and you’ll see that Riak itself doesn’t support these capabilities. Rather, some Riak clients provide this functionality by parallelizing a set of requests and coalescing the results. The riak-java-client is one such example.
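    The parallelize-and-coalesce idea is simple enough to sketch in Erlang. This is my own illustration, not code from any client: FetchFun stands in for a real fetch (say, a wrapper around riakc_pb_socket:get/3 with its own connection), so the coalescing logic is self-contained.

    ```erlang
    %% Sketch of client-side "multifetch": issue the fetches in parallel,
    %% then collect the replies back in input order, the way the
    %% riak-java-client does. FetchFun is a parameter so this sketch does
    %% not depend on a live Riak cluster.
    parallel_fetch(FetchFun, Keys) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, FetchFun(Key)} end),
                    Ref
                end || Key <- Keys],
        %% Selective receive on each Ref preserves the input ordering.
        [receive {Ref, Result} -> Result end || Ref <- Refs].
    ```

    Note that a single riakc_pb_socket connection is a gen_server and serializes its requests, so a real implementation would likely need a pool of connections to actually gain any parallelism.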



    Having had experience with the Java client, I incorrectly assumed that the official Erlang client had a similar implementation, but if you check out the source you’ll notice it doesn’t support MultiFetch. I did a bit of archeology and found there are a lot of posts with questions and requests around implementing multifetch in the Riak Erlang client. Most of these posts point the user towards using MapReduce. The most useful thread I could find on the subject can be found here; not surprisingly, it is entitled multi-get-yet-again!


    MapReduce in Riak

    Implementing MultiFetch in Erlang wouldn’t be too difficult, but several users reported very good performance using the MapReduce approach, with only two caveats:

    1. I heard MapReduce in Riak is slow (hearsay etc…).
    2. MapReduce queries in Riak clusters are run with R=1.

    Unfortunately the latter is a serious problem and I would like to see it addressed, but for now let’s disregard it as it’s outside the scope of this discussion. It’s fine, take him outside and show him the pool, get him a cookie, he’ll be fiiiiiiine, etc….

    The MapReduce docs on Basho’s website are pretty good, but there’s a lot of data to sift through in order to find the most relevant pieces of information to get started quickly. After doing so, though, I’m pleased to say using Erlang MapReduce queries with Riak is quite easy, and there are really only two important pieces of information you need to know to get started.

    1. Riak has built-in Erlang MapReduce functions, and you can use these to address many common use cases. You should learn how to use these first.
    2. You can write custom Erlang MapReduce functions, but you need to compile and distribute the object code to all Riak nodes.

    As noted in the docs, the basic MapReduce function riakc_pb_socket:mapred/3 takes a client, a list of {Bucket, Key} tuples as input, and a list of Erlang Queries. Let’s dig into the Query a bit more; it looks like the following:

    {Type, FunTerm, Arg, Keep}

    Type - an atom, either map or reduce
    FunTerm - a tuple:
      for built-in functions: {modfun, Module, Function}
      for custom functions: {qfun, Fun}
    Arg - a static argument (any Erlang term) passed to each execution of the phase
    Keep - true | false - include this phase's results in the final value of the query


    The examples in the documentation focus heavily on writing your own qfun queries, though as I mentioned you can’t just use qfun without some upfront work; the documentation notes:

    [Screenshot of the Basho documentation: custom Erlang functions must be compiled and the object code made available on every Riak node before they can be used in a qfun phase.]

    In addition, there is another paragraph, in the section called “A MapReduce Challenge”, that states:

    [Screenshot of the “A MapReduce Challenge” section of the Basho MapReduce documentation.]

    In summary, if you want to write custom MapReduce queries in Erlang, you need to compile and distribute your code to the Riak nodes. I’ve gotten so comfortable using erl as a REPL that I glossed over this and assumed I could simply pass function references and they’d be evaluated. If you don’t take the time to read and fully understand the documentation, you might skim past those qfun requirements and just start writing your own custom queries, like me and this guy. Combine that with the fact that qfun MapReduce error messages are generally quite opaque, and you can end up with a bit of frustration when getting started.

    I’d prefer the documentation break out the difference between built-in and qfun queries more clearly, and focus on modfun examples initially with a separate qfun section, preferably with a big red callout yelling “Hey Dummy, don’t try this yet”. The JavaScript MapReduce API doesn’t suffer from this limitation, of course, because it’s JavaScript, interpreted via the SpiderMonkey JS engine that ships with Riak. Perhaps that, and the recent popularity of JavaScript, is why it is given much more attention in the docs.


    Simulating MultiFetch with Built-In MapReduce Queries

    So, back to the point: it’s best we understand the built-in queries before we go any further. Here’s a quick walk-through of the default map functions that are provided.

    map_identity - returns a list of the riak_object stored at each bucket/key
    map_object_value - returns a list of the values stored at each key (calls riak_object:get_value(RiakObject))
    map_object_value_list - calls riak_object:get_value(RiakObject), assumes get_value returns a list, and returns a single merged list
    

    There are reduce phases as well, but to achieve multifetch-like capabilities we only need to concern ourselves with the map_object_value map function. We can achieve our original multifetch use case by substituting a single mapred call using map_object_value for the iterative one-by-one fetch.
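    A hedged sketch of what that substitution looks like (the function and variable names here are mine, not from the original code):

    ```erlang
    %% Sketch: simulate multifetch with one built-in MapReduce query --
    %% a single round trip regardless of the number of keys. Requires a
    %% live riakc_pb_socket connection, so illustrative only.
    mapred_fetch(Pid, Bucket, Keys) ->
        Inputs = [{Bucket, Key} || Key <- Keys],
        %% Built-in map phase: return the value stored at each key.
        Query = [{map, {modfun, riak_kv_mapreduce, map_object_value}, none, true}],
        {ok, [{0, Values}]} = riakc_pb_socket:mapred(Pid, Inputs, Query),
        Values.
    ```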

    As expected, a quick set of tests against the production cluster showed we’d reduced the query from 2–4 seconds down to an acceptable (albeit not blazingly fast) average of approximately 115 milliseconds.


    Comparing to MultiFetch in Java

    These results of course got me thinking about how Erlang mapred would perform compared to MultiFetch in Java on the JVM, so I decided it was worth gathering some data. I constructed a test for 20, 200, and 2000 keys (this is not a benchmark) and ran each of the 3 tests 100 times, gathered samples, and calculated the average and variance. I ran the tests on a server in the same data center and on the same broadcast domain as the Riak cluster. As expected, MultiFetch outperformed mapred, and the latency of MultiFetch (as noted by Sean Cribbs and the Riak documentation) was more predictable.

    [Chart: response time in ms, where network latency ranges between 0.1 and 0.4 ms]

    As the number of keys increased by orders of magnitude, query response time became less predictable with both approaches, though MapReduce’s variance was greater. Many raw MapReduce samples fell within ~600ms, but there were also several samples between ~900ms and ~1400ms.


    When might MapReduce be faster?

    This had me wondering: are there any situations where MapReduce might be preferable to MultiFetch, or should I always just use MultiFetch? MultiFetch seems to be the prevailing approach, it’s what most clients implement, and even Basho sometimes seems reticent about suggesting the use of MapReduce. I decided to run the same set of tests, but this time from Metermgr running locally on my laptop, connecting to the production Riak cluster over the VPN.

    [Chart: response time in ms, where network latency ranges between 100 and 300 ms]

    While the results are somewhat expected, they are interesting nonetheless. Initially, with a key set of 20, MultiFetch overcomes the added network latency and outperforms MapReduce, but as the key set grows by an order of magnitude the average MapReduce query time outperforms MultiFetch by a factor of 2. Variance remains higher with MapReduce, since adding network latency doesn’t change the variance we already experienced at sub-millisecond latency.

    We all know situating your application servers near your database is important for performance, but in an age of “hosted this and that”, “PaaS” and “DBaaS”, as a developer you may end up using a database or service where network latency becomes a factor. In the example above, the MultiFetch approach compounds network latency as the input set grows, whereas MapReduce takes that hit only once – hence the improved average response time.

    I would of course be remiss if I didn’t mention that Boundary is an exemplary tool to monitor performance of such different techniques and can provide 1 second resolution of average response time for Riak Protocol Buffer queries whether they are within the same data center or across the internet.

    Where to go from here?

    Well, I’ve got a solution for my performance problem that meets my near-term needs. I’m interested in digging into alternative clients to see if a MultiFetch implementation for Riak exists in Erlang; if I don’t find one I like, I will write my own. I also believe it’s incorrect to say “MapReduce in Riak is slow”; in fact, under certain input constraints and configurations it is not only acceptable but preferable to the MultiFetch approach, if latency predictability is not too much of a factor. The problem is more nuanced than “should I use MapReduce”, and it’s more abstract than MapReduce and Riak. It is about read techniques and their performance within the constraints of a distributed system. There are problems and there are tools; we need to use the right tools to solve a problem in certain situations.

    I’m looking forward to digging into more custom Erlang queries and can already envision situations where Riak MapReduce might be favorable. Finally, if you’re using Riak but haven’t dug into custom MapReduce queries because you’re not comfortable with Erlang, then it’s about time you learn you some.

    Special thanks to @pkwarren for peer review; without his grammatical support this post would be unreadable.


  • Web-Scale IT – “I know it when I see it…”

    Posted on May 27th, 2014

    Recently at Boundary, we’ve been talking a lot about “Web-Scale IT”. One of the first questions we usually get is, “What exactly is Web-Scale IT?” Cameron Haight of Gartner first coined this term in a 2013 blog post and said, “What is web-scale IT? It’s our effort to describe all of the things happening at large cloud services firms such as Google, Amazon, Rackspace, Netflix, Facebook, etc., that enables them to achieve extreme levels of service delivery as compared to many of their enterprise counterparts.”

    But when we answer this, we are tempted to fall back on cliché. In a famous opinion in the 1964 case of Jacobellis v. Ohio, Justice Potter Stewart wrote:

    “I shall not today attempt further to define the kinds of material I understand to be (pornography)…But I know it when I see it…”

    That’s how we feel about Web-Scale IT, we have a hard time defining it, but we know it when we see it!

    We see it when we walk into an enterprise and hear more about the cloud than the datacenter.  We see it where release cycles are measured in weeks versus quarters.  We see it when tools like Chef are used for deployment.  We see it when we are talking to the head of DevOps.  Where there are sprints but not waterfalls.  Where the team is talking about continuous deployment, provisioning instances and open source components instead of next year’s release, hardware acquisition, and packaged software. When we see these things, we know we are seeing Web-Scale IT happening.

    The funny thing is, we see Web-Scale IT everywhere we look. From the newest start-ups to the most conservative enterprises. Web-Scale IT is not just for the Amazons, Googles and Netflixes of the world. We see it at Fortune 500 insurance companies, health care companies and manufacturers. At media companies, SaaS start-ups and service providers. In enterprises of every shape, size and flavor.

    Gene Kim, commenting on the adoption of DevOps in the enterprise, recently wrote in the CIO Journal,

    “The important question is why are they embracing something as radical as DevOps, especially given the conservative nature of so many enterprises? I believe it is because the business value of adopting DevOps work patterns is even larger than we thought. And those not transforming their IT organizations risk being left behind, missing out on one of the most disruptive and innovative periods in technology.”

    We couldn’t agree more. The confluence of Cloud, DevOps, Open Source, and competitive pressure has put us at a crossroads in the history of Information Technology. Web-Scale IT lets us build better applications, faster. It lets us change them more quickly. And it lets us scale them more cost-effectively and with more agility.

    There is no doubt in our mind that Web-Scale IT is here to stay. But Web-Scale IT is not without its challenges. One of these challenges is ensuring high levels of service quality and delivery. Boundary’s customers are some of the leading adopters of Web-Scale IT, whether they call it that or not. We are excited to provide them a critical service that helps them successfully cope with the challenges of operating in this new, compelling environment, allowing them to anticipate and solve problems faster, and to keep up with the pace of application and infrastructure changes that are typical of Web-Scale implementations.

    So while it might not be easy to define Web-Scale IT, we know it when we see it, we are seeing it everywhere, and we are doing our best to help our customers make it deliver on its huge promise.


  • A “Quantum Theory” of IT Monitoring

    Posted on May 20th, 2014

    There are certain things which are true in the quantum world but just make no sense in our reality. I remember, in a college advanced physics course, having to calculate the likelihood that a baseball thrown at a window will pass through, emerge on the other side, and leave both the ball and the window intact, due to quantum effects and tunneling. I was astonished to see that while the odds of this happening are infinitesimally small, they are not zero. Never mind the fact that you’d have to continuously throw the ball at the window, not accounting for breakage, for longer than the universe has existed to even have a remote chance of observing this; the odds are not zero and can be calculated. And at the sub-atomic level, not the physical-object one, this type of behavior isn’t just common, it is expected. This small fact has stuck with me for decades as a great illustration of how odd the quantum world truly is.

    What then does that possibly have to do with IT Monitoring?  It might be a stretch, but I think the new world of applications, which we call Web-Scale, is in some ways as strange to traditional monitoring products as the world of Quantum behavior is to baseballs, windows and normal humans.

    Let me explain. In the past, we built applications that were not quite so sensitive to small changes in infrastructure performance, for two main reasons. First, our users had very low expectations. From batch, to time sharing, to PC networks, to early web applications, we became accustomed to waiting for a screen to advance, an hourglass to spin, a web page to update. But somewhere along the way these last couple of years, our expectations changed. Movies stink when they stall, missed stock quotes can cost us real money, and we voraciously hang on our phones and tablets for real-time updates of everything from sporting events to natural disasters, to pictures and updates from loved ones, to new orders from customers.

    Second, we just got tired of the standard practice of over-provisioning data centers for peak loads, running at 50% capacity or less to ensure performance.  Despite falling hardware costs, our appetites for data and applications just kept growing.  So we virtualized everything, and when we tapped out the efficiency there, just like we stopped building power plants at office buildings decades ago, we went to the cloud, where we could “scale” on demand, and share the economies of scale of computing experts.

    Yet while the entire infrastructure changed, and the costs of performance delays and degradations increased, we happily kept monitoring things every five minutes or so, or even every hour, checking for the same things we used to: capacity, resource utilization, and the like. Yet today users scream and customers leave over 5-second delays. Outages of streaming information cost us money. Our “quantum” of time we care about has shrunk dramatically to match the needs of the new application infrastructure, applications and user expectations. We live in a real-time world, yet we continue to monitor our last architecture.

    Which brings me to another engineering theorem deep in my memory: the Nyquist–Shannon sampling theorem, which, in its simplest form, says that in order not to lose information, the sampling frequency you measure at needs to be at least 2x as fast as the event you want to capture. Any slower, and your reconstructed signal suffers from “aliasing”, or loss of information.
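    In symbols: a signal whose fastest component has frequency f_max can only be reconstructed without aliasing if the sampling frequency f_s satisfies

    ```latex
    f_s \geq 2 f_{\max}
    ```

    so to faithfully capture behavior that plays out over a second or two, you need samples at least every second.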

    Today’s Web-Scale IT architecture and demanding users care about changes and delays that last a few seconds, sometimes even less. If our quantum of caring is now measured in a second or two, Nyquist, and common sense, say we had better be capturing and processing monitoring data every second or so as well.

    Last-generation IT monitoring solutions simply CAN’T capture and process data fast enough. They can stare all day at the baseball, but it will never tunnel through the window. But unlike our quantum baseball example, the slow sampling of infrastructure monitoring data leaves us blind to events we actually care about: stalled video, missed quotes, lost business opportunities, service delays and outages that cost us money.

    Our new math of IT monitoring needs to measure in seconds; it’s as plain and simple to see as the shattered window I am staring at right now.


  • Dynamic Tuple Performance On the JVM

    Posted by on May 15th, 2014

    There are lots of good things about working on the JVM, like the HotSpot JIT, operating system threads, and a parallel garbage collector. However, one limiting factor can often be the interaction between primitive types and reference types in Java. Primitive types are the built-in types that represent integral numbers, floating point numbers, and boolean yes/no values. Primitives are memory efficient: they get allocated either on the stack if they’re being used in a method, or inlined in an object when they’re declared as field members. They also wind up being fast because the JIT can often optimize their access down to a single CPU instruction. This works really well when you know what types a class will hold as its state beforehand. If, on the other hand, you don’t know what an object or array will hold at compile time, the JVM forces you to box primitives. Boxing means that the primitives get wrapped in a heap-allocated object, and their container will hold a reference to them. That type of overhead winds up being inefficient in both access time and memory space. Access time suffers because this layout breaks locality of reference. The extra allocations and garbage generated also put pressure on the JVM’s garbage collector, which can often be a cause of long pause times.
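The boxing penalty is easy to see in a small, self-contained example (illustrative only; the class and values here are made up):

```java
public class BoxingDemo {
    // Field types known at compile time: the longs live inline in the
    // object, with no per-value heap allocation.
    static class Point { long x; long y; }

    public static void main(String[] args) {
        Point p = new Point();
        p.x = 1000;
        p.y = 2000;

        // A container typed at runtime forces boxing: each long gets
        // wrapped in a heap-allocated java.lang.Long.
        Object[] tuple = new Object[2];
        tuple[0] = 1000L;   // autoboxed
        tuple[1] = 2000L;

        System.out.println(tuple[0] instanceof Long);     // true
        // Outside the small cache (-128..127), each boxing operation
        // allocates a distinct wrapper object:
        System.out.println(tuple[0] == (Object) 1000L);   // false, different instances
        System.out.println(tuple[0].equals(1000L));       // true, same value
    }
}
```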

    We wrote FastTuple to try and help solve this problem. FastTuple generates heterogeneous collections of primitive values and ensures, as best it can, that they will be laid out adjacently in memory. The individual values in the tuple can be accessed from a statically bound interface, via an indexed accessor, or via reflective or other dynamic invocation techniques. FastTuple is designed to deal with a large number of tuples, so it will also attempt to pool tuples such that they do not add significantly to the GC load of a system. FastTuple is also capable of allocating the tuple value storage entirely off-heap, using Java’s direct memory capabilities.

    FastTuple pulls off its trick via runtime bytecode generation. The user supplies a schema of field names and types. That schema is then built into a Java class definition which will contain accessor methods and either field definitions or the memory address of an off-heap allocation, depending on which storage method was requested. The resulting Java class gets compiled into bytecode and loaded as a reflective Class object, which can then be used to create instances of the new class.
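As a rough sketch of the idea (this is not FastTuple’s actual generator, just a simplified illustration), turning a schema into Java source might look like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch: build Java source for a tuple class from a schema of
// field names and primitive types. A real implementation would then compile
// and load this source at runtime.
public class TupleSourceGenerator {
    public static String generate(String className, Map<String, String> schema) {
        StringBuilder src = new StringBuilder();
        src.append("public class ").append(className).append(" {\n");
        // Field definitions laid out as plain primitive members.
        for (Map.Entry<String, String> f : schema.entrySet()) {
            src.append("  private ").append(f.getValue())
               .append(' ').append(f.getKey()).append(";\n");
        }
        // Getter and setter for each field.
        for (Map.Entry<String, String> f : schema.entrySet()) {
            String name = f.getKey(), type = f.getValue();
            src.append("  public ").append(type).append(' ').append(name)
               .append("() { return ").append(name).append("; }\n");
            src.append("  public void ").append(name).append('(')
               .append(type).append(" v) { this.").append(name).append(" = v; }\n");
        }
        src.append("}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("a", "long");
        schema.put("b", "int");
        schema.put("c", "short");
        System.out.println(generate("GeneratedTuple", schema));
    }
}
```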

    Performance

    To understand the performance of FastTuple, it’s necessary to have a good understanding of the relative cost of things on the JVM. To that end, we wrote a microbenchmark in FastTuple to demonstrate the relative cost of writing and then reading several fields on a container, whether that be a Java object, an array, or a List. The code can be found here, and we’d love for you to run it on your own. The timings shown here are from a late-2013 MacBook Pro with a 2.6 GHz Intel Core i7, running the 1.8.0_05-b13 build of Java 8.

    public long testClass() {
        Container container = new Container(0, 0, (short)0);
        container.a = 100;
        container.b = 200;
        container.c = 300;
        return container.a + container.b + container.c;
    }
    c.b.t.AccessMethodBenchmark.testClass     thrpt     1676855.039 ops/ms
    
    public long testLongArray() {
        long[] longs = new long[3];
        longs[0] = 100L;
        longs[1] = 200;
        longs[2] = (short)300;
        return longs[0] + longs[1] + longs[2];
    }
    c.b.t.AccessMethodBenchmark.testLongArray thrpt     1691027.650 ops/ms

    This is our baseline. If there’s a way to write to memory faster than this in Java, I don’t know of it. When we look at the assembly that eventually gets emitted by the JIT, it looks like this:

      0x000000010524e482: mov    0x60(%r15),%rax
      0x000000010524e486: lea    0x20(%rax),%rdi
      0x000000010524e48a: cmp    0x70(%r15),%rdi
      0x000000010524e48e: ja     0x000000010524e508
      0x000000010524e494: mov    %rdi,0x60(%r15)
      0x000000010524e498: mov    0xa8(%rdx),%rcx
      0x000000010524e49f: mov    %rcx,(%rax)
      0x000000010524e4a2: mov    %rdx,%rcx
      0x000000010524e4a5: shr    $0x3,%rcx
      0x000000010524e4a9: mov    %ecx,0x8(%rax)
      0x000000010524e4ac: xor    %rcx,%rcx
      0x000000010524e4af: mov    %ecx,0xc(%rax)
      0x000000010524e4b2: xor    %rcx,%rcx
      0x000000010524e4b5: mov    %rcx,0x10(%rax)
      0x000000010524e4b9: mov    %rcx,0x18(%rax)    ;*new  
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@0 (line 160)
    
      0x000000010524e4bd: movabs $0x64,%r10
      0x000000010524e4c7: mov    %r10,0x10(%rax)    ;*putfield a
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@15 (line 161)
    
      0x000000010524e4cb: movl   $0xc8,0xc(%rax)    ;*putfield b
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@22 (line 162)
    
      0x000000010524e4d2: mov    $0x12c,%esi
      0x000000010524e4d7: mov    %si,0x18(%rax)     ;*putfield c
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@29 (line 163)
    
      0x000000010524e4db: movabs $0x258,%rax
      0x000000010524e4e5: add    $0x70,%rsp
      0x000000010524e4e9: pop    %rbp
      0x000000010524e4ea: test   %eax,-0x225a3f0(%rip)        # 0x0000000102ff4100
                                                    ;   {poll_return}
      0x000000010524e4f0: retq

    The assembly is helpfully annotated with the corresponding Java source in comments. The preamble is taking care of the allocation; the actual field writing takes only a handful of instructions; and then it cheats on the return side, moving the value 600 directly into RAX as the return value. The assembly emitted for testLongArray is almost identical, except it doesn’t cheat at computing the return value.

    Next down the performance ladder is manipulating off-heap memory using a built-in Sun JVM class called Unsafe.

    public long testOffheapDirectSet() {
        unsafe.putLong(record2 + 0L, 100);
        unsafe.putInt(record2 + 8L, 200);
        unsafe.putShort(record2 + 12L, (short)300);
        return unsafe.getLong(record2 + 0L) + unsafe.getInt(record2 + 8L) + 
               unsafe.getShort(record2 + 12L);
    }
    testOffheapDirectSet         thrpt      948934.710 ops/ms
    
    public long testOffheapAllocateAndSet() {
        long record = unsafe.allocateMemory(8 + 4 + 2);
        unsafe.putLong(record, 100);
        unsafe.putInt(record+8, 200);
        unsafe.putShort(record+12, (short)300);
        long r = unsafe.getLong(record) + unsafe.getInt(record+8) + 
                 unsafe.getShort(record+12);
        unsafe.freeMemory(record);
        return r;
    }
    testOffheapAllocateAndSet    thrpt        7604.148 ops/ms


    So what’s going on here? In the first test, all we’re doing is setting the memory for our three “fields” in a chunk of memory that was allocated outside of the benchmark. In the second test, we’re doing the actual allocation in addition to setting the memory. The performance disparity can be explained by the way Unsafe is implemented: everything in Unsafe is native C++, but some of its methods are what are known as intrinsics. On the JVM, an intrinsic is more or less a macro that gets replaced with inlined assembly. This allows for native and potentially unsafe operations without the substantial overhead of making a JNI call.
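For reference, the unsafe handle used in the benchmarks above cannot be constructed directly; the usual idiom (and likely, though not certainly, what the benchmark does) is to pull the singleton out via reflection:

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class UnsafeDemo {
    // Unsafe has no public constructor; the singleton is typically grabbed
    // via reflection. (Common idiom, shown here as an assumption rather
    // than the benchmark's exact code.)
    static final Unsafe unsafe;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            unsafe = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        // 14 bytes off-heap: a long (8) + an int (4) + a short (2),
        // the same record layout as the benchmark above.
        long record = unsafe.allocateMemory(8 + 4 + 2);
        unsafe.putLong(record, 100);
        unsafe.putInt(record + 8, 200);
        unsafe.putShort(record + 12, (short) 300);
        long sum = unsafe.getLong(record) + unsafe.getInt(record + 8)
                 + unsafe.getShort(record + 12);
        unsafe.freeMemory(record);   // off-heap memory is manually managed
        System.out.println(sum);     // 600
    }
}
```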

    Unfortunately, Unsafe.allocateMemory is not an intrinsic, so it incurs the full overhead of a JNI call. This explains the performance disparity between testOffheapAllocateAndSet and testOffheapDirectSet. The performance difference between testOffheapDirectSet and bare field manipulation, however, is a bit more subtle. It’s true that the calls to putLong and friends get inlined, but the JIT cannot optimize them to the same degree as the raw Java code.

      0x0000000110b3ba29: and    $0x1ff8,%edi
      0x0000000110b3ba2f: cmp    $0x0,%edi
      0x0000000110b3ba32: je     0x0000000110b3baaa  ;*getstatic unsafe
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@0 (line 136)
    
      0x0000000110b3ba38: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@4 (line 136)
    
      0x0000000110b3ba3c: movabs $0x64,%rdi
      0x0000000110b3ba46: mov    %rdi,(%rax)
      0x0000000110b3ba49: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@19 (line 137)
    
      0x0000000110b3ba4d: movabs $0x8,%rdi
      0x0000000110b3ba57: add    %rdi,%rax
      0x0000000110b3ba5a: mov    $0xc8,%ebx
      0x0000000110b3ba5f: mov    %ebx,(%rax)
      0x0000000110b3ba61: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@36 (line 138)
    
      0x0000000110b3ba65: movabs $0xc,%rbx
      0x0000000110b3ba6f: add    %rbx,%rax
      0x0000000110b3ba72: mov    $0x12c,%edx
      0x0000000110b3ba77: mov    %dx,(%rax)
      0x0000000110b3ba7a: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@53 (line 139)
    
      0x0000000110b3ba7e: mov    (%rax),%rsi
      0x0000000110b3ba81: mov    %rax,%rdx
      0x0000000110b3ba84: add    %rdi,%rdx
      0x0000000110b3ba87: mov    (%rdx),%edi
      0x0000000110b3ba89: add    %rbx,%rax
      0x0000000110b3ba8c: movswl (%rax),%eax
      0x0000000110b3ba8f: movslq %edi,%rdi
      0x0000000110b3ba92: add    %rdi,%rsi
      0x0000000110b3ba95: movslq %eax,%rax
      0x0000000110b3ba98: add    %rax,%rsi
      0x0000000110b3ba9b: mov    %rsi,%rax
      0x0000000110b3ba9e: add    $0x50,%rsp
      0x0000000110b3baa2: pop    %rbp
      0x0000000110b3baa3: test   %eax,-0x2479a9(%rip)        # 0x00000001108f4100
                                                    ;   {poll_return}
      0x0000000110b3baa9: retq

    It’s unclear at this point which technique would win in a real-life scenario. However, if you’re storing a massive amount of data off-heap, it is likely that the GC savings will more than pay for any performance degradation in accessing the data.
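As an aside (not from the original post), the JDK’s fully supported route to off-heap storage is a direct ByteBuffer; the same 14-byte record looks like this, at the cost of bounds checking on every access:

```java
import java.nio.ByteBuffer;

// Sketch of the same off-heap record using a direct ByteBuffer instead of
// Unsafe: a long at offset 0, an int at offset 8, a short at offset 12.
public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer record = ByteBuffer.allocateDirect(8 + 4 + 2);
        record.putLong(0, 100L);
        record.putInt(8, 200);
        record.putShort(12, (short) 300);
        long sum = record.getLong(0) + record.getInt(8) + record.getShort(12);
        System.out.println(sum);   // 600
    }
}
```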

    The good news is that these tests appear to be our baseline for manipulating on heap and off heap memory within the JVM. Using these numbers we can reason about the overhead of the various layers of abstraction that are being introduced, and what the right tradeoff is for your application.

    Memory  Alloc     Access        Throughput (ops/ms)
    Direct  Allocate  N/A                6956.274
    Direct  Deque     Indexed          146534.498
    Direct  Pool      Eval              49921.211
    Direct  Pool      Indexed           55483.808
    Direct  Pool      IndexedBoxed      36165.749
    Direct  Pool      Iface             55885.570
    Direct  Prealloc  Eval             314968.430
    Direct  Prealloc  Indexed          367886.412
    Direct  Prealloc  IndexedBoxed     102979.196
    Direct  Prealloc  Iface            347002.180
    Heap    Allocate  N/A              962680.613
    Heap    Deque     Indexed          170232.606
    Heap    Pool      Eval              49065.286
    Heap    Pool      Indexed           60376.541
    Heap    Pool      IndexedBoxed      38744.961
    Heap    Pool      Iface             60029.537
    Heap    Prealloc  Eval             392755.176
    Heap    Prealloc  EvalField        563205.486
    Heap    Prealloc  Indexed          509216.472
    Heap    Prealloc  IndexedBoxed     201109.726
    Heap    Prealloc  Iface            526641.511

    This table may look daunting, but it’s simply measuring the various combinations of features that can be used to access FastTuple instances and manipulate them. The key to decoding the results is as follows:

    • Direct – This means the tuple came from a TupleSchema configured to store data off heap.
    • Heap – The TupleSchema was configured for on heap allocation.
    • Allocate – The benchmark includes an allocation operation, both instance creation and any backing allocation.
    • Deque – The tuple was taken from a simple j.u.ArrayDeque.
    • Pool – The tuple was taken from a TuplePool which involves both a j.u.ArrayDeque and a ThreadLocal lookup.
    • Prealloc – The tuple was passed in to the method preallocated.
    • IndexedBoxed – Access is via the boxed get and set methods in the FastTuple base class.
    • Indexed – Access is via the primitive getX and setX methods in the FastTuple base class.
    • Iface – Access is via an interface that the tuple was specified to implement.
    • Eval – Access is via an expression that was compiled into a dynamic class at runtime and then evaluated against the tuple.
    • EvalField – Only for the heap type. The expression is manipulating the tuple fields directly instead of calling accessor methods.

    With that in mind, what kind of conclusions can we draw from these benchmarks? For one, the indexed methods seem to do very well here. One thing to bear in mind, however, is that the indexes being given are specified as constants. In other words, we’re letting the program elide the step of figuring out the index, which might very well be expensive. I think it’s quite likely that in real world situations the best performance can be gleaned from using expressions to reify the runtime information about a tuple type into bytecode.

    Another conclusion is that parallelism has a real cost. As far as the processor is concerned, there’s no substitute for having something sitting in a register ready to go. ThreadLocal is going to incur a lookup; behind the scenes there is a table mapping thread IDs to their particular ThreadLocal variables. This lookup has a cost, and it’s currently one of those things that the JIT can’t look past and elide. That’s why FastTuple is so configurable: to get the best performance in your situation, you need a certain amount of flexibility in the lifecycle and access capabilities of these tuples. So give FastTuple a try, or better yet, fork it and submit a patch.


  • Boundary Meter 2.0 – Build Methodology

    Posted by on March 24th, 2014

    In our earlier “Boundary Meter 2.0 – Foundations” post, Brent included a section on how we currently build our meter software.  Boundary’s customers run a multitude of operating systems (and various versions of those operating systems) and several different CPU architectures, which demands effort and resources from Boundary’s engineering team to provide all the necessary meter variants.  In addition to our current set of customers, the Boundary sales team perpetually engages with potential new customers, some of whom have OS, OS version, and/or architecture requirements that we don’t currently have a meter build for.

    In this post, we’ll explore the meter building process at Boundary and how it satisfies our business requirements while not overburdening engineering team resources.

    Consistency

    Because Boundary’s meter is written in C, it requires proper compilation and linking to create a working executable for each platform we support.  In this situation, one approach for supporting multiple OSes is to use the usual/native development toolchain for each OS.  But this approach requires familiarity with each dev toolchain being used and can be difficult to automate in a consistent fashion across all build environments.  And certainly this approach can grow more complex over time as new OSes are added.

    To accommodate our need to support an ever-changing list of OSes and architectures, the Boundary meter build centers around the GNU build tools (a.k.a. Autotools).  This set of tools provides a big value-add by allowing us to maintain a common, consistent build methodology across all meter builds.  For any particular meter we want to build, it’s as simple as navigating to the top-level source directory and running “make” to crank out a new meter.  Even building the meter for a Microsoft Windows client works in the same fashion!  This approach also lends itself well to consistent automation.

    Speaking of our Windows client meter build, certain builds require some additional effort to set up appropriate cross-compilers or emulated environments.  We currently use an Ubuntu (amd64) system to execute our Windows client meter builds as well as the meter builds for ARM architectures.  MinGW provides a cross-compiler that allows us to use our Ubuntu build system to create a proper Microsoft Windows executable meter binary.  For our ARM-architecture armel and armhf builds (which we currently support for both Ubuntu’s Precise Pangolin and Debian’s Wheezy releases), we leverage the machine-emulation capability of QEMU, making it possible to use the native (not cross-compiled!) ARM toolchain for compiling and linking the meter software.  Additionally, we take advantage of the pdebuild utility for our Debian and Ubuntu builds (including both ARM builds), which simplifies download and installation of the appropriate toolchain and environment.

    Keep it Simple

    In many situations, it takes some additional effort to make areas of a project “simple to use and maintain” as opposed to just “cranking it out” or “making it work”.  Sometimes a bit of up-front planning and thinking is the only additional effort required.  And sometimes it does pay off to “go for simple”!

    One technology we’ve utilized to simplify our meter software builds is virtual machines, specifically QEMU (with KVM).  This allows us to have one physical server with multiple guest OS VMs.  In our case, an Ubuntu (amd64) system is the host OS on our physical build server, and we run virtual instances of FreeBSD, Ubuntu, SmartOS, OpenSUSE, and Gentoo for handling our builds.  This leaves us with only one physical system to admin/maintain, while providing the flexibility to easily add additional OSes (i.e. by creating new VMs) when needed.

    Keeping our number of build VMs to a minimum is another way we’ve simplified.  Because our meter is statically linked with many of the libraries it requires, the meter built under Ubuntu will execute correctly on a number of other Linux distributions we support.  To satisfy our customers who use Red Hat Enterprise Linux (RHEL), CentOS, and Fedora, we offer a Boundary meter RPM package file which is actually created within our Ubuntu build VM using the mock utility.  This approach removes the need for an additional build VM of RHEL (or CentOS or Fedora) to create RPM packages.

    Package Accordingly

    In order to meet our customers’ expectations for proper installation, configuration, and removal of Boundary meter software, we provide an appropriate meter package for each platform we support.  This means providing a Debian package for Ubuntu and Debian Linux distributions, an RPM package for Red Hat, CentOS, Fedora, and OpenSUSE Linux distributions, a properly-signed Windows MSI file for Microsoft Windows, etc.

    While the majority of Boundary’s meter packages are created using usual and well-documented methods, the process we use for packaging the Microsoft Windows meter is interesting in that it’s done entirely under Linux!  Using WINE on our Linux build VM, we can execute the Windows binaries required to create a nice, properly-signed MSI file. Some of these Windows binaries we use include:

    • Microsoft’s HTML Help compiler for creating the usual “compiled help” (.chm) file for Windows
    • WiX for creating the MSI file (we also use wixwine‘s wrappers of the WiX binaries to simplify our execution of those binaries via WINE)

    Once we have an MSI file, we give it a valid signature using the OpenSSL-based signcode utility (a.k.a. osslsigncode).  Now we have a valid package for installation on Microsoft Windows!

    Automate

    For companies with limited engineering resources, automation can be a very valuable player in the build process.  Considering the many combinations of OS/version/architecture our meter software must support and provide packages for, we have build-and-package scripts that are used by Jenkins, a continuous integration tool, to help keep things manageable for our engineering team.

    Our build-and-package scripts provide a simple, command-line mechanism to build the meter software and create the associated package file.  This makes it easy for an engineer or QA team member to generate a meter package (while reducing the potential for mistakes).

    Giving Jenkins the ability to use our build-and-package scripts is where things get interesting.  We define a handful of “jobs” in Jenkins, and these tell Jenkins how to invoke the build-and-package scripts (along with which build VM(s) Jenkins may log in to and use when building a particular meter and package).  Once created, these Jenkins jobs can be scheduled to run automatically and can also be started with one click in a web browser, making it easy for folks outside of engineering and QA to generate a meter package.  We also create a top-level job which simply invokes all the meter jobs, making it easy to build and package all supported meters with a single click.

    Additionally, to help keep the number of jobs manageable, we utilize the “Configuration Matrix” (with “Combination Filter”) job feature of Jenkins.  This allows us to use a single Jenkins job for executing many different meter builds (as opposed to creating many separate jobs to cover all these builds individually).  Here’s what the “Configuration Matrix” looks like for our Debian/Ubuntu build job, reflecting the 12 valid (blue dots) distribution and architecture combinations this Jenkins job will handle:

    [Image: the Jenkins “Configuration Matrix” for our Debian/Ubuntu build job]

    As you can see, this job will build Boundary meter packages (both x86 32-bit and 64-bit) for five Debian and Ubuntu distros, as well as armel and armhf packages for Debian’s Wheezy and Ubuntu’s Precise distros.  Those combinations with a gray dot are not supported and will not be built by this job.

    Concluding Thoughts

    We’ve discussed a number of build-related strategies and practices the Boundary engineering team employs in the meter software development and release cycles.  While these practices usually involve some small level of effort (e.g. adding Autotools support for a new build, adding packaging creation for an OS other than the host OS, automating tasks/builds), they definitely facilitate our ability to enhance and maintain the Boundary meter software in a way which satisfies the business requirements (and customer expectations) while not overburdening engineering’s resources.


  • Boundary Meter 2.0.3 adds STUN-ing new features!

    Posted by on March 17th, 2014

    Boundary Meter 2.0.3 was just released and now includes STUN support. By using STUN, the Boundary Meter can automatically discover its public IP address even when it’s behind a firewall or NAT device.

    Once Boundary knows the public IP address, it can use it to correlate public and private network flows. For instance, if two servers connect to each other via a proxy, Boundary can use the public IP information to assemble the two independent flows on either side of the proxy into a single conversation.

    Ultimately, this provides deeper insight into how servers and virtual instances are communicating. This is always helpful when troubleshooting performance problems. Below is a screenshot of a meter with both public and private IP addresses.

    [Image: Boundary meter view showing both public and private IP addresses]

    In addition to STUN support, Boundary Meter 2.0.3 also includes the following highly valuable enhancements:

    • Ability to enable promiscuous mode for packet capture
    • Option to disable the built-in NTP client
    • Support for running the meter on Linux Mint

    See the release notes for more information about these features and a full list of bug fixes.


  • Introducing the new Boundary User Interface

    Posted by on February 27th, 2014

    Our sole focus at Boundary is to make it easier for IT Ops to troubleshoot and resolve problems. When we redesigned the Boundary User Interface (UI) our goal was to make it easier and faster for IT Ops to find the information needed to resolve IT outages and diagnose performance problems.

    One of the first improvements you’ll notice in the UI is the new navigation model based around the “filter bar.” The filter bar lets users set the “time range” and “source” used to filter the data shown in the view. Below is a screenshot of the new UI and its three main components: the “filter bar” highlighted in red, the “view” highlighted in yellow, and the “navigation bar” highlighted in orange.

    [Image: the new Boundary UI with the filter bar, view, and navigation bar highlighted]

    The filter bar is especially helpful when a user is trying to investigate a problem. Typically, a user will start with the events view to see the event details for the specific time range related to a group of servers or virtual machines. Once a user has an understanding of those events, they can quickly move to the streams view and start examining the flow data statistics to understand how the network is being impacted. With Boundary’s new navigation model, the process is simple because the filter bar preserves the troubleshooting context as users move around the application. Hopefully, the usability flow is so natural that you don’t even notice it as you move around Boundary.

    One other subtle but important change in the UI was moving the navigation bar to a vertical column on the left side (see the orange box above). Several of our beta users said they needed more vertical space when they worked on small monitors. By making the navigation bar vertical, we were able to free up precious vertical space, giving users a much better experience on small monitors.  I want to thank all of our beta users, as feedback like this really helps us improve Boundary. For more release details, visit our what’s new page.

    We hope the new Boundary UI lets you find and resolve problems faster and we would really appreciate your feedback. Please send any comments or questions directly to brandon@boundary.com.


  • Wattpad uses Boundary and AWS to help scale storytelling platform

    Posted by on February 20th, 2014

    Growing social platform anticipates performance issues and fixes problems faster with Boundary.

     

    Mountain View, California — February 20, 2014 Boundary announces that Wattpad, the world’s largest community of readers and writers, is using its service to monitor and improve performance on its Amazon cloud-hosted infrastructure. In the last two years, Wattpad has experienced explosive growth with 20 million people joining the community. “Our tremendous growth stems from the fact that we’re offering readers and writers something they’ve never had before—a direct connection with each other,” says Charles Chan, head of engineering at Wattpad.

    With marketplace traction, however, Wattpad needed comprehensive strategies for performance. The company hosts its website 100% on AWS public cloud, and it needs early insight into anomalies. While Wattpad deploys industry best practices including switching between different AWS zones for optimal reliability, the engineering team is always looking for better visibility into system hotspots and unplanned downtime.

    Wattpad uses several tools for infrastructure monitoring, yet the company didn’t have a consistent method of tracking network bandwidth usage and traffic patterns. The engineering team began hunting for a new toolset and determined that Boundary could fill the gap. “With Boundary, we would have been able to pinpoint performance issues much faster, including determining whether certain availability zones were in trouble.”

    Since deploying Boundary’s cloud-based consolidated operations management software, the company has found additional benefits beyond proactively monitoring uptime on AWS. Wattpad used Boundary to isolate an issue within the search application that caused a system outage. Wattpad also used Boundary to identify possible breaking points in the website to prepare for 2013 holiday-season peak traffic. “Boundary gives us an edge because we can constantly monitor the network traffic across all nodes within the system, and magnify issues that need to be handled quickly,” Chan says. “We can also anticipate the impact of changes as we scale and locate areas for optimizing the website. The Boundary staff worked with us over a period of several months to demo the system in our environment and they’ve always been there to help us quickly when we needed it.”

    “Wattpad’s business is defined by high volumes of daily users and constantly updated data streams and it depends on the scalability and flexibility it gets with AWS,” says Gary Read, CEO at Boundary. “Our cloud-based operations monitoring software is designed to monitor dynamic and always-changing environments like Wattpad. We are excited to help this innovative social platform grow and succeed around the world, and in the cloud.”


  • Boundary Meter 2.0.2 Release Candidate Available

    Posted by on February 4th, 2014

    The latest update to the Boundary Meter, version 2.0.2, is available for general testing. This release adds support for two new operating systems: Mac OS X and OpenSUSE.

    The installation packages for Windows, Linux and SmartOS have been improved to provide easier access to various meter installation options, such as tagging and custom host names.

    Enhancements to the flow engine lower memory and CPU consumption during Denial of Service (DoS) attacks and other high-traffic scenarios. Because the meter generates IPFIX flow records for incoming traffic, a DoS attack generates a correspondingly large amount of statistical traffic. This extra traffic can then snowball and actually worsen the effects of the attack. We worked closely with customers who have experienced DoS attacks (see a previous blog article: DNSimple shares how Boundary “saved our bacon” during a sudden disruption in service) to identify the root causes and mitigate them.

    As a result, the flow engine can now identify and prevent such attacks from generating too much IPFIX data. Among the mitigated attacks are SYN/FIN floods and UDP and ICMP floods. There is also a new limit on the amount of data that can be waiting in memory to be sent to Boundary at any given time, which prevents the meter from using too much memory on a server when outbound bandwidth is restricted.
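    The memory cap can be sketched as a bounded outbound buffer that drops records instead of growing. This is a hypothetical illustration: the names and the limit below are invented for the example, not taken from the meter’s actual implementation.

```c
#include <stddef.h>

#define MAX_PENDING_BYTES 1024  /* illustrative cap, not the meter's real limit */

struct out_buf {
    size_t pending;  /* bytes queued but not yet sent upstream */
    size_t dropped;  /* records discarded because the cap was hit */
};

/* Queue a record for sending; drop it if the cap would be exceeded. */
int out_buf_enqueue(struct out_buf *b, size_t record_len)
{
    if (b->pending + record_len > MAX_PENDING_BYTES) {
        b->dropped++;
        return -1;  /* drop rather than let memory grow unbounded */
    }
    b->pending += record_len;
    return 0;
}

/* Called after bytes drain to the network. */
void out_buf_sent(struct out_buf *b, size_t bytes)
{
    b->pending = bytes >= b->pending ? 0 : b->pending - bytes;
}
```

    Dropping at the producer keeps worst-case memory use constant no matter how slowly the uplink drains.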

    More IP protocols, such as ICMP and IGMP, are supported as well. Previous meter releases forwarded flow info only for UDP, TCP and SCTP flows; 2.0.2 forwards statistics on traffic for all IP protocols.

    If you are a current Boundary user, check out the full release notes. If not, check out Boundary and sign up for a free trial.


  • Boundary Surpasses 400% YoY Growth in Processing of Massive IT Operations Performance Analytics in the Cloud

    Posted by on January 28th, 2014

    After chatting and exchanging emails with some analysts, we decided it was time to publish an amazing internal statistic. We then shared our story with Ben Kepes (@benkepes), resulting in Forbes publishing a great article on us HERE.

    The Boundary service is processing an average of 1.5 trillion application and infrastructure performance metrics per day on behalf of its clients and has computed occasional daily bursts of over 2 trillion metrics. The daily average represents a 400% year-over-year increase, driven by customers’ growing need to deliver high-quality application performance, find problems faster and avoid unplanned downtime. The fastest-growing segment of data processed by Boundary, at more than 500 billion metrics per day (600% growth), comes from Amazon Web Services (AWS). It reflects clients’ expanded use of and confidence in cloud infrastructure, both for transitioning legacy enterprise applications and for building new ones, when paired with the unparalleled visibility Boundary provides.

    Then we started to think about what this meant and wondered what other big-data processing numbers were out there. We turned, of course, to handy-dandy Google and found that, by comparison, NASDAQ processes 2 billion trades daily, Facebook gets 4.5 billion “likes” per day, and Twitter sees 500 million tweets daily. In the words of our CEO Gary Read, “We have completely disrupted the legacy IT Operations software model by removing the need and cost of scoping, procuring, deploying and managing hardware that would typically consume countless hours and other valuable resources… Most of the time spent fixing an application or infrastructure problem is focused on finding the source, and that’s exactly where Boundary helps customers.”

    It’s always fun passing a milestone, but of course it’s more fun when customers also have good things to say. For example:

    “Boundary’s real-time network layer topology has allowed us to provide instantaneous operational value for our AWS implementations,” says Allen Shacklock, lead cloud architect at Scripps Networks Interactive, owner of brands such as the Food Network, HGTV and the Travel Channel. “The benefit of using a SaaS provider to visualize traffic patterns and identify problem areas quickly allows for teams to proactively solve issues without taking on additional management tasks.”

     

    To check out the full release and more customer quotes simply click HERE.

    BTW. Boundary is Hiring. Our Jobvite page is HERE.

    Take your own Test Drive. Sign Up NOW

     


  • Boundary Meter 2.0 – Design

    Posted by on September 27th, 2013

    C is a very simple and powerful language. Because it does not enforce a particular style or organization of code, it can also be very dangerous, making it easy to write spaghetti code. As they say, it gives you just enough rope to shoot yourself with. However, with careful design, a C project can grow from a small tech demo into a large application without too many growing pains.

    Object Oriented Design in C

    For the Boundary Meter 2.0 project, I chose to use an object-oriented style of C programming. Though C is not necessarily an OO language, aspects of OO design are still useful within a C project of any size. The main principles the meter uses are information hiding, object composition and code reusability.

    C does not have a fixed, language-level way of conveying object patterns like specifically object-oriented languages do, so a C coder must define the semantics and syntax by convention. Axel Schreiner’s ‘Object-Oriented Programming with ANSI-C’ is a thorough manual of ways to implement OO techniques in C. For the 2.0 meter, I chose to use the style of libdnet, Dug Song’s network programming library, which is very simple and effective. Consider rand.h, the interface for libdnet’s random number generator class.

    typedef struct rand_handle rand_t;
    
    __BEGIN_DECLS
    rand_t	*rand_open(void);
    
    int	 rand_get(rand_t *r, void *buf, size_t len);
    int	 rand_set(rand_t *r, const void *seed, size_t len);
    int	 rand_add(rand_t *r, const void *buf, size_t len);
    
    uint8_t	 rand_uint8(rand_t *r);
    uint16_t rand_uint16(rand_t *r);
    uint32_t rand_uint32(rand_t *r);
    
    int	 rand_shuffle(rand_t *r, void *base, size_t nmemb, size_t size);
    
    rand_t	*rand_close(rand_t *r);
    __END_DECLS

    To begin, ‘rand_t’ is an opaque handle: a simple pointer. Its contents are not exposed to the user in the header file, or in other words, are effectively private. This makes it easy to change the underlying implementation of the random number generator without changing any user-visible APIs or ABIs: the size of the pointer remains constant, and callers can only interact with the object through the exposed API. The ‘open’ and ‘close’ methods are the OOP constructor and destructor. The object relies on no global state, which allows the library to be used in multiple contexts without unknown side effects. This handle-plus-methods design gives thread-safe, deterministic performance from the generator as long as each thread uses its own rand_t object.
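    The opaque-handle convention is easy to replicate in your own code. Here is a hypothetical toy ‘counter’ class in the same style (not libdnet code; the struct body is shown inline only so the sketch is self-contained, whereas in practice it would live in the .c file, hidden from the public header):

```c
#include <stdlib.h>

/* Opaque handle: callers see only the pointer type. In a real project the
 * struct body would live in the .c file, invisible through the header. */
typedef struct counter counter_t;

struct counter {
    int value;  /* private state, reachable only via the exposed API */
};

/* Constructor, in the style of rand_open() */
counter_t *counter_open(void)
{
    return calloc(1, sizeof(counter_t));
}

/* Method operating on the handle */
int counter_next(counter_t *c)
{
    return ++c->value;
}

/* Destructor, in the style of rand_close() */
void counter_close(counter_t *c)
{
    free(c);
}
```

    Because each counter_t carries its own state, multiple instances can coexist with no shared globals, exactly the property the rand_t design provides.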

    Contrast this with the standard C library’s random number generator, whose design is neither object-oriented nor extensible:

    RAND(3)                  BSD Library Functions Manual                  RAND(3)
    
    NAME
         rand, rand_r, srand, sranddev -- bad random number generator
    
    LIBRARY
         Standard C Library (libc, -lc)
    
    SYNOPSIS
         int rand(void);
    
         int rand_r(unsigned *seed);
    
         void srand(unsigned seed);
    
         void sranddev(void);

    There are two versions of this functionality exposed by the C library. The first pair, srand() and rand(), operates on global state that lives in the C library itself. Thus, if a library or program uses srand() to obtain a deterministic sequence based on a given seed, it is impossible to guarantee that another caller has not called srand() again in the meantime. To solve this problem, rand_r() allows the caller to provide a local seed context. However, by defining the seed context as ‘unsigned’, the implementation details are exposed, making it impossible to implement a future version that generates longer random sequences than an unsigned can represent, usually only 2**31 possible values. Hence ‘bad random number generator’.
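    The contrast is easy to demonstrate. Because rand_r() keeps all generator state in the caller-supplied seed, two callers starting from the same seed always observe the same sequence, no matter what other code in the process does with srand(). A small sketch (sequences_match is an illustrative helper, not a libc function):

```c
#define _POSIX_C_SOURCE 200809L  /* rand_r() is POSIX, not ISO C */
#include <stdlib.h>

/* Run two independent seed contexts in lockstep for n draws and report
 * whether they ever diverge. With rand_r() they never should. */
int sequences_match(unsigned seed, int n)
{
    unsigned s1 = seed, s2 = seed;
    for (int i = 0; i < n; i++) {
        if (rand_r(&s1) != rand_r(&s2))
            return 0;  /* sequences diverged */
    }
    return 1;
}
```

    The equivalent experiment with srand()/rand() cannot be isolated this way, because any other caller can reset the single hidden state at any time.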

    The Mediator Pattern

    Central to Meter 2.0 is a minimal container object. This object is implemented as an opaque struct with a set of methods that operate on it, like the ‘rand’ object above. This object provides system time and event services, abstracts a global meter configuration object, and starts all of the subordinate objects. As an object container, it is the glue that holds the rest of the system together. Other object types such as ‘ipfix_writer’ and ‘flow_table’ encapsulate other portions of the meter’s functionality, but they are unaware of each other, only interacting with the container object handle. Inter-object communication is mediated via methods defined on the container object itself.

    The benefit of this design is that each subordinate object is testable independently of the others. Mocking and testing become a matter of supplying a container object handle that implements only the methods the object under test uses. Each subordinate object is not only testable in isolation, but can be replaced or extended without refactoring any of the other objects or APIs. It is also easy to add new objects without modifying the interface of any subordinate object: new functionality is exposed simply by adding new mediator methods to the container.

    [Figure: meter containers]

    As an example, consider intf_manager, which collects packets from a network interface and passes them to flow_table for processing. The flow_table object exports a method ‘iterate_active_flows’, which, given a callback, iterates over all flows that have been observed sending data in the last second. The intf_manager object instantiates a flow_table for each network interface, then exports a similar ‘iterate_active_flows’ method that walks every internal flow_table. Likewise, the container exposes an ‘iterate_active_flows’ method that currently calls the method on intf_manager. Finally, the ipfix_writer object (which sends stats to Boundary periodically) calls the container’s iterate_active_flows method to get the list of active flows.

    meter ‘container’ interface:

    struct meter;
    struct meter_opts;
    
    struct meter * meter_open(struct meter_opts *opts);
    
    void meter_start(struct meter *m);
    
    void meter_stop(struct meter *m);
    
    int meter_iterate_active_flows(struct meter *m,
        active_flow_cb_t func, void *arg);
    
    void meter_close(struct meter *m);

    flow_table interface:

    struct flow_table;
    
    struct flow_table * flow_table_open(struct meter *m);
    
    typedef void(*active_flow_cb_t)(const struct flow *, const struct flow_stats *delta, void *arg);
    
    int flow_table_iterate_active_flows(struct flow_table *ft,
        active_flow_cb_t func, void *arg);
    
    void flow_table_close(struct flow_table *ft);

    In a future release of the meter, it may be desirable to expose active flows directly from the flow table maintained by the operating system rather than by processing packets. With this design, we can remove intf_manager and replace it with any other object that implements iterate_active_flows, while leaving ipfix_writer untouched: it is unaware of intf_manager, which the container object abstracts away. In fact, as long as the function signature of iterate_active_flows stays the same, ipfix_writer does not even have to be recompiled.
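    In C, that kind of late binding is typically done with a function pointer held by the container. The following is a hedged sketch of the idea; the types and names are illustrative, not the meter’s actual API:

```c
#include <stddef.h>

/* Callback invoked once per active flow (illustrative signature). */
typedef void (*active_flow_cb_t)(int flow_id, void *arg);

/* Any implementation of the active-flow iterator matches this type. */
typedef int (*iterate_impl_t)(void *impl_state, active_flow_cb_t cb, void *arg);

struct meter {
    iterate_impl_t iterate;  /* currently installed implementation */
    void *impl_state;        /* its private state */
};

/* The mediator method: callers such as an IPFIX writer only see this. */
int meter_iterate_active_flows(struct meter *m, active_flow_cb_t cb, void *arg)
{
    return m->iterate(m->impl_state, cb, arg);
}

/* Implementations can be swapped at runtime without touching callers. */
void meter_set_iterator(struct meter *m, iterate_impl_t impl, void *state)
{
    m->iterate = impl;
    m->impl_state = state;
}

/* Example implementation: a fixed list of three flows. */
static int fixed_iterate(void *state, active_flow_cb_t cb, void *arg)
{
    (void)state;
    for (int i = 1; i <= 3; i++)
        cb(i, arg);
    return 3;
}

/* Example callback: count the flows seen via the arg pointer. */
static void count_cb(int flow_id, void *arg)
{
    (void)flow_id;
    (*(int *)arg)++;
}
```

    Installing a different iterate_impl_t at runtime changes behavior for every caller of meter_iterate_active_flows without recompiling any of them.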


    Since the container module mediates inter-object communication, one can also swap implementations at runtime based on observed characteristics of the operating system. Consider a meter with a number of specialized active flow exporters based on different versions of Linux.


    One version of iterate_active_flows may use a method that works more efficiently on newer versions of Linux, while the other uses a slower but more compatible method. This abstraction also allows us to implement functionality as loadable modules, rather than linking all object code at build time. A future implementation of iterate_active_flows could be loaded from a module with support for a newer kernel version, or replaced on the fly with a version that exports test data.

    Events versus Processes versus Threads

    If you want your program to do more than one thing at a time, there are many popular choices: events (where your program schedules concurrent tasks itself), multi-process (where multiple tasks are scheduled by the OS with independent memory space), threading (where multiple tasks are scheduled by the OS in the same memory space), or some combination of these.

    When there are few interactions between tasks, threads and processes can be a simple way to implement concurrency. However, when there is mutable state shared between tasks, such as a configuration object that multiple tasks access, threads and processes are tricky to use safely. Without careful design, your code can become a mess of locks and race conditions. Events have some limitations not present in threads or processes, such as an inability to scale past a single CPU and blocking issues when callbacks are not written granularly enough. However, because they enforce a single flow of execution, events allow for concurrency without requiring locking or atomic operations to safely access shared state.

    The 2.0 meter uses a hybrid of events and threads. Events are used for ‘slow path’ tasks that interact with the common container object. The container object does not implement any locking itself. This is because invariants provided by the event library (libevent) ensure that common objects are not accessed concurrently. Threads are used for the code paths where the performance either needs to scale to multiple CPUs, or needs to be deterministic and non-blockable by other tasks.

    Interaction between threads and events can be tricky, especially in high-performance designs. In the 2.0 meter, ‘interface’ objects run in their own threads, capturing packets and analyzing network flows, one thread per interface. The ‘ipfix_writer’ object runs in the main event loop instead, exporting data once per second back to Boundary. There are two kinds of thread/event interaction patterns used within the meter: callbacks and RCU updates.

    RCU and You

    Periodically, an event looks for changes in the configuration of an interface, such as a new IP address. When these occur, the configuration of the interface thread is updated; however, the thread may still be using the old configuration. The simplest solution might be to wrap all accesses to the interface configuration object in a lock so it can be updated safely. But locks are costly to acquire millions of times per second, and the data is updated infrequently. To solve this, the meter implements a simple form of RCU (read-copy-update).

    struct intf {
        const struct intf_info *info, *new_info;
        pthread_mutex_t new_info_lock;
    };
    
    /*
     * Only called by the fast-path thread.
     */
    const struct intf_info *get_intf_info_fast(struct intf *intf)
    {
        if (intf->new_info) {
            intf_lock_intf_info(intf);
            if (intf->new_info) {
                if (intf->info) {
                    free_intf_info((struct intf_info *)intf->info);
                }
                intf->info = intf->new_info;
                intf->new_info = NULL;
            }
            intf_unlock_intf_info(intf);
        }
        return intf->info;
    }
    
    /*
     * Only called by the slow-path event
     */
    void intf_set_intf_info(struct intf *intf, struct intf_info *intf_info)
    {
        intf_lock_intf_info(intf);
        if (intf->new_info) {
            free_intf_info((struct intf_info *)intf->new_info);
            intf->new_info = NULL;
        }
        intf->new_info = intf_info;
        intf_unlock_intf_info(intf);
    }

    In the common case, the fast-path thread checks whether the new_info pointer is set; if it is not, it is safe to access the interface information directly. When the slow-path event wants to update the interface info, it acquires the lock and updates only new_info. The next time the fast-path thread runs, it acquires the lock as well, moves the new interface info into place, then releases the lock. This means that for each update, the lock only needs to be acquired twice.

    Note that this only solves the access problem in one direction. If the slow-path event needs to read the current interface info, it must additionally acquire the lock for all such accesses to ensure the info is not updated while being read.
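    To see the handoff mechanics in miniature, here is a self-contained toy version that stores a plain malloc’d int in place of struct intf_info. It is exercised single-threaded purely to show the pointer movement; the names are illustrative, not the meter’s:

```c
#include <stdlib.h>
#include <pthread.h>

/* Miniature RCU-style handoff: 'info' is owned by the fast path and read
 * without a lock; 'new_info' is a staging slot guarded by the mutex. */
struct holder {
    int *info;
    int *new_info;
    pthread_mutex_t lock;
};

/* Fast path: adopt any staged update, then read lock-free. The unlocked
 * check of new_info is the point of the pattern: the lock is taken only
 * when an update is actually pending. */
int *holder_get_fast(struct holder *h)
{
    if (h->new_info) {
        pthread_mutex_lock(&h->lock);
        if (h->new_info) {
            free(h->info);
            h->info = h->new_info;
            h->new_info = NULL;
        }
        pthread_mutex_unlock(&h->lock);
    }
    return h->info;
}

/* Slow path: stage a new value, replacing any not-yet-adopted one. */
void holder_set(struct holder *h, int *info)
{
    pthread_mutex_lock(&h->lock);
    free(h->new_info);
    h->new_info = info;
    pthread_mutex_unlock(&h->lock);
}
```

    As in the meter, one update costs exactly two lock acquisitions (one to stage, one to adopt), while the steady-state read path takes no lock at all.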

    Threads and Callbacks

    Periodically, information about active flows is forwarded back to Boundary. A simple way to do this might be to acquire a lock on the flow table to temporarily pause updates, and then scan for flows that have been updated in the last second. This is like a mark-and-sweep garbage collector in reverse: it looks for newly active data structures rather than old ones.

    For the same reasons as above, acquiring the lock for every table access is expensive when data is exported relatively infrequently; one-second resolution is slow when you’re processing millions of packets per second. To avoid the overhead of locks on the flow table, active flow scanning is instead done by the interface thread itself while it processes packets. The interface thread loop looks like this:

    void *interface_thread(void *arg)
    {
        struct intf *intf = arg;
        while (intf->running) {
             process_packets(intf);
             if (intf_milliseconds(intf) % 1000 == 0) {
                 run_timers(intf);
             }
        }
        return NULL;  /* satisfy the pthread start-routine signature */
    }

    When code from the slow path requests a scan for active flows, it registers a callback function with the interface thread. Eventually the run_timers() function is called, at which point all registered callbacks are processed. The interface thread then removes each callback from its queue and continues processing packets.
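    The registration-and-drain mechanic might look like the following toy sketch. It is single-threaded and the names are invented; the real meter additionally has to hand the queue between the slow-path event and the interface thread safely, which is elided here:

```c
#define MAX_CALLBACKS 16

typedef void (*scan_cb_t)(void *arg);

struct cb_queue {
    scan_cb_t funcs[MAX_CALLBACKS];
    void *args[MAX_CALLBACKS];
    int count;
};

/* Called from the slow path: register a scan request. */
int queue_callback(struct cb_queue *q, scan_cb_t func, void *arg)
{
    if (q->count == MAX_CALLBACKS)
        return -1;  /* queue full */
    q->funcs[q->count] = func;
    q->args[q->count] = arg;
    q->count++;
    return 0;
}

/* Called by the interface thread from run_timers(): drain the queue,
 * then return to processing packets. */
void run_queued_callbacks(struct cb_queue *q)
{
    for (int i = 0; i < q->count; i++)
        q->funcs[i](q->args[i]);
    q->count = 0;
}

/* Example callback: count invocations via the arg pointer. */
static void inc_cb(void *arg)
{
    (*(int *)arg)++;
}
```

    Draining once per timer tick keeps the packet-processing loop free of per-packet locking; the scan work runs on the thread that already owns the flow data.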

    One nice feature of the libevent library is the ability to instantiate multiple event loops and run them in blocking or non-blocking mode. The interface thread has its own event loop instance, which it uses to manage timer callbacks internally. The implementation of run_timers() above would then include a call to event_base_loop() with the EVLOOP_NONBLOCK flag set.

    How should you design your next project?

    The techniques used in the 2.0 meter are not unique to C; the meter could have been implemented in almost any language with a similar design. However, C is definitely expressive enough to build object-oriented, loosely coupled code while maintaining high performance. Consider it for your next project!


  • The Infrastructure Engineer’s Guide To Entrepreneurship

    Posted by on September 16th, 2013

    The Infrastructure Engineer’s Dilemma

    This guide is for the infrastructure engineers of the computing world. You might wonder what I mean by infrastructure engineer. Specifically, I’m talking about the people responsible for building and scaling large systems. You build the high-pressure plumbing that underpins the capabilities of high-growth, network-effect companies like Twitter, Facebook, and LinkedIn. And now, more than ever, there is a market opportunity for you to start building companies.

    Traditionally, infrastructure engineers get hired onto an engineering staff once a company has reached certain milestones. The typical scenario is for a company to find product-market fit in as short a period of time as possible. This leads to a rapid accumulation of technical debt that all comes due at precisely the wrong time: once fit is found and adoption starts ramping. Following the traditional script, this is the moment to go out and hire a “neckbeard hacker” to dig out of the technical debt hole.

    For the infrastructure engineer, then, the career choice is either to clean up someone else’s mess at an earlier-stage company and hope that you win the startup lottery, or to join a more established company where the salary is large but the upside much more constrained. In either case, you are still cut off from the majority of the wealth that will be generated. Even if you were to deftly negotiate a good deal with an early-stage startup, the amount of equity on the table would pale compared to that of a founder.

    I posit that a market opportunity exists specifically for infrastructure engineers. The big data movement has driven demand for infrastructure engineers higher than ever, yet there are not enough of us to fill every position. I believe this market mismatch will lead infrastructure engineers to develop products that displace the need for in-house infrastructure work at enterprises.

    The Founder and Company

    The most instructive archetype I’ve found for infrastructure companies is the “Founder as a Service”. Ideally, you have some special skill or experience that makes you uniquely qualified to solve a problem extremely well, and the company serves as the platform that scales your abilities to a wide number of customers. In essence, instead of solving problems for one company exclusively in exchange for a salary, you’re solving a more constrained set of problems for many companies in exchange for revenue. If you can make it happen, it’s a good deal.

    One of the best examples of this sort of founder is Artur Bergman, CEO and co-founder of Fastly. While he was the CTO of Wikia, Artur developed a sort of superpower: through obsessive optimization, he was able to squeeze every last bit of latency and throughput from a Varnish cache machine. He believed he could build an in-house CDN cheaper and at higher performance than a traditional CDN provider could. Turns out he was right, and he founded Fastly to scale his superpower out to many customers at once. Their early t-shirts even carried the tagline “A little bit of Bergman in every byte.”

    It is a pervasive idea in startup land that an inexperienced founder can come out of nowhere and disrupt an entire industry. I used to believe in this myth; I now recognize it as a form of magical thinking. It is, of course, possible, but it is not likely, especially given the pressures of running a VC-funded company. It is very difficult to develop a superpower when you are accountable to financiers.

    The Pitfalls

    The engineering mindset differs drastically from the business mindset. They are different approaches to the same set of problems based on a disjoint set of values. However, the two worlds are not irreconcilable. Given an open mind and experience, the infrastructure engineer can make an extremely effective founder. These pitfalls are by no means exhaustive, however they are things that I’ve either once thought myself or have seen play out in my friends’ companies.

    We Don’t Need Sales and Marketing

    Much ado has been made recently in the tech press about the consumerization of IT. Attendant to this is the trend of the self-selling product. This is a “Field of Dreams” philosophy: build it and they will come. For those engineers who are uncomfortable selling, this is a seductive proposition: all you must do is integrate a credit card signup form. This is the perfect model for selling to engineers in other startups; none of us likes to pick up the phone, and even fewer will answer an email out of the blue. At some point, though, it becomes necessary to move away from selling to other startups and start selling to enterprises. Generally speaking, enterprises will be more resistant to downturns in the funding market. If funding dries up and you still have revenue, there’s still a chance. However, if funding goes away along with your revenue, you will end up in the dead pile.

    Enterprises are prepared to spend money to fix problems. The bigger the problem you can solve, the more money they will be willing to pay. Using the enterprise market to grow average deal sizes sends a strong signal to potential investors that you are maturing well as a company. The drawback, however, is a more involved sales cycle. Enterprise sales require executive sponsorship, in depth support, and perhaps even some custom development. You have to work for their business but it’s worth it. When I talk to other enterprise software companies about their breakdown of customers I tend to hear the same story everywhere. In terms of number of customers the distribution is 80% startups and 20% enterprises. However, something like 80% of the revenue gets generated by those enterprise customers.

    We Rethought X from the Ground Up

    I’ve heard this refrain issued from many engineering-driven companies, where X can be anything from sales, marketing, fundraising, or hiring to the reporting structure of the company itself. Like engineering, these are all well-established disciplines that have their own taxonomy of known good patterns and reams of readily accessible literature. Like engineering, these disciplines take experience to master. So, as an outsider, maybe you can innovate in one of these areas. However, unless that innovation serves the purpose of helping the company create and retain customers, you are really just engaging in NIH (Not Invented Here). Innovation is like a random mutation in this way: your company might end up doing better than its predecessors in a particular area, but the mutation will more likely maim or kill instead.

    Our Technology is Superior

    Your code may be poetry, your architecture may be elegant and scalable, and your frontend may be clean and polished. Ultimately, none of these things will matter if you are not solving a real problem. It’s a traditional role for infrastructure engineers to be brought in to clean up the sins of the founders. These past experiences may goad you into over-engineering things from the get-go. This is not to say that you should ship garbage. There may be very good reasons to choose the latest and greatest in technologies. However, if you and your team care more about the functional purity of your languages than you care about delighting customers, then you will end up on the dead pile.

    Inwardly Focused Culture

    The purpose of the business should be to create and retain customers. The purpose of the business is not to accumulate engineers like trading cards. Its purpose is not to raise staggering amounts of money. Its purpose is not to provide a platform for your ego. You should always be asking how you can do more with less. Can we put off raising this next round until we’ve hit more milestones? Will hiring into this position really contribute to our success? There’s a school of thought in startups that you should never forgo hiring a talented engineer just because you do not have work for her. The worst-case scenario, or so the theory goes, is that you will be acquihired, in which case each engineer is worth anywhere from $500k to $1mm. The assumption here is that the music won’t stop. If you cannot get acquihired and cannot raise more funds, then something will have to give and you will need to lay off all of those talented engineers. It will not be fun. The red flag for an inwardly focused culture is what the company tends to celebrate. Which do you celebrate more: hiring someone Twitter-famous or selling to a happy customer?

    Learn

    You most likely were not an amazing engineer straight out of college. Likewise you will likely not be an amazing founder immediately. There is, however, a wealth of writing on the topic. It is easy to dismiss all popular management and entrepreneurship texts as squishy and contradictory crap because most of it is. However, there are some foundational texts that are worth reading. The closest thing to a bible for a startup founder should be “Innovation and Entrepreneurship” by Peter Drucker.

    As an infrastructure engineer you are a master of technology. You can build things that others deem impossible. Master how to build a company and you will be unstoppable.


  • Introducing Boundary Meter 2.0

    Posted by on September 16th, 2013

    The Boundary Meter is the secret weapon behind Boundary that keeps the topology graphs updated in real-time. To do this, the Meter collects data every second, ensuring you’re never looking at an out-of-date topology view.

    Now, collecting all that data and relaying it back to Boundary is not a job for mere mortals and that’s why Boundary just released Meter 2.0. The new version includes some eye-popping improvements like using up to 90% less CPU and memory! This means Meter 2.0 can now collect more data while putting less load on the host system.

    Meter 2.0 also includes “flow data buffering,” which lets the Meter automatically store data when it can’t connect to Boundary. It also adds the ability to automatically discover re-configured network interfaces and to adjust for timing variations using the built-in NTP client. For more information about this release, check out the Boundary Meter 2.0 release notes.

    Most importantly, make sure to upgrade to Meter 2.0 as soon as possible to take advantage of all these improvements. See the meter documentation for platform specific instructions on how to upgrade.

    If you have any questions, please contact us at support@boundary.com.


  • Friday the 13th brings in more AWS woes

    Posted by on September 13th, 2013

    Today, Amazon experienced a degradation of services — you probably heard about it. It was reported that an encryption key storage service inside Amazon Web Services, the Hardware Security Module appliance, was affected by connectivity problems over a period of an hour and 18 minutes in one availability zone this morning.

    According to the Amazon Service Health Dashboard, the issue seemed to affect only one availability zone, but it then caused failovers to alternative systems for customers who rely heavily on low latency.

    We started to see some issues at 7:32am PST across our AWS customers, and a quick check of Twitter feeds confirmed the issue wasn’t a Friday the 13th hoax.

    AWS Traffic - Dashboard

    Using our Dashboard, we isolated only the Amazon-specific traffic; the dashboard automatically showed us trends that allowed us to compare and understand the difference and the percentage change. This screen compares today’s AWS traffic with yesterday’s – you can see the change is dramatic!

    One of our customers, Rhommel Lamas, a systems operations engineer for 3scale.net, an API management company in Spain, then tweeted at 8:00am:

    “Thanks @boundary to help us find the failing AZ on our US-EAST-1b”.

    Rhommel found his production systems failing over to other availability zones in US-East and US-West – as designed and planned in the event of business-risking operational problems. He said their production system in one section of the B availability zone suffered major latencies as it tried to connect to the D and E availability zones, and the same latency as it attempted to connect to another section of the B zone, according to a map drawn by their Boundary Dashboard.

    Boundary – Application Visualization

    The latencies amounted to 1,260ms, or 1.26 seconds – a crippling delay for the latency-sensitive business of managing customers’ APIs. Such a latency would back up thousands of API requests on highly trafficked websites, putting the customers of 3scale.net at risk of losing site visitors and business. The interesting thing is that Lamas spotted the delays building in the Boundary Dashboard a few minutes before Amazon reported the issue on its own dashboard: he could see their AWS B zone in US-East failing with increasing latency and errors, and machines in different zones losing packets in attempted communications between the zones.

    At Boundary, many of our customers have some portion of their infrastructure deployed in AWS. Looking at the aggregate data coming back from our customers, we can observe and measure the health of the Amazon infrastructure. We did, indeed, observe some interesting behavior.


    One of the best use cases for this kind of issue is our ability to drill down and isolate which AZ (availability zone) is having issues. Amazon randomly assigns customers to AZs, so my US-EAST-1a is not necessarily your US-EAST-1a. Thus the informal method of tweeting to other operations professionals and comparing notes becomes frustrating and error-prone. Boundary cuts right to the data and shows you the affected zones in seconds.


    In this “Streams” screenshot we’ve left in three other major service providers to give a comparison of their behavior – you can see little deviation for them in the periods we’ve displayed.


  • Better Event Management with Boundary

    Posted by on September 9th, 2013

    Boundary AppVis with Event Console

    When it comes to IT Management, everything starts with an event. Events are generated for all kinds of reasons: server crashes, network outages and application failures, to name just a few. A moderately sized IT Environment can easily generate hundreds of thousands of events a day, while some large data centers can generate upwards of a million events a day!

    The real challenge is not just receiving events but making sense of them all. At Boundary, we find that being able to see your application topology is a great way to get an overview of what’s happening in your IT Environment. In Boundary, we refer to this view as AppVis, which graphically depicts how your applications are communicating in real-time.

    While AppVis provides a great high-level view of your IT Environment, it doesn’t replace the need for a traditional event management console. At some point, IT operators need to dive into the event details and see exactly what’s happening. The Boundary Event Console provides this detailed view and lets operators open, close and acknowledge events as needed.

    The real power of AppVis and the Boundary Event Console come when you combine them. In Boundary, we recently added the ability to see events and application topology in one integrated view. By combining these views you can see both the high-level topology and detailed event information in one place. This makes it easier to troubleshoot and understand exactly how an event is impacting your IT Environment.

    To learn more watch this short overview and demo of Boundary or sign up for a free trial.


  • Boundary Shout-Out @VisualDNA Adrian Cockcroft, Netflix

    Posted by on September 4th, 2013

    Adrian Cockcroft of Netflix was in London recently discussing how Netflix manages for scale and complexity. The discussion was captured and posted (see below). Adrian talks about Netflix’s implementation of Cassandra in AWS and discusses Boundary’s ability to measure App Flows in Amazon from zone-to-zone, region-to-region, node-to-node and to the client. He also describes how they monitor an instance’s interactions with third-party services like S3.

    “Boundary can measure flows across all bandwidths at a 1 second update, continuously. I call this Wireshark as a service. But it’s much more than that, really…”

    Adrian.

    Cassandra London @VisualDNA | Speaker: Adrian Cockcroft, Netflix


  • Business Insider and Boundary

    Posted by on August 26th, 2013

    Growing media company uses Boundary’s consolidated operations management service to help resolve issues and outages on its content sites before readers notice.

    Business Insider is a fast-growing business site with deep financial, media, technology and other industry verticals. The flagship vertical, Silicon Alley Insider, launched on July 19, 2007, led by DoubleClick founders Dwight Merriman and Kevin Ryan and former top-ranked Wall Street analyst Henry Blodget. One year after launch, many more verticals joined Silicon Alley Insider, including Money Game and The Wire; these sites have since been re-launched under one brand with a focus on building the leading online business news site for the digital age.

    Business Insider has a 10-person engineering team and hosts its environment through DataPipe’s IaaS platform. While the media company was using tools such as Nagios and Graylog2, the technical team didn’t have a visual picture of network anomalies, which began occurring more frequently as the company grew, according to Ben Sgro, Director of Software Development at Business Insider.

    The company began using Boundary’s consolidated operations management service in April 2013, and it has quickly become part of daily IT operations and event management processes. “We glance up at the monitor and we always know what the network is looking like, and we can see instantly if there’s any sort of odd behavior or if a server is experiencing issues,” says Sgro. Other benefits include:

    Quick resolution of outages: Boundary helped identify the cause of a network outage involving a failed switch, so that visitors to the site were affected only briefly. “Without Boundary, it would have been really hard to know where to start troubleshooting that issue,” Sgro remarks.

    Proactive monitoring: Before Boundary, the IT team would receive emails from editors when the site was down or running slowly. Now, through Boundary’s real-time status of the network infrastructure, they can often identify and fix an issue before staff notices.

    Load testing: Business Insider is also using Boundary to help with deployments and load testing. Boundary provides a view of the overall network behavior so that Sgro and his team can clearly understand network load during testing.

    “Boundary is the first thing we look at every day,” says Sgro. “We are serving about 25 million unique visitors a month. If the site is not up, we’re losing revenue, so it’s extremely important for us to identify anomalies and track network behavior all the time.”

