  • The Power of One-Liner…Metrics!

    Posted on September 12th, 2014

    A developer, a designer, and a data scientist walk into a bar…and then Facebook buys the bar for 3 Billion Dollars!

    I have always been a huge fan of comedians who have the power of one-liners. My earliest exposure to a comedian who performs in this genre was Henny Youngman with his infamous wife jokes: “Take my wife … please.” The recently departed comedian and actor Robin Williams was also a master of the one-liner, with “Reality…what a concept” and “If it’s the Psychic Network, why do they need a phone number?” to quote just a few memorable quips.

    Similar to a comedian’s ability to incite laughter with a single line, a devops engineer is equally adept at issuing a one-liner in Bash, or a similar shell, to query a process or service and collect metrics indicative of its health. Boundary Premium and the Shell plugin can amplify the power of the one-liner metric command by providing a graphical plot of the metric over time.

     

    How does it Work?

    Boundary’s Shell plugin is a generic plugin that allows the use of any program or scripting language to produce metrics for the Boundary Premium product. The plugin relay expects a script or program to send metrics via standard output in the following format:

    <METRIC_NAME> <METRIC_VALUE> <METRIC_SOURCE>\n

    where:

    METRIC_NAME is a previously defined metric
    METRIC_VALUE is the current value of the metric
    METRIC_SOURCE is the source of the metric

    Here is a one-liner example that outputs the current number of running processes:

    $ echo "BOUNDARY_PROCESS_COUNT $(ps -e | egrep '^.*\d+' | wc -l | tr -d ' ') $(hostname)"

    which yields this output:

    BOUNDARY_PROCESS_COUNT 205 boundary-plugin-shell-demo

    We can take this one-liner and then configure the Shell plugin to periodically report and display this metric:

    [Screenshot: Shell plugin configuration form]

    In the configuration form above, we have defined how our metric is collected by setting the Command field to the one-liner, passed as an argument to the bash shell using the -c option:

    bash -c "echo BOUNDARY_PROCESS_COUNT $(ps -e | egrep '^.*\d+' | wc -l | tr -d ' ') $(hostname)"

    The Poll Time field is set to 5 so that the metric command is run every 5 seconds to provide an update of our metric.
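
    Because the plugin simply runs a command and reads standard output, the metric producer does not have to be a shell one-liner at all; any executable that prints the format above will do. As a purely illustrative sketch in Java (JVM_THREAD_COUNT is a hypothetical metric assumed to have been defined in Boundary beforehand):

    import java.lang.management.ManagementFactory;
    import java.net.InetAddress;

    // Minimal sketch: any executable that writes "<METRIC_NAME> <METRIC_VALUE> <METRIC_SOURCE>\n"
    // to standard output can serve as the plugin command. JVM_THREAD_COUNT is a
    // hypothetical metric assumed to be defined beforehand.
    public class ThreadCountMetric {
        public static void main(String[] args) throws Exception {
            int value = ManagementFactory.getThreadMXBean().getThreadCount();
            String source = InetAddress.getLocalHost().getHostName();
            System.out.println("JVM_THREAD_COUNT " + value + " " + source);
        }
    }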

    We can now display our new metric in a dashboard as shown here:

    [Screenshot: dashboard displaying the new metric]

    Exit Stage Left

    If you want to know more about the Shell plugin and view more examples of its use, see the documentation located here.

    Okay, let’s end on a one-liner:

    Two MySQL DBAs walk in to a NoSQL bar, but they had to leave because they couldn’t find any tables!

    I’m outta here, good night!


  • More RabbitMQ than Sainsburys

    Posted on September 11th, 2014

    There is an old slang expression that I first heard while growing up in South London: “more rabbit than Sainsburys”. It was used in reference to a person and was a way of saying that they talked too much, almost incessantly, and with far more quantity than quality. For those of you not familiar with the expression, it might help to know that Sainsbury’s was and still is a food retailer in the UK, and that when I was a child a lot of people ate rabbit; in fact, rabbit was one of the cheaper meats and far more available than chicken, for example. The rabbit in the expression refers to “talk”, from the rhyming slang “rabbit and pork” for talk.

    This week I was seriously impressed with a rabbit but not of the furry kind. It was in fact a software package called RabbitMQ that I am sure many of you know already. The reason that I was so impressed was that, despite my questionable competence in Java that I mentioned a couple of weeks ago in another blog,  I was able to create a working RabbitMQ load simulator in just a couple of days, starting from a basis of no understanding at all of RabbitMQ and not even knowing where to find the download. Perhaps in my subconscious I still had some primal knowledge from my brief exposure to IBM MQ Series back in the 1990’s and that in some way helped me get to success quicker, but I was amazed how easy it was to install, use and understand. The documentation was accurate and even entertaining at times. I even enabled the management plug-in for RabbitMQ to watch my application in action.

    [Screenshot: RabbitMQ management plug-in in action]

    From the above picture, you will also see that I enjoy listening to Ultravox and run Windows, two clear signs that I am definitely not a professional developer.

    The reason that I was doing this RabbitMQ exercise was twofold:

    1. To gain an understanding of RabbitMQ as part of our plans to expand the set of plug-ins that we will be providing in Boundary Premium. Right now, RabbitMQ is in the top 5, possibly the top 3.
    2. To simulate a Java workload that used a messaging service for data ingress and that would use many threads to process those messages. I wanted to create a simulation environment to replicate “Software Bottlenecks”. I was particularly pleased with the way that the application increases the message rate gradually to a peak at the mid-point of the test and then declines after the mid-point. See the smooth curve in the top left of the picture above.

    double sleepTime = distanceFromMiddle/(messages/2)*define.INTERVAL + define.INTERVALBASE;
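
    For context, here is a purely hypothetical sketch of how that pacing line might sit inside the publish loop; the INTERVAL and INTERVALBASE values and the publish call are assumptions of mine, not the original simulator’s code:

    // Hypothetical sketch of the ramp-up/ramp-down pacing described above: the sleep
    // between publishes shrinks towards the mid-point of the run and grows again
    // afterwards, producing the smooth peak in the message-rate curve.
    public class PacedPublisher {
        static final double INTERVAL = 10.0;     // extra milliseconds of sleep per unit of distance (assumed value)
        static final double INTERVALBASE = 1.0;  // minimum sleep in milliseconds (assumed value)

        public static void main(String[] args) throws InterruptedException {
            int messages = 10_000;
            for (int i = 0; i < messages; i++) {
                double distanceFromMiddle = Math.abs(messages / 2.0 - i);
                double sleepTime = distanceFromMiddle / (messages / 2.0) * INTERVAL + INTERVALBASE;
                // publish(i);  // placeholder for the RabbitMQ publish call
                Thread.sleep((long) sleepTime);
            }
        }
    }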

    Anyway, my exposure to RabbitMQ was fulfilling and enjoyable. It helped me understand more clearly that the overall customer experience (mine in this case) is a major part of a product or service. I am sure that I could have tried to install any other messaging service and achieved the same result but would I have had such a good experience that I wanted to blog about it?


  • Swisscom selects Boundary for Cloud Monitoring and Operations Management

    Posted on September 11th, 2014

    Swisscom, one of the largest providers of IT services in Switzerland, has selected Boundary to provide monitoring and operations management of its future cloud infrastructure and in parallel, Swisscom Ventures has made a strategic investment in Boundary.

    In a strategic effort to offer its customers additional solutions and take advantage of its unique capabilities, the company is in the process of building a public cloud primarily based on OpenStack, as well as offering multiple new cloud-based services. Boundary, designed specifically to monitor modern, dynamic infrastructures, was selected as a core component of this new solution.

    To maintain operational awareness, complete visibility is needed across all components of the infrastructure. Boundary accomplishes this by combining metrics data collection at one-second resolution with event/log ingestion and processing. By collecting and processing hundreds of millions of metrics and events every second using a highly scalable low latency streaming technology, organizations are able to find problems that were previously invisible, hidden in averages and one-minute sampling.

    Boundary includes a rich set of APIs, which will enable Swisscom to easily integrate monitoring into their cloud offering. These, coupled with the solution’s multi-tenant capabilities, make it an ideal fit for managing Swisscom Cloud, which will eventually host hundreds of thousands, and possibly millions, of server instances.

    “The dynamic nature of cloud infrastructures can provide a monitoring and management challenge,” said Torsten Boettjer, Head of Technical Strategy Cloud at Swisscom. “With Boundary, we’ll be able to provide our customers with the best possible levels of service for their applications that will be on our Cloud infrastructure.”

    Swisscom Ventures invests in innovative areas that are strategic for Swisscom with cloud-based solutions being a focused area of investment.
    “We were impressed not only with the Boundary solution but also with the management team and their innovative views of the future of Web-scale IT,” said Stefan Kuentz, Investment Director at Swisscom Ventures. “We are investing in their vision and are excited to support them on this journey.”


  • Customer success and the Billy Bullshitter experience

    Posted on September 5th, 2014

    About a week ago, I wrote a blog post about delivering high quality customer service and getting our entire organization aligned behind this goal.

    I have a few updates to share on this subject both good and errr, not so good.

    Let’s start with some dirty laundry. One of the success mantras for our team is that we should never have the same question asked more than once by our customers.

    If our product and documentation are not simple, obvious and helpful enough to enable customers to self-serve, then we are failing. If a customer asks us something, then that becomes an opportunity to improve – either the product or the docs. Remember how I talked about being proactive?

    Well, it wasn’t fully happening. In some areas we were doing really well, but in others we were not. I’m not sure exactly why – maybe individuals didn’t feel empowered to be proactive, maybe they were worried about making a mistake, maybe they were just taking the easy path – but whatever the reason, I consider it my failing that I have not managed to communicate and motivate the entire team to follow this path. (It was also a failure of some of our internal reporting metrics, where we didn’t have the full visibility needed – which we do now.) We need to improve.

    Another area we tackled this week is how we treat users when they sign up for our service. We tested many other monitoring solutions and in every case we received only automated emails. Personally, I don’t mind automated emails – they are maybe a necessary evil – but what I do object to is automated emails that try to look like they came from a real person. One large, very well known application monitoring vendor is clearly expert at this…trying to make automated emails look like a real person.

    We decided to try to do better, so we tasked our customer and technical success teams with responding manually to every user who signs up. We are not sure if this is scalable, because the volume is already a little overwhelming, but we tried to make it as easy as possible for our team to cope by giving them several different templates to use as a base and modify accordingly. So, our current process is:

    1. User registers on the web site.
    2. User gets immediate access to the product (no validation needed).
    3. This creates a “lead” in salesforce that, depending on geography, is assigned to a Customer Success Manager.
    4. The CSM looks at the lead and at the individual who has signed up, sends an introductory email, and also introduces the technical success manager.
    5. If we haven’t heard anything, the technical success manager follows up the next day, offering any assistance the user may need and asking for feedback.

    We actually turned off the automated “welcome” email as part of doing this…the welcome email is now from a human.

    Of course, no good deed goes unpunished. A user recently challenged us publicly…even referring to us as Billy Bullshitters (I love that)…suggesting that our emails were automated because they were sent from salesforce. Yes, our CSMs sit in front of salesforce, and yes, they use templates to help them deal with volume, but no, the emails are not automated; they are real people typing and pushing buttons to try to give just a little more personal service.

    We’ll see how it goes, but I am hoping that we can continue this path and not resort to automated everything as so many do – but then again I could be delusional and the robots may win.

    Would love to hear from others that have tried to be “better than the pack” – and will keep updating on progress.


  • Rules of Thumb

    Posted on September 4th, 2014

    Who ever said that 60% CPU was too high?

    In performance management there is something known as a “rule of thumb”. It is basically a guideline that has no real justification e.g. “as a rule of thumb, do not run your servers higher than 60% CPU busy”. People have lived by these rules for many years and rarely challenge them. Occasionally people will ask “why?” and be answered with a look of disdain as if they had asked a really stupid question.

    Truth is: many people don’t know why rules-of-thumb are valid in any circumstance. They just use them as they have no better basis for making judgments on performance.

    Recently at Boundary we had a situation where one of our own systems seemed to be creaking at the seams due to an apparent overload situation. The belief was that there was simply too much work coming in to the system and that we may need a bigger box or more boxes to handle it. The basis of that belief is in the title of this blog. When the application was starting to lag, the server CPU was hitting over 60% utilization. All other indicators were healthy – no memory issues, no disk queuing, and load average was low as well. As the application is a time critical streaming service, perhaps there weren’t enough CPU cycles in each nanosecond to process all the work in time. As a result, some people hypothesized that it must be a CPU issue based on the general rule-of-thumb.

    To explain the 60% rule-of-thumb you can look to a branch of mathematics, known as queuing theory. It is something that I actually studied a very long time ago and understood just enough to help me project an air of expertise when I was a performance management consultant.

    The one formula that I can still quote is: “total time equals service time over one minus the utilization”, i.e. T = S / (1 − ρ). It uses the Greek letter “rho” (ρ) for the utilization, and this use of a dead-language symbol adds to the mystique of the formula. The formula is for a single-server queuing system with an “exponential arrival rate” and “exponential service times”. It turns out that a CPU with a single processing unit and a varied workload fits this model quite well. If we use 1 second as an example service time, we can estimate how long a “transaction” would take relative to the utilization of that single CPU.
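
    For reference, these are the relationships the table below applies, with S the service time, ρ the utilization, T the total time, and W the time spent waiting in the queue:

    T = \frac{S}{1 - \rho}, \qquad W = T - S = \frac{S \rho}{1 - \rho}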

    Utilization    Total Time                      Time in Queue
    20%            1 / (1 - 0.2) = 1.25 seconds    0.25 seconds
    40%            1 / (1 - 0.4) ≈ 1.67 seconds    0.67 seconds
    60%            1 / (1 - 0.6) = 2.50 seconds    1.50 seconds
    80%            1 / (1 - 0.8) = 5.00 seconds    4.00 seconds

    If you draw this as a curve and fill in a few more data-points:

    [Chart: total time vs. CPU utilization]

     

    You can see that the curve has a significant acceleration phase in the 60 – 80 percent utilization range, which is why 60% is viewed as a good rule-of-thumb for CPU utilization. But the key is that this ROT is based on an “exponential arrival rate” and “exponential service time” for a single-server system. In our case, we are running a high volume service that runs in Java Virtual Machines and is deployed on physical servers that have up to 40 CPUs. There is no other work on those systems. It turns out that none of the characteristics of arrival rate or service time applied in this case, primarily due to the nature of the streaming application and the way that the JVM manages its own workload. As a result, the 60% CPU utilization ROT was unlikely to be valid, and so we spent more time diagnosing the issue.

    Cliff Moon, our CTO, found the real issue (thanks Cliff!) and it had nothing to do with the CPU. In fact, it had nothing to do with any hardware resource. We had a software bottleneck in the system: a simple JVM setting that had been propagated from when this particular application ran on smaller servers. By increasing the maximum number of threads that the JVM could use, the application immediately started to consume more CPU and the workload lag was history. In a sense, that software bottleneck was like a trap that sprang when the workload hit a certain level. Right up to that level, everything was fine; over that level, the lag started to appear. It was pure coincidence that it happened at 60% CPU. In this case 60% CPU utilization was not a problem at all and, in fact, was too low!

    The final result is that we have proven that we can now run those servers at almost 100% utilization on average over long periods, and have achieved workload rates that we had never seen before. The takeaway here is that if you really want to maximize performance then you may need to challenge some long held rules-of-thumb and examine performance in a much more clinical way.


  • Actors, Green Threads and CSP On The JVM – No, You Can’t Have A Pony

    Posted on September 3rd, 2014

    I really wish people would stop building actor frameworks for the JVM. I know, I’m guilty of having done this myself in the past. Invariably, these projects fall far short of their intended goals, and in my opinion the applications which adopt them end up with a worse design than if they had never incorporated them in the first place.

    Let’s take a step back, however. What the hell are actors, and why is everyone so hot and bothered by them? The actor model describes a set of axioms to be followed in order to avoid common issues with concurrent programming, and in the academic world it provides a means for the theoretical analysis of concurrent computation. Specific implementations can vary substantially in how they define actors, and in the restrictions on what actors can and cannot do; however, the most basic axioms of the actor model are:

    1. All actor state is local to that actor, and cannot be accessed by another.
    2. Actors must communicate only by means of message passing. Mutable messages cannot be aliased.
    3. As a response to a message an actor can: launch new actors, mutate its internal state, or send messages to one or more other actors.
    4. Actors may block themselves, but no actor should block the thread on which it is running.

    So what are the advantages to adopting the actor model for concurrent programming? The primary advantages center around the ergonomics of concurrency. Concurrent systems are classically very hard to reason about because there are no ordering guarantees around memory mutation beyond those which are manually enforced by the programmer. Unless a lot of care, planning and experience went into the design of the system, it inevitably becomes very difficult to tell which threads might be executing a given piece of code at a time. The bugs that crop up due to sloppiness in concurrency are notoriously difficult to resolve due to the unpredictable nature of thread scheduling. Stamping out concurrency bugs is a snipe hunt.

    By narrowing the programming model so drastically, actor systems are supposed to avoid most of the silliness encountered with poorly designed concurrency. Actors and their attendant message queues provide local ordering guarantees around delivery, and since an actor can only respond to a single message at a time you get implicit locking around all of the local state for that actor. The lightweight nature of actors also means that they can be spawned in a manner that is 1:1 with the problem domain, relieving the programmer of the need to multiplex over a thread pool.

    Actor aficionados will probably reference performance as an advantage of actor frameworks. The argument for superior performance of actors (and in particular the green thread schedulers that most actor implementations are built upon) comes down to how a server decomposes work from the client and how that work gets executed on a multi-core machine. The typical straw-man drawn up by actor activists is a message-passing benchmark using entirely too many threads, run on a cruddy MacBook. It’s easy to gin up some hackneyed FUD against threads to market an actor framework. It’s much harder to prove a material advantage to adopting said framework.

    Unfortunately, actor frameworks on the JVM cannot sufficiently constrain the programming environment to avoid the concurrency pitfalls that the actor model should help you avoid. After all, within the thread you are simply writing plain old java (or scala or clojure). There’s no real way to limit what that code can do, unless it is explicitly disallowed from calling into other code or looping. Therefore, even the actor frameworks which use bytecode weaving to implement cooperative multi-tasking amongst actors cannot fully guarantee non-blocking behavior. This point bears repetition: without fundamental changes in how the JVM works, one cannot guarantee that an arbitrary piece of code will not block.
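
    To make that point concrete, here is a small, hypothetical illustration (not any particular framework’s API): an “actor” whose handler makes an ordinary blocking call stalls the scheduler thread it shares, and nothing in the JVM can statically rule that call out.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical illustration: a single-threaded executor plays the role of an
    // actor scheduler. The first "message handler" blocks, so the second message
    // sits in the queue; the JVM has no way to forbid the blocking call.
    public class BlockingActorDemo {
        public static void main(String[] args) {
            ExecutorService scheduler = Executors.newSingleThreadExecutor();
            scheduler.submit(() -> {
                System.out.println("handling message 1");
                try {
                    Thread.sleep(60_000); // plain old Java: a blocking call inside the handler
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            scheduler.submit(() ->
                System.out.println("message 2 only runs after the blocked handler finishes"));
            scheduler.shutdown();
        }
    }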

    When making engineering decisions we must always be mindful of the tradeoffs we make and why we make them. Bolt-on actor systems are complex beasts. They often use bytecode weaving to alter your code, hopefully without altering its meaning. They quite often rely on Java’s fork/join framework, which is notorious for its overhead, especially when it comes to small computations, and is fantastically complicated when compared to a vanilla thread pool. Actor systems are supposed to make parallel computation dead simple, but every lightweight threading system on the JVM that I’ve seen is anything but simple.

    Lest you think that I am a hater, I genuinely like actor oriented programming. I have been an enthusiastic Erlang programmer for a number of years, and I used to get genuinely excited about the activity around adding this paradigm to Java. However, I am now convinced that without support from the platform these lightweight concurrency libraries will always be a boondoggle. I’m not the only one to make this observation, either.

    We shouldn’t be trusting vendors who are pushing manifestos, decades old tribal knowledge about thread implementations, and misleading benchmarks. We should be building the simplest possible systems to solve our problems, and measuring them to understand how to get the most out of our machines.


  • Microservices, or How I Learned To Stop Making Monoliths and Love Conway’s Law

    Posted on August 27th, 2014

    After reading this post I can’t help but feel that the author has missed the point of having a microservices architecture (he misses other things as well, particularly the fact that there are a lot more folks out there writing software than just neckbeards and hipsters), especially considering his suggestion of service objects as a way to implement microservices. Most importantly, the reason to prefer a microservice-based architecture is not encapsulation, data locality, or rigid interfaces. Microservice architectures are embraced because of Conway’s law.

    organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations

    —M. Conway

    An important corollary of Conway’s law, in my experience, has been that teams will tend to scale about as well as the software that they create. Teams working on a monolithic codebase will inevitably begin stepping on each other’s toes, which requires more rigid software engineering processes, specialized roles such as build and release engineering, and ultimately a steep decline in the incremental productivity of each new engineer added to the team.

    The OP claims not to know of a good definition for microservices. I’d like to propose one: a microservice is any isolated network service that will only perform operations on a single type of resource. So if you have the concept of a User in your service domain, there ought to be a User microservice that can perform any of the operations required to deal with a user: new signups, password resets, etc.
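
    As a rough illustration of that definition (the names below are hypothetical, not from any real codebase), the entire surface of such a service might look like this:

    // Hypothetical sketch of the definition above: one isolated service, one
    // resource type (User), and every operation on that resource behind one interface.
    public interface UserService {
        User signUp(String email, String password);
        void requestPasswordReset(String email);
        User findById(String userId);
        void deactivate(String userId);
    }

    // Minimal resource type so the sketch is self-contained.
    record User(String id, String email) {}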

    This definition jibes well with Conway’s law and the real reasons why microservices are good. By limiting services to operating on a single type of resource we tend to minimize the interaction with other components, which might be under parallel development or not even implemented yet. Every dependency must be carefully considered because it adds overhead for the implementor, not just in code but in communications. Indeed a microservice architecture resembles what your team actually is: a distributed system composed of mostly independent individuals.

    And to the claim that microservices introduce a distributed system into what was once a non-distributed environment, all I can say is this: anyone building software delivered via the web who doesn’t think they are working on a distributed system fundamentally misapprehends the nature of what they’re doing. We’re all distributed systems developers now, whether we realize it or not.


  • The New Factory Floor

    Posted on August 26th, 2014

    I get a lot of weird promoted tweets that I often make fun of; however, one I got last night made me think.

    [Screenshot: the promoted tweet]

    So, ostensibly this is meant to entice potential job candidates. However, the only thing discussed is the set of technologies in use. There’s no mention of the mission, what the company does, or who you’d be working with. This got me thinking about a number of trends I’ve seen recently in the technology world: the rise of so-called “hacker schools”, the talent crunch, and the growing popularity of server-side JavaScript. Suddenly, to be eminently employable at a relatively high level of income, one need only attend a 2-3 month course, learn JavaScript, and put together a few portfolio projects. The ability to do it all while learning only one language lowers the barriers to entry further than they’ve ever been.

    These trends remind me of the first rise of manufacturing jobs in the US after the Second World War. With almost zero education necessary, the average American had little difficulty securing a manufacturing job with pay and benefits that could readily support a middle-class lifestyle. There are major differences, of course, between then and now. No matter how drastic the talent shortage seems to be, it seems unlikely that we will ever see numbers as high as 25% of all jobs going to programming, which was the peak for manufacturing.

    What will be interesting to watch, and what ultimately concerned me about that job advertisement, is how much the supply of programmers will dictate technology choice. I’ve long argued against hiring based on familiarity with any particular technology, instead favoring the domain knowledge, ability and motivation of a candidate. However, many firms may see an advantage in standardizing their technology stack around something that they can specifically recruit towards. It will also be interesting to see how long it lasts. Market imbalances like the talent crunch cannot last forever. If demand doesn’t collapse due to larger circumstances such as a bubble burst, then other firms will step in to help automate away what would normally be hired for. One could argue that most SaaS companies are already doing this, piece by piece. However, unlike in the manufacturing industry, roughly the same set of skills is needed to displace a job as to do it.


  • Sometimes even the easy things can take a long time

    Posted on August 23rd, 2014

    The other day I decided to find out how easy or how difficult it was to get a custom metric in to our newly launched Boundary Premium service. I decided to do this totally on my own without any help or guidance from others within Boundary. I also decided to do it in Java, a language that I am reasonably proficient in but certainly not proficient enough as this experience proves.

    I read the documentation and it effectively said that I should HTTP POST a simple JSON object with the metric ID and the metric value to the measurements endpoint. The metric definition is set up through the application interface and was very straightforward.

    [Screenshot: Easy Things - Settings]

    (I later found out that I could do this dynamically from my application if I wanted).

    So, I wrote some code, rewrote some code, wrote some more code, deleted a lot of code and….3 hours later, I had managed to send some metric values. Why did it take so long? Because I didn’t know how to authenticate to a web site using basic authentication. In all my previous applications that had authenticated to a web site, I had used an authenticator method in my code. As it happens, that method does work with Boundary Enterprise but it didn’t work with Boundary Premium.

    It was basically because I didn’t really know what I was doing. Luckily the internet has all the answers and eventually I found the right one. So, to save anyone else falling into a similar three-hour trap, try this:

    import org.apache.commons.codec.binary.Base64;

    ...

    // Credentials are the Boundary account email and API token, joined by a colon.
    String authString = "{my email address}:{my api token}";

    // Base64-encode the credentials and turn the encoded bytes back into a string.
    byte[] authEncBytes = Base64.encodeBase64(authString.getBytes());
    String authStringEnc = new String(authEncBytes);

    // con is the HttpURLConnection used to POST the measurement.
    con.setRequestProperty("Authorization", "Basic " + authStringEnc);
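
    For anyone who wants the whole round trip in one place, here is a hedged, self-contained sketch of what the post describes: building the JSON payload and POSTing it with Basic auth. The endpoint URL and JSON field names are placeholders of my own, not taken from the Boundary documentation, so check the docs for the real ones.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.apache.commons.codec.binary.Base64;

    // Self-contained sketch of POSTing one custom measurement with Basic auth.
    // The endpoint URL and JSON field names below are placeholders, not the
    // documented Boundary API; substitute the real values from the docs.
    public class SendMeasurement {
        public static void main(String[] args) throws Exception {
            String authString = "{my email address}:{my api token}";
            String authStringEnc = new String(Base64.encodeBase64(authString.getBytes()));

            String json = "{\"metric\": \"MY_CUSTOM_METRIC\", \"measure\": 42.0, \"source\": \"my-host\"}";

            URL url = new URL("https://api.example.com/v1/measurements"); // placeholder endpoint
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setRequestProperty("Authorization", "Basic " + authStringEnc);
            con.setRequestProperty("Content-Type", "application/json");
            con.setDoOutput(true);

            try (OutputStream out = con.getOutputStream()) {
                out.write(json.getBytes("UTF-8"));
            }
            System.out.println("Response code: " + con.getResponseCode());
        }
    }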

    I am sure that many experienced Java programmers will be amused at my lack of basic knowledge but sometimes we all need a little helping hand. I hope that the above will help someone somewhere get their custom metrics in to Boundary Premium faster than I did.

    Having said that, it was pretty straightforward once I had got over the rookie hurdle!

    [Screenshot: Easy Things - Dashboard]

  • Green cards and metrics

    Posted on August 20th, 2014

    Quick story….last November I got married, went on honeymoon and when returning to the US (via Puerto Rico), realized that I had left my green card at home (guess I had other things on my mind). Several hours and multiple hundreds of dollars later, I was allowed to leave immigration at PR and return home – having been told that they had to cancel my green card because they could not let me in to the country without it.

    Now, I won’t begin to discuss the preposterous nature of this situation. You have all my records on file – my iris scan, my fingerprints, my passport, my countless trips out of the country over the last 13 years (and of course, many other things that I don’t even know about). I was even a member of the Global Entry Trusted Traveller program (after going through many background checks). But no, because I did not have that little piece of plastic the whole system broke down.

    I won’t bore you with the ensuing details except to say that it is now 9 months later, I have been to the immigration office many times, and I STILL don’t have my green card. Which means that every time I travel, I have to allow an extra hour or two to get through immigration, because they always refer me to secondary; eventually someone looks me up on a computer (imagine that) and says “OK, you’re free to go”. My family now comes through immigration separately from me because they are fed up with waiting.

    Long story but I mention this purely because I was planning to stay in SF tonight with my wife/daughter but realized that I had an appointment with immigration early tomorrow morning in San Jose so instead I am at my house on my own, writing this post.

    It may of course all be a plan by our investors because when I am home alone, I tend to work and then work some more.

    Tonight it is metrics. I happen to love numbers – my kids think I am strange because of that and the rest of my family often refers to me with strange nicknames. But…and here is my question…..

    I would like to procure a simple-to-use SaaS solution for metrics collection and reporting, with easy-to-implement connectors to the common tools that we use for our business…salesforce.com, Totango, Pardot, QuickBooks, Recurly, Desk.com, etc.

    I don’t want a downloadable Windows package (SiSense), I don’t want to spend a small fortune (Domo), and I don’t want something where I build my own connectors (GoodData) – please can I just get something that works and wake me from my Excel nightmare!

    Add a comment or email me…I’ll personally send a bottle of wine to anyone who recommends something that I end up actually using.


  • Customer success is all inclusive

    Posted on August 19th, 2014

    I will use this blog to chat about life as the CEO of a startup. It will give insights into what really goes on, which is often (always?) very different from the marketing rhetoric.

    It might not always be enlightening, it might not always be the best-written post, and you might think it a complete waste of words, but it will be real and honest.

    One of my areas of focus right now is to ensure that everyone at Boundary is constantly thinking about how to always be improving our customer experience.

    When I discuss this with others on our team, they tell me that we must create defined programs and actions for individuals to take. I know that they are correct, and I’m fortunate to be working with people who can implement my ranting, but what I really want is for everybody who works at Boundary to have this as an underlying philosophy in everything they do. I don’t want us to wait to be asked; I want everyone to be proactive…see something that can be improved? Then take action.

    We want our customers to never need to speak to us; of course we love communicating with our customers and we are constantly seeking feedback (how else do you learn?) but we want our products to be completely intuitive and always provide the answers to questions that the customer needs.

    Our customer success team works from the principle that we should never be asked the same question more than once. Either the product experience should be improved to ensure the question doesn’t need to be asked or docs should be updated. “How to” questions from customers are a huge opportunity to improve.

    But, the other area that I think might come as a surprise to some, is that this is an all-inclusive philosophy. It doesn’t matter whether you work in engineering, marketing, sales, customer success, product management, finance, operations, HR or anywhere else, every single person at Boundary can impact how our customers perceive us and therefore we each have a responsibility to play our part.

    A great product followed by incorrect invoicing can leave a bad taste. Misleading content on our web site can get the relationship off on the wrong foot. A support rep that commits to “get back to you tomorrow” and then takes off for the weekend is frustrating and annoying. A customer success rep that doesn’t return your email quickly makes you feel like a low priority.

    A customer said to me once “I know you must actually be really busy, but never once have you made us feel that you have anything else to do that is more important than we are”.

    That’s how I want our customers to feel.


  • Free Monitoring <3

    Posted on August 7th, 2014

    When we talk to our customers, there are a few things we’ve heard above all else over the last several months: folks wanted a free offering, they wanted host-level and generic metrics, and they wanted it all to be dead simple to set up and use. We listened – which is why we’re excited by this week’s release of free server monitoring. And early feedback has been fantastic.

    And it’s only going to get better from here. Sign up today and get your 10 free servers.

    10 Servers Free = Monitoring <3


  • Erlang MapReduce Queries, MultiFetch and Network Latency with Riak

    Posted on June 25th, 2014

    I know, you’re looking at your calendar; let me be the first to assure you it’s not 2011. I recently had the need to write some Erlang MapReduce queries for Riak, and it was a bit of an adventure. The Riak MapReduce documentation is good but generally focused on JavaScript. If you’re using Riak, it’s quite possible you’ve never had the need to use its MapReduce capabilities. We hadn’t really used it at Boundary before I dug into some performance problems, and it’s probably not regarded as one of Riak’s strengths. With that said, though, it’s a nice feature and was worth some investigation.


    Slow code

    To provide a bit of context let me first describe the performance problem I was investigating. Boundary customers were experiencing poor response time from a service that is responsible for managing metadata for Boundary meters. The service is called Metermgr and it’s a webmachine/OTP application that relies on Riak for persistence and exposes meter metadata with a REST interface.

    I noticed that as the set of meters for an organization grew, there appeared to be a simple regression in a certain query’s response time. For queries with as few as 200 keys, response time was between 2 and 4 seconds. After taking a look at the code I was able to pinpoint the cause of the slowdown to a function called multiget_meters. Unfortunately, this function didn’t multiget anything; rather, it fetched the keys iteratively, one by one, oof.



    Anyway, my initial thought was, “I’ll just use MultiFetch.”

    Does Riak support MultiFetch/MultiGet?

    If you’re familiar with the more popular Riak clients or search around the internet for “riak multiget” you might get the impression that Riak supports retrieving multiple values in a single HTTP or Protocol Buffers request, sometimes referred to as “multiget or multifetch”.

    Unfortunately that’s not the case; take a look at the source and you’ll see that Riak itself doesn’t support these capabilities. Rather, some Riak clients provide this functionality by parallelizing a set of requests and coalescing the results. The riak-java-client is one such example.
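
    The client-side pattern is simple enough to sketch. Here is a hypothetical illustration in Java of the “parallelize and coalesce” approach, where fetchOne() stands in for whatever single-key fetch your Riak client exposes (it is not a real riak-java-client call):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical sketch of client-side "multifetch": issue the single-key fetches
    // in parallel and coalesce the results. fetchOne() is a placeholder for a real
    // Riak client's single-key fetch.
    public class MultiFetchSketch {
        static String fetchOne(String key) {
            return "value-for-" + key; // placeholder for a real Riak fetch
        }

        static List<String> multiFetch(List<String> keys) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(16);
            try {
                List<Future<String>> futures = new ArrayList<>();
                for (String key : keys) {
                    Callable<String> task = () -> fetchOne(key); // key is effectively final per iteration
                    futures.add(pool.submit(task));
                }
                List<String> values = new ArrayList<>();
                for (Future<String> f : futures) {
                    values.add(f.get()); // coalesce results, preserving key order
                }
                return values;
            } finally {
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(multiFetch(List.of("meter-1", "meter-2", "meter-3")));
        }
    }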



    Having had experience with the Java client, I incorrectly assumed that the official Erlang client had a similar implementation, but if you check out the source you’ll notice it doesn’t support MultiFetch. I did a bit of archeology and found a lot of posts with questions and requests around implementing multifetch in the Riak Erlang client. Most of these posts point the user towards using MapReduce. The most useful thread I could find on the subject can be found here; not surprisingly, it is entitled multi-get-yet-again!


    MapReduce in Riak

    Implementing MultiFetch in Erlang wouldn’t be too difficult, but several users reported very good performance using the MapReduce approach, with two caveats:

    1. I heard MapReduce in Riak is slow (hearsay etc…).
    2. MapReduce queries in Riak clusters are run with R=1.

    Unfortunately the latter is a serious problem and I would like to see it addressed but for now let’s disregard this as it’s outside the scope of the discussion. It’s fine, take him outside and show him the pool, get him a cookie, he’ll be fiiiiiiine, etc….

    The MapReduce docs on Basho’s website are pretty good, but there’s a lot of material to sift through in order to find the most relevant pieces of information to get started quickly. After doing so, though, I’m pleased to say using Erlang MapReduce queries with Riak is quite easy, and there are really only two important pieces of information you need to know to get started.

    1. Riak has built-in Erlang MapReduce functions and you can use these to address many common use cases. You should learn how to use these first.
    2. You can write custom Erlang MapReduce functions but you need to compile and distribute the object code to all riak nodes.

    As noted in the docs, the basic MapReduce function riakc_pb_socket:mapred/3 takes a client, a list of {Bucket, Key} tuples as input, and a list of queries. Let’s dig into the query a bit more; it looks like the following:

    {Type, FunTerm, Arg, Keep}
    
    Type - an atom, either map or reduce
    FunTerm - a tuple:
      for built-in functions use {modfun, Module, Function}
      for custom functions use {qfun, Fun}
    Arg - a static argument (any Erlang term) passed to each execution of the phase
    Keep - true | false - whether to include this phase's results in the final value of the query
    

    The examples in the documentation focus heavily on writing your own qfun queries, though as I mentioned you can’t just use qfun without some upfront work, as the documentation notes:

    [Screenshot: Riak MapReduce documentation on qfun requirements]

    In addition, there is another paragraph, in the section called “A MapReduce Challenge”, that states:

    [Screenshot: documentation excerpt from “A MapReduce Challenge”]

    In summary, if you want to write custom MapReduce queries in Erlang you need to compile and distribute your code to the Riak nodes. I’ve gotten so comfortable using erl as a REPL that I glossed over this and assumed I could simply pass function references and they’d be evaluated. If you don’t take the time to read and fully understand the documentation, you might skim past those qfun requirements and just start writing your own custom queries, like me and this guy. Combine that with the fact that qfun MapReduce error messages are generally quite opaque, and that can lead to a bit of frustration when getting started.

    I’d prefer the documentation break out the difference between built-in and qfun queries more clearly and focus on modfun examples initially, with a separate qfun section, preferably with a big red callout yelling “Hey Dummy, don’t try this yet”. The JavaScript MapReduce API doesn’t suffer from this limitation, of course, because it’s JavaScript and is interpreted via the SpiderMonkey JS engine that ships with Riak. Perhaps that, and the recent popularity of JavaScript, is why it is given much more attention in the docs.


    Simulating MultiFetch with Built-In MapReduce Queries

    So, back to the point: it’s best we understand the built-in queries before we go any further. Here’s a quick walk-through of the default map functions that are provided.

    map_identity - returns a list of the riak_object for each bucket/key
    map_object_value - returns a list of the values stored at each key (calls riak_object:get_value(RiakObject))
    map_object_value_list - calls riak_object:get_value(RiakObject), assumes get_value returns a list, and returns the merged list
    

    There are reduce phases as well, but to achieve MultiFetch-like capabilities we only need to concern ourselves with the map_object_value map function. We can achieve our original MultiFetch use case by substituting a single mapred call that uses map_object_value for the individual per-key GET requests, roughly as sketched below.
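    A rough sketch of that substitution, with hypothetical Pid, Bucket, and Keys bindings and with error handling omitted:

    %% Before: one GET request (and one round trip) per key.
    Values = [begin
                  {ok, Obj} = riakc_pb_socket:get(Pid, Bucket, Key),
                  riakc_obj:get_value(Obj)
              end || Key <- Keys].

    %% After: a single MapReduce request over the whole key list, using the
    %% built-in map_object_value function.
    {ok, Results} = riakc_pb_socket:mapred(
        Pid,
        [{Bucket, Key} || Key <- Keys],
        [{map, {modfun, riak_kv_mapreduce, map_object_value}, none, true}]).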

    As expected, a quick set of tests against the production cluster shows we’ve reduced the query time from 2–4 seconds down to an acceptable (albeit not blazingly fast) average of approximately 115 milliseconds.


    Comparing to MultiFetch in Java

    These results of course got me thinking about how Erlang mapred would perform compared to MultiFetch in Java on the JVM, so I decided it was worth gathering some data. I constructed a test for 20, 200, and 2000 keys (this is not a benchmark) and ran each of the three tests 100 times, gathered samples, and calculated the average and variance. I ran the tests on a server in the same data center and on the same broadcast domain as the Riak cluster. As expected, MultiFetch outperformed mapred, and the latency of MultiFetch (as noted by Sean Cribbs and the Riak documentation) was more predictable.

    Response time in ms where network latency ranges between 0.1 – 0.4ms

    As the number of keys increases by orders of magnitude, query response time becomes less predictable with both approaches, though MapReduce’s variance is greater. Many raw samples with MapReduce fell within ~600ms, but there were also several samples between ~900ms and ~1400ms.


    When might MapReduce be faster?

    This had me wondering: are there any situations where MapReduce might be preferable to MultiFetch, or should I always just use MultiFetch? That seems to be the prevailing sentiment; MultiFetch is what most clients implement, and even Basho sometimes seems reticent about suggesting the use of MapReduce. I decided to run the same set of tests, but this time I ran them from Metermgr running locally on my laptop, connecting to the production Riak cluster over the VPN.

    Response time in ms where network latency ranges between 100 – 300ms

    While the results are somewhat expected, they are interesting nonetheless. Initially, with a key set of 20, MultiFetch overcomes the added network latency and outperforms MapReduce, but as the key set grows by an order of magnitude the average MapReduce query time outperforms MultiFetch by a factor of 2. Variance remains less predictable with MapReduce, because adding network latency doesn’t affect the variance we already experienced at sub-millisecond latency.

    We all know situating your application servers near your database is important for performance, but in an age of “hosted this and that”, “PaaS”, and “DBaaS”, as a developer you may end up using a database or service where network latency becomes a factor. In the above example, with the MultiFetch approach network latency is compounded as the input set grows, whereas MapReduce takes that hit only once, hence the improved average response time.

    I would of course be remiss if I didn’t mention that Boundary is an exemplary tool for monitoring the performance of these different techniques, providing 1-second resolution of average response time for Riak Protocol Buffer queries, whether they run within the same data center or across the internet.

    Where to go from here?

    Well, I’ve got a solution for my performance problem that meets my near-term needs. I’m interested in digging into alternative clients and seeing if a MultiFetch implementation for Riak exists in Erlang; if I don’t find one I like, I will write my own. I also believe it’s incorrect to say “MapReduce in Riak is slow”; in fact, under certain input constraints and configurations it is not only acceptable, it is preferable to the MultiFetch approach, provided latency predictability is not too much of a factor. The problem is more nuanced than “should I use MapReduce”, and it’s more abstract than MapReduce and Riak. It is about read techniques and their performance within the constraints of a distributed system. There are problems and there are tools; we need to pick the right tool for the problem at hand.

    I’m looking forward to digging into more custom Erlang queries and can already envision situations where Riak MapReduce might be favorable. Finally, if you’re using Riak but haven’t dug into custom MapReduce queries because you’re not comfortable with Erlang, then it’s about time you learn you some.

    Special thanks to @pkwarren for peer review; without his grammatical support this post would be unreadable.


  • Web-Scale IT – “I know it when I see it…”

    Posted by on May 27th, 2014

    Recently at Boundary, we’ve been talking a lot about “Web-Scale IT”.  One of the first questions we usually get is,  “What exactly is Web-Scale IT?”  Cameron Haight of Gartner first coined this term in a 2013 blog  and said,  “What is web-scale IT?  It’s our effort to describe all of the things happening at large cloud services firms such as Google, Amazon, Rackspace, Netflix, Facebook, etc., that enables them to achieve extreme levels of service delivery as compared to many of their enterprise counterparts.”

    But when we answer this, we are tempted to fall back on cliché. In a famous opinion offered by Justice Potter Stewart in the 1964 case of Jacobellis v. Ohio, Stewart wrote:

    “I shall not today attempt further to define the kinds of material I understand to be (pornography)…But I know it when I see it…”

    That’s how we feel about Web-Scale IT: we have a hard time defining it, but we know it when we see it!

    We see it when we walk into an enterprise and hear more about the cloud than the datacenter.  We see it where release cycles are measured in weeks versus quarters.  We see it when tools like Chef are used for deployment.  We see it when we are talking to the head of DevOps.  Where there are sprints but not waterfalls.  Where the team is talking about continuous deployment, provisioning instances and open source components instead of next year’s release, hardware acquisition, and packaged software. When we see these things, we know we are seeing Web-Scale IT happening.

    The funny thing is, we see Web-Scale IT everywhere we look. From the newest start-ups to the most conservative enterprises. Web-Scale IT is not just for the Amazons, Googles, and Netflixes of the world. We see it at Fortune 500 insurance companies, health care companies, and manufacturers. At media companies, SaaS start-ups, and service providers. In enterprises of every shape, size, and flavor.

    Gene Kim, commenting on the adoption of DevOps in the enterprise, recently wrote in the CIO Journal,

    “The important question is why are they embracing something as radical as DevOps, especially given the conservative nature of so many enterprises? I believe it is because the business value of adopting DevOps work patterns is even larger than we thought. And those not transforming their IT organizations risk being left behind, missing out on one of the most disruptive and innovative periods in technology.”

    We couldn’t agree more. The confluence of Cloud, DevOps, Open Source, and competitive pressure has put us at a crossroads in the history of Information Technology. Web-Scale IT lets us build better applications, faster. It lets us change them more quickly. And it lets us scale them more cost-effectively and with greater agility.

    There is no doubt in our mind that Web-Scale IT is here to stay.  But Web-Scale IT is not without its challenges.  One of these challenges is ensuring high levels of service quality and delivery.  Boundary’s customers are some of the leading adopters of Web-Scale IT, whether they call it that or not.  We are excited to provide them a critical service  that helps them  successfully cope with the challenges of operating in this new, compelling environment, allowing them to anticipate and solve problems faster, and to keep up with the pace of application and infrastructure changes that are typical of Web-Scale implementations.

    So while it might not be easy to define Web-Scale IT, we know it when we see it, we are seeing it everywhere, and we are doing our best in helping our customers to make it deliver on its huge promise.


  • A “Quantum Theory” of IT Monitoring

    Posted by on May 20th, 2014

    There are certain things, which are true in the quantum world, but just make no sense in our reality.  I remember in a college advanced physics course, having to calculate the likelihood that a baseball thrown at a window will pass through, emerge on the other side and leave both the ball and the window intact, due to quantum effects and tunneling.  I was astonished to see that while the odds of this happening are infinitesimally small, they are not zero.  Never mind the fact that you’d have to continuously throw the ball at the window, not accounting for breakage, for longer than the universe has existed to even have a remote chance of observing this, the odds are not zero and can be calculated.   And at the sub-atomic level, not the physical object one, this  type of behavior isn’t just common, it is expected.  This small fact has stuck with me for decades as a great illustration of how odd the quantum world truly is.

    What then does that possibly have to do with IT Monitoring?  It might be a stretch, but I think the new world of applications, which we call Web-Scale, is in some ways as strange to traditional monitoring products as the world of Quantum behavior is to baseballs, windows and normal humans.

    Let me explain. In the past, we built applications that were not quite so sensitive to small changes in infrastructure performance, for two main reasons. First, our users had very low expectations. From batch, to time sharing, to PC networks, to early web applications, we became accustomed to waiting for a screen to advance, an hourglass to spin, a web page to update. But somewhere along the way in the last couple of years, our expectations changed. Movies stink when they stall, missed stock quotes can cost us real money, and we voraciously hang on our phones and tablets for real-time updates of everything from sporting events to natural disasters, to pictures and updates from loved ones, to new orders from customers.

    Second, we just got tired of the standard practice of over-provisioning data centers for peak loads, running at 50% capacity or less to ensure performance.  Despite falling hardware costs, our appetites for data and applications just kept growing.  So we virtualized everything, and when we tapped out the efficiency there, just like we stopped building power plants at office buildings decades ago, we went to the cloud, where we could “scale” on demand, and share the economies of scale of computing experts.

    Yet while the entire infrastructure changed, and the costs of performance delays and degradations increased, we happily kept monitoring things every five minutes or so, or even every hour, checking for the same things we used to: capacity, resource utilization, and the like. Yet today users scream and customers leave over 5-second delays. Outages of streaming information cost us money. The “quantum” of time we care about has shrunk dramatically to match the needs of the new application infrastructure, applications, and user expectations. We live in a real-time world, yet we continue to monitor our last architecture.

    Which brings me to another engineering theorem deep in my memory: the Nyquist–Shannon sampling theorem, which in its simplest form says that in order not to lose information, the frequency you sample at needs to be at least 2x the frequency of the event you want to capture. Any slower, and your reconstructed signal suffers from “aliasing”, or loss of information.
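    In symbols, and plugging in the rough numbers from the next paragraph purely as an illustration:

    f_s \ge 2\, f_{\max}

    With f_max around 0.5 Hz (a change that plays out over roughly two seconds), that works out to f_s of at least 1 Hz, i.e. at least one sample per second.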

    Today’s Web-Scale IT architectures and demanding users care about changes and delays that last a few seconds, sometimes even less. If our quantum of caring is now measured in a second or two, then Nyquist, and common sense, say we had better be capturing and processing monitoring data every second or so as well.

    Last-generation IT monitoring solutions simply CAN’T capture and process data fast enough. They can stare all day at the baseball, but it will never tunnel through the window. But unlike our quantum baseball example, the slow sampling of infrastructure monitoring data leaves us blind to things that we actually care about: stalled video, missed quotes, lost business opportunities, service delays and outages that cost us money.

    Our new math of IT monitoring needs to measure in seconds; it’s as plain and simple to see as the shattered window that I am staring at right now.


  • Dynamic Tuple Performance On the JVM

    Posted by on May 15th, 2014

    There are lots of good things about working on the JVM, like the HotSpot JIT, operating system threads, and a parallel garbage collector. However, one limiting factor can often be the interaction between primitive types and reference types in Java. Primitive types are the built-in types that represent integral numbers, floating point numbers, and boolean yes/no values. Primitives are memory efficient: they get allocated either on the stack if they’re being used in a method, or inlined in an object when they’re declared as field members. They also wind up being fast because the JIT can often optimize their access down to a single CPU instruction. This works really well when you know what types a class will hold as its state beforehand. If, on the other hand, you don’t know what an object or array will hold at compile time, the JVM forces you to box primitives. Boxing means that the primitives get wrapped in a heap-allocated object, and their container will hold a reference to them. That type of overhead winds up being inefficient both in access time and memory space. Access time suffers because this type of layout breaks locality of reference. The extra allocations and garbage generated also put pressure on the JVM’s garbage collector, which can often be a cause of long pause times.

    We wrote FastTuple to try and help solve this problem. FastTuple generates heterogeneous collections of primitive values and ensures as best it can that they will be laid out adjacently in memory. The individual values in the tuple can be accessed from a statically bound interface, via an indexed accessor, or via reflective or other dynamic invocation techniques. FastTuple is designed to deal with a large number of tuples, so it will also attempt to pool tuples such that they do not add significantly to the GC load of a system. FastTuple is also capable of allocating the tuple value storage entirely off-heap, using Java’s direct memory capabilities.

    FastTuple pulls off its trick via runtime bytecode generation. The user supplies it with a schema of field names and types. That schema is then built into a Java class definition which will contain accessor methods and either field definitions or the memory address for an off heap allocation, depending on which storage method was requested. The resulting Java class gets compiled into bytecode and then loaded as a reflective Class object. This Class object can then be used to create instances of the new class.

    Performance

    To understand the performance of FastTuple it’s necessary to have a good understanding of the relative cost of things on the JVM. To that end we wrote a microbenchmark in FastTuple to demonstrate the relative cost of writing and then reading several fields on a container, whether that be a Java object, an array, or a List. The code can be found here, and we’d love for you to run it on your own. The timings shown here are from a late 2013 MacBook Pro with a 2.6GHz Intel Core i7 running the 1.8.0_05-b13 build of Java 8.

    public long testClass() {
        Container container = new Container(0, 0, (short)0);
        container.a = 100;
        container.b = 200;
        container.c = 300;
        return container.a + container.b + container.c;
    }
    c.b.t.AccessMethodBenchmark.testClass     thrpt     1676855.039 ops/ms
    
    public long testLongArray() {
        long[] longs = new long[3];
        longs[0] = 100L;
        longs[1] = 200;
        longs[2] = (short)300;
        return longs[0] + longs[1] + longs[2];
    }
    c.b.t.AccessMethodBenchmark.testLongArray thrpt     1691027.650 ops/ms

    This is our baseline. If there’s a way to write to memory faster than this in Java, I don’t know of it. When we look at the assembly that eventually gets emitted by the JIT, it looks like this:

      0x000000010524e482: mov    0x60(%r15),%rax
      0x000000010524e486: lea    0x20(%rax),%rdi
      0x000000010524e48a: cmp    0x70(%r15),%rdi
      0x000000010524e48e: ja     0x000000010524e508
      0x000000010524e494: mov    %rdi,0x60(%r15)
      0x000000010524e498: mov    0xa8(%rdx),%rcx
      0x000000010524e49f: mov    %rcx,(%rax)
      0x000000010524e4a2: mov    %rdx,%rcx
      0x000000010524e4a5: shr    $0x3,%rcx
      0x000000010524e4a9: mov    %ecx,0x8(%rax)
      0x000000010524e4ac: xor    %rcx,%rcx
      0x000000010524e4af: mov    %ecx,0xc(%rax)
      0x000000010524e4b2: xor    %rcx,%rcx
      0x000000010524e4b5: mov    %rcx,0x10(%rax)
      0x000000010524e4b9: mov    %rcx,0x18(%rax)    ;*new  
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@0 (line 160)
    
      0x000000010524e4bd: movabs $0x64,%r10
      0x000000010524e4c7: mov    %r10,0x10(%rax)    ;*putfield a
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@15 (line 161)
    
      0x000000010524e4cb: movl   $0xc8,0xc(%rax)    ;*putfield b
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@22 (line 162)
    
      0x000000010524e4d2: mov    $0x12c,%esi
      0x000000010524e4d7: mov    %si,0x18(%rax)     ;*putfield c
    ; - com.boundary.tuple.AccessMethodBenchmark::testClass@29 (line 163)
    
      0x000000010524e4db: movabs $0x258,%rax
      0x000000010524e4e5: add    $0x70,%rsp
      0x000000010524e4e9: pop    %rbp
      0x000000010524e4ea: test   %eax,-0x225a3f0(%rip)        # 0x0000000102ff4100
                                                    ;   {poll_return}
      0x000000010524e4f0: retq

    The assembly is helpfully annotated with the corresponding Java source in comments. The preamble is taking care of the allocation, the actual field writing takes only a handful of instructions, and then it cheats on the return side, moving the value 600 directly into RAX as the return value. The assembly emitted for testLongArray is almost identical, except it doesn’t cheat at computing the return value.

    Next down on the performance ladder is manipulating off-heap memory using a Sun JVM built-in class called Unsafe.

    public long testOffheapDirectSet() {
        unsafe.putLong(record2 + 0L, 100);
        unsafe.putInt(record2 + 8L, 200);
        unsafe.putShort(record2 + 12L, (short)300);
        return unsafe.getLong(record2 + 0L) + unsafe.getInt(record2 + 8L) + 
               unsafe.getShort(record2 + 12L);
    }
    testOffheapDirectSet         thrpt      948934.710 ops/ms
    
    public long testOffheapAllocateAndSet() {
        long record = unsafe.allocateMemory(8 + 4 + 2);
        unsafe.putLong(record, 100);
        unsafe.putInt(record+8, 200);
        unsafe.putShort(record+12, (short)300);
        long r = unsafe.getLong(record) + unsafe.getInt(record+8) + 
                 unsafe.getShort(record+12);
        unsafe.freeMemory(record);
        return r;
    }
    testOffheapAllocateAndSet    thrpt        7604.148 ops/ms

     

    So what’s going on here? In the first test all we’re doing is setting the memory for our three “fields” in a chunk of memory that’s been allocated outside of the benchmark. In the second test we’re doing the actual allocation in addition to setting the memory. The performance disparity can be explained by the way in which Unsafe is implemented: everything in Unsafe is native C++, but some of the methods are what are known as intrinsics. On the JVM an intrinsic is more or less a macro that will get replaced with inlined assembly. This allows for native and potentially unsafe operations without the substantial overhead of making a JNI call.

    Unfortunately, Unsafe.allocateMemory is not an intrinsic, so it incurs the full overhead of a JNI call. This explains the performance disparity between testOffheapAllocateAndSet and testOffheapDirectSet. The performance difference between testOffheapDirectSet and bare field manipulation, however, is a bit more subtle. It’s true that the calls to putLong and friends get inlined, but the JIT cannot optimize them to the same degree as the raw Java code.

      0x0000000110b3ba29: and    $0x1ff8,%edi
      0x0000000110b3ba2f: cmp    $0x0,%edi
      0x0000000110b3ba32: je     0x0000000110b3baaa  ;*getstatic unsafe
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@0 (line 136)
    
      0x0000000110b3ba38: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@4 (line 136)
    
      0x0000000110b3ba3c: movabs $0x64,%rdi
      0x0000000110b3ba46: mov    %rdi,(%rax)
      0x0000000110b3ba49: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@19 (line 137)
    
      0x0000000110b3ba4d: movabs $0x8,%rdi
      0x0000000110b3ba57: add    %rdi,%rax
      0x0000000110b3ba5a: mov    $0xc8,%ebx
      0x0000000110b3ba5f: mov    %ebx,(%rax)
      0x0000000110b3ba61: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@36 (line 138)
    
      0x0000000110b3ba65: movabs $0xc,%rbx
      0x0000000110b3ba6f: add    %rbx,%rax
      0x0000000110b3ba72: mov    $0x12c,%edx
      0x0000000110b3ba77: mov    %dx,(%rax)
      0x0000000110b3ba7a: mov    0x10(%rsi),%rax    ;*getfield record2
    ; - com.boundary.tuple.AccessMethodBenchmark::testOffheapDirectSet@53 (line 139)
    
      0x0000000110b3ba7e: mov    (%rax),%rsi
      0x0000000110b3ba81: mov    %rax,%rdx
      0x0000000110b3ba84: add    %rdi,%rdx
      0x0000000110b3ba87: mov    (%rdx),%edi
      0x0000000110b3ba89: add    %rbx,%rax
      0x0000000110b3ba8c: movswl (%rax),%eax
      0x0000000110b3ba8f: movslq %edi,%rdi
      0x0000000110b3ba92: add    %rdi,%rsi
      0x0000000110b3ba95: movslq %eax,%rax
      0x0000000110b3ba98: add    %rax,%rsi
      0x0000000110b3ba9b: mov    %rsi,%rax
      0x0000000110b3ba9e: add    $0x50,%rsp
      0x0000000110b3baa2: pop    %rbp
      0x0000000110b3baa3: test   %eax,-0x2479a9(%rip)        # 0x00000001108f4100
                                                    ;   {poll_return}
      0x0000000110b3baa9: retq

    It’s unclear at this point which technique would win in a real life scenario. However, if you’re storing a massive amount of data off heap it is likely that the GC savings will more than pay for any performance degradation in accessing the data.

    The good news is that these tests appear to be our baseline for manipulating on heap and off heap memory within the JVM. Using these numbers we can reason about the overhead of the various layers of abstraction that are being introduced, and what the right tradeoff is for your application.

    Memory   Alloc      Access         Throughput (ops/ms)
    Direct   Allocate   N/A                 6956.274
    Direct   Deque      Indexed           146534.498
    Direct   Pool       Eval               49921.211
    Direct   Pool       Indexed            55483.808
    Direct   Pool       IndexedBoxed       36165.749
    Direct   Pool       Iface              55885.570
    Direct   Prealloc   Eval              314968.430
    Direct   Prealloc   Indexed           367886.412
    Direct   Prealloc   IndexedBoxed      102979.196
    Direct   Prealloc   Iface             347002.180
    Heap     Allocate   N/A               962680.613
    Heap     Deque      Indexed           170232.606
    Heap     Pool       Eval               49065.286
    Heap     Pool       Indexed            60376.541
    Heap     Pool       IndexedBoxed       38744.961
    Heap     Pool       Iface              60029.537
    Heap     Prealloc   Eval              392755.176
    Heap     Prealloc   EvalField         563205.486
    Heap     Prealloc   Indexed           509216.472
    Heap     Prealloc   IndexedBoxed      201109.726
    Heap     Prealloc   Iface             526641.511

    This table may look daunting, but it’s simply measuring the various combinations of features that can be used to access FastTuple instances and manipulate them. The key to decoding the results is as follows:

    • Direct – This means the tuple came from a TupleSchema configured to store data off heap.
    • Heap – The TupleSchema was configured for on heap allocation.
    • Allocate – The benchmark includes an allocation operation, both instance creation and any back allocation.
    • Deque – The tuple was taken from a simple j.u.ArrayDeque.
    • Pool – The tuple was taken from a TuplePool which involves both a j.u.ArrayDeque and a ThreadLocal lookup.
    • Prealloc – The tuple was passed in to the method preallocated.
    • IndexedBoxed – Access is via the boxed get and set methods in the FastTuple base class.
    • Indexed – Access is via the primitive getX and setX methods in the FastTuple base class.
    • Iface – Access is via an interface that the tuple was specified to implement.
    • Eval – Access is via an expression that was compiled into a dynamic class at runtime and then evaluated against the tuple.
    • EvalField – Only for the heap type. The expression is manipulating the tuple fields directly instead of calling accessor methods.

    With that in mind, what kind of conclusions can we draw from these benchmarks? For one, the indexed methods seem to do very well here. One thing to bear in mind, however, is that the indexes being given are specified as constants. In other words, we’re letting the program elide the step of figuring out the index, which might very well be expensive. I think it’s quite likely that in real world situations the best performance can be gleaned from using expressions to reify the runtime information about a tuple type into bytecode.

    Another conclusion is that parallelism has a real cost. As far as the processor is concerned there’s no substitute for having something sitting in a register ready to go. ThreadLocal is going to incur a lookup; behind the scenes there is a table mapping thread IDs to their particular ThreadLocal variables. This lookup has a cost, and it’s currently one of those things that the JIT can’t look past and elide. That’s why FastTuple is so configurable. In order to get the best performance in your situation you need a certain amount of flexibility about the lifecycle and access capabilities of these tuples. So give FastTuple a try, or better yet fork it and submit a patch.


  • Boundary Meter 2.0 – Build Methodology

    Posted by on March 24th, 2014

    In our earlier “Boundary Meter 2.0 – Foundations” post, Brent included a section of discussion on how we currently build our meter software. Boundary’s customers run a multitude of operating systems (and various versions of those operating systems) and several different CPU architectures, which requires effort and resources from Boundary’s engineering team to provide all the necessary meter variants. In addition to our current set of customers, the Boundary sales team perpetually engages with potential new customers, some of which have OS, OS version, and/or architecture requirements which we don’t currently have a meter build for.

    In this post, we’ll explore the meter building process at Boundary and how it satisfies our business requirements while not overburdening engineering team resources.

    Consistency

    Because Boundary’s meter is written in C, it requires proper compilation and linking to create a working executable for each platform we support.  In this situation, one approach for supporting multiple OSes is to use the usual/native development toolchain for each OS.  But this approach requires familiarity with each dev toolchain being used and can be difficult to automate in a consistent fashion across all build environments.  And certainly this approach can grow more complex over time as new OSes are added.

    To accommodate our need to support an ever-changing list of OSes and architectures, the Boundary meter build centers around the GNU build tools (a.k.a. Autotools). This set of tools provides a big value-add by allowing us to maintain a common, consistent build methodology across all meter builds. For any particular meter we want to build, it’s as simple as navigating to the top-level source directory and running “make” to crank a new meter out. Even building the meter for a Microsoft Windows client works in the same fashion! This approach also lends itself well to consistent automation.

    Speaking of our Windows client meter build, certain builds will require some additional effort to set up appropriate cross-compilers or emulated environments. We currently use an Ubuntu (amd64) system to execute our Windows client meter builds as well as the meter builds for ARM architectures. MinGW provides a cross-compiler that allows us to use our Ubuntu build system for creating a proper Microsoft Windows executable meter binary. For our ARM-architecture armel and armhf builds (which we currently support for both Ubuntu’s Precise Pangolin and Debian’s Wheezy releases), we leverage the machine-emulation capability of QEMU, making it possible to use the native (not cross-compiled!) ARM toolchain for compiling and linking the meter software. Additionally, we take advantage of the pdebuild utility for our Debian and Ubuntu builds (including both ARM builds), which simplifies download and installation of the appropriate toolchain and environment.

    Keep it Simple

    In many situations, it requires some level of additional effort to make aspects of a project “simple to use and maintain” as opposed to just “cranking it out” or “making it work”. Sometimes just a bit of up-front planning and thinking is the only additional effort required. And sometimes it does pay off to “go for simple”!

    One technology we’ve utilized to simplify our meter software builds is virtual machines, specifically QEMU (with KVM). This allows us to have one physical server with multiple guest OS VMs. In our case, an Ubuntu (amd64) system is the host OS on our physical build server, and we run virtual instances of FreeBSD, Ubuntu, SmartOS, OpenSUSE, and Gentoo for handling our builds. This leaves us with only one physical system to admin/maintain, while providing us the flexibility to easily add additional OSes (i.e. creating new VMs) when needed.

    Keeping our number of build VMs to a minimum is another way we’ve simplified.  Because our meter is statically linked with many of the libraries it requires, the meter built under Ubuntu will execute correctly on a number of other Linux distributions we support.  To satisfy our customers who use Red Hat Enterprise Linux (RHEL), CentOS, and Fedora, we offer a Boundary meter RPM package file which is actually created within our Ubuntu build VM using the mock utility.  This approach removes the need for an additional build VM of RHEL (or CentOS or Fedora) to create RPM packages.

    Package Accordingly

    In order to meet our customers’ expectations for proper installation, configuration, and removal of Boundary meter software, we provide an appropriate meter package for each platform we support. This means providing a Debian package for Ubuntu and Debian Linux distributions, an RPM package for Red Hat, CentOS, Fedora, and OpenSUSE Linux distributions, a properly-signed Windows MSI file for Microsoft Windows, etc.

    While the majority of Boundary’s meter packages are created using the usual, well-documented methods, the process we use for packaging the Microsoft Windows meter is interesting in that it’s done entirely under Linux! Using WINE on our Linux build VM, we can execute the Windows binaries required to create a nice, properly-signed MSI file. Some of the Windows binaries we use include:

    • Microsoft’s HTML Help compiler for creating the usual “compiled help” (.chm) file for Windows
    • WiX for creating the MSI file (we also use wixwine‘s wrappers of the WiX binaries to simplify our execution of those binaries via WINE)

    Once we have an MSI file, we give it a valid signature using the OpenSSL-based signcode utility (a.k.a. osslsigncode). Now we have a valid package for installation on Microsoft Windows!

    Automate

    For companies with limited engineering resources, automation can be a very valuable player in the build process.  Considering the many combinations of OS/version/architecture our meter software must support and provide packages for, we have build-and-package scripts that are used by Jenkins, a continuous integration tool, to help keep things manageable for our engineering team.

    Our build-and-package scripts provide a simple, command-line mechanism to build the meter software and create the associated package file.  This makes it easy for an engineer or QA team member to generate a meter package (while reducing the potential for mistakes).

    Giving Jenkins the ability to use our build-and-package scripts is where things get interesting. We define a handful of “jobs” in Jenkins, and these tell Jenkins how to invoke the build-and-package scripts (along with which build VM(s) are valid for Jenkins to log in to and use when building a particular meter and package). Once created, these Jenkins jobs can be scheduled to run automatically and can also be started with one click in a web browser, making it easy for folks outside of engineering and QA to generate a meter package. We also create a top-level job which simply invokes all the meter jobs, making it easy to build and package all supported meters with a single click.

    Additionally, to help keep the number of jobs manageable, we utilize the “Configuration Matrix” (with “Combination Filter”) job feature of Jenkins.  This allows us to use a single Jenkins job for executing many different meter builds (as opposed to creating many separate jobs to cover all these builds individually).  Here’s what the “Configuration Matrix” looks like for our Debian/Ubuntu build job, reflecting the 12 valid (blue dots) distribution and architecture combinations this Jenkins job will handle:

    [Screenshot: Jenkins Configuration Matrix for the Debian/Ubuntu build job]

    As you can see, this job will build Boundary meter packages (both x86 32-bit and 64-bit) for five Debian and Ubuntu distros, as well as armel and armhf packages for Debian’s Wheezy and Ubuntu’s Precise distros.  Those combinations with a gray dot are not supported and will not be built by this job.

    Concluding Thoughts

    We’ve discussed a number of build-related strategies and practices the Boundary engineering team employs in the meter software development and release cycles.  While these practices usually involve some small level of effort (e.g. adding Autotools support for a new build, adding packaging creation for an OS other than the host OS, automating tasks/builds), they definitely facilitate our ability to enhance and maintain the Boundary meter software in a way which satisfies the business requirements (and customer expectations) while not overburdening engineering’s resources.


  • Boundary Meter 2.0.3 adds STUN-ing new features!

    Posted by on March 17th, 2014

    Boundary Meter 2.0.3 was just released and now includes STUN support. By using STUN, the Boundary Meter can automatically discover its public IP address, even when it’s behind a firewall or NAT device.

    Once Boundary knows the public IP address, it can use it to correlate public and private network flows. For instance, if two servers connect to each other via a proxy, Boundary can use the public IP information to assemble the two independent flows on either side of the proxy into a single conversation.

    Ultimately, this provides deeper insight into how servers and virtual instances are communicating. This is always helpful when troubleshooting performance problems. Below is a screenshot of a meter with both public and private IP addresses.

    [Screenshot: meter view showing both public and private IP addresses]

    In addition to STUN support, Boundary Meter 2.0.3 also includes the following highly valuable enhancements:

    • Ability to enable promiscuous mode for packet capture
    • Option to disable the built-in NTP client
    • Support for running the meter on Linux Mint

    See the release notes for more information about these features and a full list of bug fixes.


  • Introducing the new Boundary User Interface

    Posted by on February 27th, 2014

    Our sole focus at Boundary is to make it easier for IT Ops to troubleshoot and resolve problems. When we redesigned the Boundary User Interface (UI) our goal was to make it easier and faster for IT Ops to find the information needed to resolve IT outages and diagnose performance problems.

    One of the first improvements you’ll notice in the UI is the new navigation model based around the “filter bar.” The filter bar lets users set the “time range” and “source” that is used to filter the data shown in the view. Below is a screen shot of the new UI and its three main components: the “filter bar” highlighted in red, the “view” highlighted in yellow and the “navigation bar” highlighted in orange.

    [Screenshot: the new Boundary UI, with the filter bar highlighted in red, the view in yellow, and the navigation bar in orange]

    The filter bar is especially helpful when a user is trying to investigate a problem. Typically, a user will start with the events view to see the event details for the specific time range related to a group of servers or virtual machines. Once a user has an understanding of those events they can quickly move to the streams view and start examining the flow data statistics to understand how the network is being impacted. With Boundary’s new navigation model, the process is simple because the filter bar preserves the troubleshooting context as users move around the application. Hopefully, the usability flow is so natural that you don’t even notice it as you move around Boundary.

    One other subtle but important change in the UI was moving the navigation bar to a vertical column on the left side (see the orange box above). Several of our beta users said they needed more vertical space when they worked on small monitors. By making the navigation bar vertical, we were able to free up precious vertical space, giving users a much better experience on small monitors. I want to thank all of our beta users, as feedback like this really helps us improve Boundary. For more release details, visit our what’s new page.

    We hope the new Boundary UI lets you find and resolve problems faster and we would really appreciate your feedback. Please send any comments or questions directly to brandon@boundary.com.


  • Wattpad uses Boundary and AWS to help scale storytelling platform

    Posted by on February 20th, 2014

    Growing social platform anticipates performance issues and fixes problems faster with Boundary.

     

    Mountain View, California — February 20, 2014 Boundary announces that Wattpad, the world’s largest community of readers and writers, is using its service to monitor and improve performance on its Amazon cloud-hosted infrastructure. In the last two years, Wattpad has experienced explosive growth with 20 million people joining the community. “Our tremendous growth stems from the fact that we’re offering readers and writers something they’ve never had before—a direct connection with each other,” says Charles Chan, head of engineering at Wattpad.

    With marketplace traction, however, Wattpad needed comprehensive strategies for performance. The company hosts its website 100% on AWS public cloud, and it needs early insight into anomalies. While Wattpad deploys industry best practices including switching between different AWS zones for optimal reliability, the engineering team is always looking for better visibility into system hotspots and unplanned downtime.

    Wattpad uses several tools for infrastructure monitoring, yet the company didn’t have a consistent method of tracking network bandwidth usage and traffic patterns. The engineering team began hunting for a new toolset, and determined that Boundary could fill the gap. “With Boundary, we would have been able to pinpoint performance issues much faster including determining if certain availability zones were in trouble.”

    Since deploying Boundary’s cloud-based consolidated operations management software, the company has found additional benefits beyond proactively monitoring uptime on AWS. Wattpad used Boundary to isolate an issue within the search application that caused a system outage.  Wattpad also used Boundary to identify possible breaking points in the website to prepare for 2013 holiday season peak traffic. “Boundary gives us an edge because we can constantly monitor the network traffic across all nodes within the system, and magnify issues that need to be handled quickly,” Chan says. “We can also anticipate the impact of changes as we scale and locate areas for optimizing the website. The Boundary staff worked with us over a period of several months to demo the system in our environment and they’ve always been there to help us quickly when we needed it.”

    “Wattpad’s business is defined by high volumes of daily users and constantly updated data streams and it depends on the scalability and flexibility it gets with AWS,” says Gary Read, CEO at Boundary. “Our cloud-based operations monitoring software is designed to monitor dynamic and always-changing environments like Wattpad. We are excited to help this innovative social platform grow and succeed around the world, and in the cloud.”

