## Cloud Environments: Dynamic, Sometimes Turbulent

Posted on May 20th, 2013

Cloud computing has become pervasive. And while some cloud environments are simple enough for casual consumers to use, the cloud has not done away with IT, as it was once widely expected to do. Instead, the cloud has taken center stage among the challenges facing IT operations and DevOps teams.

Cloud environments are dynamic (i.e. varied and highly fluid). Though people still speak of “the cloud,” there is a wide range of clouds built around differing platforms and supported languages. These clouds may be public, private, or hybrid, each with its own distinct constraints. Clouds may host applications, data, or supporting infrastructure. Even the control systems for managing cloud environments can be hosted in the cloud.

Everything as a Service

Strictly speaking, cloud computing in and of itself is not a technology model. It is a business model, offering capabilities for rent rather than purchase. Call it Everything-as-a-Service (EaaS). This EaaS business capability is itself built on top of two technologies, the Web and virtualization.

Virtualized environments are inherently fluid and dynamic, even when locally hosted. Another layer of fluidity is added by the Web, which poses problems of network behavior, including latency issues, that are not under the full control of the user.

The infrastructure layer of the environment is particularly challenging from an operational and DevOps perspective, because new-generation applications interact with their environments in much more complex ways than older applications did. IT operations must thus be able to manage cloud infrastructure alongside monitoring stacks ranging from Cacti and Graphite to Nagios.

The Challenge of Managing EaaS

While the cloud itself is not a technology model, it adds layers of dynamism – and resulting complexity challenges – that IT operations and DevOps teams must work with. These teams do not “own” the environments in which their applications, data, or other operations swim.

Just to enrich the challenge, many of these teams are themselves part of EaaS: They are serving their end-user customers from the cloud. Those end users don’t want to hear about cloud complexities and challenges. They just want their services to run smoothly from the sky.

All of this means that managing cloud environments places heavy demands on IT operations and DevOps teams. Because the environments are dynamic, their behavior needs to be tracked in real time. Instant notification of IT events is required, so that issues can be dealt with before they turn into problems. Control systems must work smoothly in environments provided by AWS, VMware, Joyent, Rackspace, and other cloud vendors.

Legacy IT control systems struggle to handle these fast-moving, complex – and external – environments. What IT needs are control systems adapted to the current-era management stack that provide real-time context about complex environments without interfering with them. The good news is that a new generation of IT operations tools is now available that meets these standards and can handle even turbulence in the clouds.


## Dennis Callaghan on the Changing World of IT Monitoring

Posted on May 17th, 2013


Dennis Callaghan is a senior analyst on 451 Research’s Infrastructure Computing for the Enterprise (ICE) team, leading the firm’s coverage of application and Internet performance management, service-level monitoring and management, and IT asset and service management. Follow Dennis on Twitter at @DennisCallaghan.

What are the biggest challenges for small and large companies today in the area of IT operations management?

From the research we do, whether it’s about product trends or talking to customers, two things stand out. First, the nature of distributed IT environments means that you need a good discovery plan in place to find out what infrastructure underpins the applications and where that infrastructure is located. Then, whether the application is running in traditional environments or in the cloud, you need to diagnose the performance issue and its impact on the end user―what they are experiencing.

What performance metrics are companies most interested in tracking these days?

Response time is a key measure. How long do you have to wait to get a response or result from doing a transaction or a task? It’s always important to look at backend metrics like CPU and I/O but mostly, you need to understand how users are affected. And if you’re running in the cloud, you really need to understand latency.

With the cloud and distributed environments, have these metrics changed or increased in number?

The metrics themselves haven’t changed, but it’s more that the infrastructure is different and the environment is more complex. A decade or so ago, companies were primarily looking at the app server, the Java server. Now, they are also looking at network service levels, database performance, the storage array, the web server and they also need to get some decent end user metrics at the browser level. So there are a lot of different areas that people are looking at to get a more complete view of app performance, and when there’s an issue, they have to be able to triage it and figure out how to best remedy it. The reason behind this drive for a more comprehensive view is partly the demand for instant results. As an example, Google search now serves up results when you start typing a word into the search box. Yet the larger issue is that you have so many different systems interfacing with each other today. A few weeks ago, an IT end user  told me that he has several different systems that are exchanging process flows and he has no way to monitor the transaction performance between those systems, just within them. That is a challenge a lot of people are dealing with.

How are the tools evolving to meet these needs – and do you see a continued need for many tools, a.k.a. a best-of-breed environment?

There are always going to be new issues for products to solve, and vendors tend to specialize in covering different layers of the infrastructure so having five or six tools is very common. Companies also don’t want to trust their environment to just one vendor. People often want an integrated console to see all this information in one place, but still, using multiple consoles is pretty common. There have been standardization drives to allow performance management tools from different vendors to work together, but they haven’t really gone anywhere. WSDM—Web Services Distributed Management—hasn’t been heard from in almost seven years now.

Are large companies switching away from their legacy APM and monitoring tools?

On a grand scale, no, not really. I think that happens when they are launching development projects with new technologies such as Ruby on Rails and PHP. But they are not throwing out systems from CA, BMC, HP or IBM―at least not yet. These one-off projects and proofs of concept with some of the newer vendors can certainly lead to that transition down the road.

Why not? Aren’t these tools poorly suited for new, distributed environments?

It has to do with the investment they’ve made in the legacy systems but also it has a lot to do with the back-end applications that the tools are monitoring. Believe it or not, mainframes are still very common especially in the Fortune 500. And the incumbent technologies are the best ones to monitor those environments. Yet the startups are beating incumbents on new technology clients that are running modern, distributed environments. For instance, AppDynamics (a Boundary partner) is the main management tool for Hotels.com and also has Netflix as a customer. So that’s how the market is being segmented right now.

What area of innovation is hot right now in the space of application and network monitoring?

We are very high on the SaaS model. The new vendors are all SaaS-based. It is just so much easier to get up and running and maintain the software that way. Boundary does a lot of big data handling that is offloaded to the cloud, which is another wonderful benefit for a customer. Lots of VC dollars are being poured into this space right now and we are only at the tip of the iceberg. The Big 4 have all introduced new SaaS offerings because they have to in order to compete, but they are going to be playing catch-up for a while.

## Boundary helps Netradar launch mobile performance app

Posted on May 16th, 2013

Netradar is a global mobile network measurement and analysis service and smartphone app designed by researchers at Aalto University in Finland. Netradar delivers data on mobile performance to users through its smartphone app (Android, iOS, Windows Phone, Symbian and MeeGo), which has more than 30,000 installs to date. Users can see an aggregated map of network throughput across different carriers and various statistics about devices and networks. Arttu Tervo, a developer with Netradar, rebuilt the server architecture in early 2013, primarily on Amazon Web Services, to prepare for growth in users and features.

How Boundary Is Helping

Network planning and potential cost reduction: Tervo deployed Boundary to help make design decisions as he was rewriting Netradar. “Using Boundary helped me detect potential issues in the system where there would be over-use of the network,” he observes. “Seeing that ahead of time was very important so that we can optimize our system and save on our service costs.” He estimates that Boundary helped cut the internal data transfer required to run the platform by 50 percent.

Development agility: Boundary is also helping the Netradar team with rapid development processes by providing a simple way to test the performance-readiness of new features in the staging environment. “With the help of automated Capistrano deployment events to Boundary, we can see the effect of even the smallest change on network usage immediately,” Tervo remarks. “It hasn’t been this easy in my previous projects.”

“The most important benefit of Boundary to Netradar so far is the ability to detect possible performance issues before we make any changes in the production system. As we grow and scale our service around the globe, we will benefit from Boundary’s trending views showing how much data we transferred over our customer base, which will help us plan better for our AWS usage.” — Arttu Tervo, developer with Netradar

## Approximate Heavy Hitters – The SpaceSaving Algorithm

Posted on May 14th, 2013

Here at Boundary we do analytics a little bit differently than the standard “store it and batch it” approach. We’ve built a streaming analytics system that processes data as it comes to us in real time. Given our first-hand experience, we think streaming algorithms are really cool and insanely practical. Fortunately, they’re heavily researched. Unfortunately, they’re also relatively new in the professional world, which means that when talking about these algorithms, the divide between the academic world and industry can feel pretty large. We thought we’d do our part to help bridge that gap by publishing a few blog posts about cool streaming algorithms we’ve come across.

## Defining the Problem

The problem that inspired this post can be stated pretty simply.

Given a single pass over an infinite stream of data, find the most frequently occurring items in the stream.

This is a pretty common problem, and one we encounter at Boundary. It turns out that the provably most efficient exact solution to this problem scales badly (we’ll come back to this). This is also a common enough problem that there’s a lot of research being done on trying to get around this restriction by finding an approximate solution. Approximation algorithms are awesome – they’re almost always far more space efficient or computationally efficient than the exact solution to the problem they’re trying to solve, in exchange for some (hopefully bounded) form of error in the solution. Depending on the problem and the kind of error it introduces, using an approximate algorithm can feel like getting something for nothing.

We’ll be looking at an approximation algorithm called SpaceSaving that tries to solve this problem in the minimum amount of space. It turns out, the algorithm is able to give you an approximate solution with bounded error (the bound on error is determined ahead of time) in a fixed amount of space.

That’s pretty awesome.

Graham Cormode and Marios Hadjieleftheriou published an excellent overview of some known approximate solutions to this problem in particular. If this blog post is even mildly interesting to you, I highly recommend this paper, and will be using some terminology from it during this post.

We need to spend a few minutes on definitions before diving in:

We’re going to assume that our infinite stream of data contains elements of some infinite set. We’ll call the set of items we’ve seen so far $A$. So, at any point in time, we’ve seen exactly $|A|$ distinct elements.

We’re going to use $N$ to refer to the length of the stream so far. This is the total number of items seen so far, not the total number of distinct items seen so far, so $N$ is always at least as large as $|A|$.

We also need a formal, mathematical definition of the problem. Finding the most frequently occurring items means finding items that occur more often than some threshold. This threshold is usually defined as some fraction of $N$ – this bounds the solution space relative to the size of the stream without forcing a solution to include exactly $k$ items (for some arbitrary choice of $k$). Given that, the problem can be formalized as finding all items in a stream whose frequency is greater than $pN$ for some $0 < p < 1$. The approximate version of the problem introduces some error $0 < \epsilon < 1$ and asks for all items with frequency greater than $(p - \epsilon)N$. In English, we’re trying to find items whose frequency accounts for more than a given percent of the stream, and we’re going to allow some error around what the actual percentage is. That is, if we specify that we’d like to see items that each account for at least 10% of the stream, and 2% error, it’s acceptable to include items that account for only 8% of the stream. We’ll borrow some nomenclature from the overview paper I mentioned above, and refer to this as “the frequent items problem”. Frequent items are also referred to in the literature as “heavy hitters”. We’ll throw that name around too.
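To put rough numbers on that example (the stream length here is made up purely for illustration): with $N = 1{,}000{,}000$, $p = 0.10$, and $\epsilon = 0.02$, an exact solution would report every item that occurs more than $pN = 100{,}000$ times, while the approximate version is also allowed to report items that occur as few as $(p - \epsilon)N = 80{,}000$ times.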

It’s pretty common for items in a stream to have some kind of explicit value attached. The canonical example is a stream of cash-register transactions, where the world’s unluckiest cashier has to record how much money the store made in an infinite number of transactions.  We can tweak our definition of the frequent items problem to cover this by letting items in a stream have non-negative weights. If you see an item $e_1$ with weight $w_1$, that’s basically equivalent to seeing item $e_1$ exactly $w_1$ times in a row. There’s more rigor you can apply, especially if you’d like to talk about negative weights, but we’ll ignore all of that for this blog post.

Actually estimating the frequency of individual items can be formalized as finding an estimate $f'$ of an item’s frequency $f$ such that $|f - f'| < \epsilon N$. This says that estimates are allowed to be incorrect, but the difference between the estimate and the true value has to be bounded by a fraction of the length of the stream so far. The error-term $\epsilon$ here is the same as the $\epsilon$ above. Again, we’ll borrow some nomenclature, and refer to this as “the frequency estimation problem”.
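Using the same made-up numbers as before, $\epsilon N = 20{,}000$: an item that truly occurs $150{,}000$ times may legitimately be reported as occurring anywhere between $130{,}000$ and $170{,}000$ times.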

Finally, we’re going to be pragmatic about what infinity means here: really really really big. Annoyingly, we’re going to define it as big enough that a solution involving more computing power or more storage just won’t work – the stream is big enough that you can never buy enough RAM, and you can’t just add another node to your cluster-computing setup of choice. Think about something like the Twitter firehose or a high-volume stream of netflow data, or something with equal volume that never stops.

## Solving the Problem Exactly

Let’s also spend a moment and look at the exact solution for counting frequent items, so that we have a baseline for understanding what we gain and lose by using an approximate algorithm like SpaceSaving.

In pseudo-code the simplest way to find the most frequent items in a finite data-set looks something like:

```python
from collections import defaultdict

weights = defaultdict(int)  # running total of weight per item, defaults to 0
for item, weight in data_set:
    weights[item] += weight

# Sort the observed items by total weight and keep the ten heaviest
by_weight = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
heavy_hitters = by_weight[:10]
```

In English, this keeps a running sum of weights per item and eventually sorts the list of observed items by weight. A few things are worth pointing out.

So, the exact solution to the problem requires keeping track of every item seen so far so that you can sum weights together. This means that the exact approach takes at least $O(|A|)$ space. This approach also needs to do a lookup for every item in the stream to increment the corresponding counter, and then needs to sort the list of all items, which takes $O(N) + O(|A| \log |A|)$ time. This can be done a little more efficiently by using a priority queue instead of sorting the entire list in place – if you’re after the top-$k$ elements of a stream, you can finish in $O(N) + O(|A| \log k)$ time.

Since the time-bounds on this approach are fairly reasonable, we’re going to leave them alone for now. They will act as a nice baseline later, though, so don’t forget about them entirely.

However, the space bounds for this approach are a little more problematic. Scaling linearly with the number of distinct items in the stream, $O(|A|)$, isn’t really going to cut it; $|A|$ is only bounded by $N$, the length of the stream so far, which we defined so that it can be arbitrarily large. Our machine is going to start running out of available memory as our stream gets larger and larger and larger. This is the point at which you say “just add more RAM” and I remind you that we (in)conveniently defined the problem so that wasn’t an option. We’ll have to find another approach.

## Wait, seriously, that works?

Invariably, while explaining the idea for this blog post to folks, someone would say – “wait, can’t you just keep around a fixed size priority queue and update it as you see new items? Just kick out the small things as you go”. As it turns out, this doesn’t end up working; a frequently occurring item with low weight can get accidentally ignored, as can a frequent item that doesn’t show up until well into the stream, since they’ll never be able to displace the smallest item in the priority queue.
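To see why, here’s a rough sketch of that naive approach run on a toy stream (the item names and weights are made up for illustration):

```python
def naive_top_k(stream, k):
    """Keep at most k counters; a new item only gets a counter if its single
    weight beats the current minimum. This is the broken approach."""
    counts = {}
    for item, weight in stream:
        if item in counts:
            counts[item] += weight
        elif len(counts) < k:
            counts[item] = weight
        elif weight > min(counts.values()):
            smallest = min(counts, key=counts.get)
            del counts[smallest]   # kick out the smallest thing
            counts[item] = weight
        # otherwise the new item is dropped and never tracked
    return counts

# Ten filler items with weight 5 grab all the counters first...
stream = [("filler-%d" % i, 5) for i in range(10)]
# ...then a genuinely frequent item shows up 100 times with weight 1.
stream += [("popular", 1)] * 100

print(naive_top_k(stream, k=10))  # "popular" (total weight 100) never appears
```

Even though “popular” ends up with a total weight of 100 – far above any filler item – each individual arrival is too small to displace the current minimum, so it never gets tracked.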

However it turns out that, with a bit of a tweak and a lot of math, an eerily similar approach ends up working.

Like we mentioned above, the SpaceSaving algorithm devised by Metwally et al. (and the accompanying StreamSummary data structure) solves both the frequent items problem and the frequency estimation problem using a fixed amount of space. In a little more detail, SpaceSaving is a deterministic, counter-based algorithm for solving the frequent items problem with fixed error guarantees relative to $N$. On heavily skewed data sets, the algorithm performs extremely well and the error bounds can be made extremely tight (the paper does an excellent job of showing the difference in error when the algorithm is applied to Zipfian data, but I’m not going to cover that here).

So, what’s the secret?

SpaceSaving works by keeping exact (item, count) pairs for the first $m$ distinct items (we’ll discuss what $m$ is shortly) observed in the stream. Subsequent items, if they’re already monitored, increment the count in the proper pair. If an incoming item isn’t already being monitored, it replaces the item in the (item, count) pair with the smallest count, and then increments its count as usual. Any item that takes over a counter this way also needs an error associated with it – the first $m$ items are counted exactly, so their error counts are set to 0. Whenever an item replaces the previous minimum, it could have been seen anywhere between 1 and min(count)+1 times, so its error counter gets set to min(count), which is the most that item’s count may have been overestimated by.

In pseudo-code, SpaceSaving looks like this:

```python
from collections import defaultdict

def space_saving(stream, m):
    counts = defaultdict(int)  # item -> (possibly overestimated) count
    errors = defaultdict(int)  # item -> maximum possible overestimate

    for item, weight in stream:
        if item in counts or len(counts) < m:
            # Already monitored, or a counter is still free: count as usual
            counts[item] += weight
        else:
            # Evict the item with the smallest count and hand its counter over
            prev_min = min(counts, key=counts.get)
            min_count = counts.pop(prev_min)
            errors.pop(prev_min, None)
            counts[item] = min_count + weight
            errors[item] = min_count  # the new item may be overcounted by this much
    return counts, errors
```
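
As a quick sanity check, here’s the sketch above run over a tiny, made-up stream with $m = 2$ (the items and weights are arbitrary):

```python
stream = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("c", 1)]
counts, errors = space_saving(stream, m=2)
print(dict(counts))  # {'a': 3, 'c': 3}
print(dict(errors))  # {'c': 1} -- "c" inherited "b"'s counter, so it may be off by one
```

Note that the two counts sum to 6, the total number of items seen so far; that’s one of the properties discussed below.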

The algorithm has a few nice properties right off the bat:

• It’s deterministic. The same input in the same order always produces the same results. The algorithm doesn’t introduce any randomness.
• Every item in the stream increments some counter, so the sum of all counters will be equal to the current size of the stream, $N$.
• With a predetermined $\epsilon$ for the frequent items problem, the algorithm uses space inversely proportional to the error. In other words, the space required is $O(\frac{1}{\epsilon})$. This is where the $m$ mentioned above comes from: the algorithm will solve the frequent items problem with the given error if $m$ is chosen to be larger than $\frac{1}{\epsilon}$.
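
To make that last guarantee concrete (the error target here is just an example):

```python
epsilon = 0.001           # tolerate counts that are off by at most 0.1% of N
m = int(1 / epsilon) + 1  # 1,001 counters, no matter how long the stream runs
```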

Whoa. Those are all pretty cool guarantees. Determinism is particularly cool for an approximate algorithm, but the fixed size and the sum of the counters always being $N$ is kind of confusing. The $m$ items that the algorithm tracks aren’t all going to be the most frequent items. This is where the error counts from the algorithm come in. An item is a frequent element if the count - error reported by the algorithm is larger than $pN$. From above, remember that $pN$ is just some fraction of the number of items seen so far, so this says that an item monitored by the algorithm is a frequent item if the counts that we can guarantee are larger than the cutoff. Conservative and correct. This wouldn’t be quite so useful, except that the authors also prove that any item whose true frequency is at least min(count) is monitored by the algorithm. The combination of these two properties is what makes SpaceSaving an approximate solution to the frequent items problem, and not just something weird you can do to your data.
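Here’s a minimal sketch of that query, assuming the `counts` and `errors` maps from the code above, with the stream length $N$ and threshold fraction $p$ supplied by the caller:

```python
def frequent_items(counts, errors, N, p):
    """Return the items whose guaranteed count clears the p*N cutoff."""
    threshold = p * N
    return [item for item, count in counts.items()
            if count - errors.get(item, 0) > threshold]
```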

This is also where the error $\epsilon$ comes back into play. Even though the algorithm monitors $m$ items, it doesn’t guarantee they’re all frequent items. In fact, it might report items that are not quite frequent enough, or it might not report borderline items if their error terms are high enough.

That’s a pretty cool algorithm.

The authors also spend part of the paper proposing a StreamSummary data structure that efficiently maintains (item, count, error) counters and tracks the minimum (there’s an implementation here). In exchange for using slightly more space, you can do the same thing with an implementation based on a priority queue and a hash-table to check whether or not an item is being monitored. This ends up being less space-efficient than StreamSummary in exchange for better insert/update and query times. We’ve been playing with an implementation in that style, and might post about it at some point in the future.

The only downside of SpaceSaving/StreamSummary seems to be approximating the individual frequencies of items. The way $\epsilon$ is defined, the error in the individual frequencies is relative to the number of items seen so far, and not relative to the frequencies themselves. This means that the error on the frequency counts for individual elements can be arbitrarily large, relative to the true frequency of the actual element. In practice, this can mean an outrageous amount of error relative to the true frequency of an individual element. We haven’t thrown an implementation into production here at Boundary because of this – we’d like to be able to guarantee our customers that the flow-data we report to them is correct.

Despite this downside, we think SpaceSaving (and StreamSummary) is particularly cool because it’s so simple: no hashing is involved and the correctness proofs are relatively accessible. It’s also a perfect example of the kind of tradeoff you have to make when dealing with an approximate algorithm. By giving up on error bounds on individual item frequencies, you gain bounded error on frequent items relative to $N$ and get to solve the problem in a fixed amount of space.

## Bibliography

We linked to the following papers above:

• Graham Cormode and Marios Hadjieleftheriou. “Finding Frequent Items in Data Streams.” Proceedings of the VLDB Endowment, 2008.
• Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. “Efficient Computation of Frequent and Top-k Elements in Data Streams.” ICDT 2005.

There’s much more excellent reading to be done on heavy hitter algorithms, and probabilistic algorithms in general: the gang at Aggregate Knowledge has done a short post on Count-Min Sketch, which solves the same heavy-hitters problem by taking a bloom-filter-esque approach.

## Opscode and Boundary: Visibility Plus Change Automation

Posted on May 10th, 2013

We spoke with Bryan Hale, VP of Online Services at cloud infrastructure automation provider Opscode, to learn how Opscode customers benefit from Chef’s integration with Boundary, and to get his opinion on the future of IT monitoring. Before Opscode, Hale worked at VC firm Draper Fisher Jurvetson (DFJ), and in the Corporate Development group at salesforce.com. Follow Bryan on Twitter: @halebr.

Boundary: What are the top problems that your product solves for customers?

Hale: Opscode Chef is an open source configuration management and IT automation framework whose corporate contributors include Facebook, Google, Rackspace, Dell, HP and hundreds of other leading technology companies. Chef is flexible enough to be wielded in any number of different use cases, but two of the most popular are:

• Server Configuration Management: Chef saves operations engineers time and money by making sure that every system in a given environment is built and maintained exactly as it needs to be, in a fully automated fashion: every package is installed, file written, service turned on and so forth.
• Continuous Delivery: Chef’s templates (“cookbooks” and “recipes”) can be managed exactly like code, allowing development and operations teams to quickly and consistently deliver enhancements to both applications and the infrastructure that supports applications.

Boundary: How did your product’s integration with Boundary come about, and how do customers benefit?

Hale: Many of our customers have highlighted the fact that Opscode and Boundary are working on complementary technologies that, when working in concert, can deliver an awful lot of value. Both Chef and Boundary are optimized for public, private and hybrid clouds, where visibility and instantaneous response are essential. The new Boundary event management service and API allows Chef to send richer information into Boundary, enabling IT operations teams to instantly see the detailed status of each automation request.

Boundary: From your view, how is the world of monitoring, and monitoring technologies, evolving?

Hale: It’s no secret that core IT infrastructure is undergoing a massive shift towards systems that are immediately and abundantly available and that need to handle rapid change. This has a wide set of consequences for the world of management and monitoring. To keep up, monitoring tools face a tall order. They must keep up with a greater number of potential issues within environments that have more scale and complexity than ever before. Offerings such as Boundary, with very fine levels of granularity and a SaaS delivery model, are purpose-built for this new world.
