Build your own data-stream mining NSA in the cloud with “FunnelCake” | Ars Technica


BrightContext puts a SQL-like language atop analytics tech spawned by Twitter.

There's more to "big data" than just lots of bits on disks. Some things you can't just store in the raw; others you need to analyze and process before they ever hit a disk so that they can be acted upon in near-real time. Picking specific communications sessions out of the data stream from a network tap on an Internet backbone is one example.

Stream processing, also known as complex event processing, is the real-time querying, analysis, and conversion of information within torrents of live data. It's part of what deep packet inspection and packet capture systems do with network traffic. These tools apply a set of rules to decide what to capture from Internet packets, then aggregate and transform what's in them into captured content and metadata about those packets. Security information and event management (SIEM) systems do the same thing with log files, reports from packet sniffers, and other sources. They pull multiple streams together and analyze them for connections and possible security threat signatures.
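To make the filter, transform, and aggregate steps concrete, here is a minimal Python sketch. It is only an illustration: the packet fields (`src`, `port`, `payload`) and the function names are assumptions made up for this example, not taken from any real packet capture tool. Each stage handles one event at a time as it arrives, so nothing has to be stored first.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

Event = Dict[str, Any]  # hypothetical event shape, assumed for this sketch


def select(events: Iterable[Event], rule: Callable[[Event], bool]) -> Iterator[Event]:
    """Filter: pass along only the events that match the capture rule."""
    for event in events:
        if rule(event):
            yield event


def to_metadata(events: Iterable[Event]) -> Iterator[Event]:
    """Transform: reduce each event to metadata (source and payload size)."""
    for event in events:
        yield {"src": event["src"], "size": len(event["payload"])}


def running_bytes_per_source(records: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Aggregate: emit an updated byte total for a source after every record."""
    totals: Dict[str, int] = {}
    for record in records:
        src = record["src"]
        totals[src] = totals.get(src, 0) + record["size"]
        yield src, totals[src]


# A short list stands in for a live, never-ending feed.
packets = [
    {"src": "10.0.0.1", "port": 80, "payload": "GET /"},
    {"src": "10.0.0.2", "port": 22, "payload": "ssh"},
    {"src": "10.0.0.1", "port": 80, "payload": "GET /index"},
]

web_only = select(packets, lambda e: e["port"] == 80)
updates = list(running_bytes_per_source(to_metadata(web_only)))
print(updates)
```

Because the aggregate stage yields a running total rather than a final answer, the same chain would keep producing results if the list were replaced by an endless source.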

But the need for stream processing isn't unique to intelligence organizations like the National Security Agency (NSA). A software-as-a-service startup called BrightContext has released a new version of its stream processing service. BrightContext makes building complex stream processing systems as easy as filling out a Web form and writing a few lines of script. Now, nearly anyone can create their own miniature NSA data center in Amazon's cloud. The service's scripting language, called FunnelCake, looks a bit like SQL or JavaScript. But unlike SQL, it's designed for never-ending queries against huge streams of data running in parallel.

Riders on the Storm

BrightContext CEO John Funge and CTO Leo Scott started working on the ideas behind the service in 2010. The pair sold their photo-sharing service, Pickle.com, to Scripps Networks (the owners of cable channels like HGTV and Food Network). Next, the duo was pulled into helping Scripps deal with problems involving high volumes of user interactions. "With TV audiences, you have millions of viewers," Funge said. "But as soon as you start having viewers interact, your Web servers take a hit. The problem is, how do you take in all this input and translate it to give stuff back?"

Part of the answer to that question arrived in 2011, when Twitter acquired data analytics startup BackType and then published a big part of BackType's technology as free and open source software. The software, called Storm, is the real-time computing platform used to power analytics and other stream-driven features at Twitter and other companies like Groupon or the Weather Channel. It's also a major piece of the underpinnings of BrightContext's platform.

"Big data" analytics systems that work with large repositories of data, such as Hadoop, typically deploy large numbers of worker apps running in parallel to sort through information and return results. Storm is designed to do a similar thing with data "in flight." It allows developers to build perpetually running worker applications, called "bolts," that can be plugged together in workflows to search, aggregate, and transform raw data into usable information.
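The idea of plugging long-running workers together can be sketched in a few lines of Python. This is not Storm's API (Storm runs on the JVM and has its own interfaces); the `Bolt` base class, its `then`, `emit`, and `process` methods, and the three example workers are all hypothetical names invented for this illustration.

```python
from typing import Any, Dict, List


class Bolt:
    """Hypothetical worker: receives an item, may emit items downstream."""

    def __init__(self) -> None:
        self.downstream: List["Bolt"] = []

    def then(self, bolt: "Bolt") -> "Bolt":
        """Wire another bolt to this one's output; returns it for chaining."""
        self.downstream.append(bolt)
        return bolt

    def emit(self, item: Any) -> None:
        for bolt in self.downstream:
            bolt.process(item)

    def process(self, item: Any) -> None:
        raise NotImplementedError


class SplitWords(Bolt):
    """Transform: break each incoming line into lowercase words."""

    def process(self, item: str) -> None:
        for word in item.lower().split():
            self.emit(word)


class CountWords(Bolt):
    """Aggregate: keep a running count per word and emit each update."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Dict[str, int] = {}

    def process(self, word: str) -> None:
        self.counts[word] = self.counts.get(word, 0) + 1
        self.emit((word, self.counts[word]))


class Collect(Bolt):
    """Sink: remember everything that reaches the end of the workflow."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[Any] = []

    def process(self, item: Any) -> None:
        self.items.append(item)


splitter, counter, sink = SplitWords(), CountWords(), Collect()
splitter.then(counter).then(sink)

for line in ["storm is fast", "storm is open"]:
    splitter.process(line)

print(counter.counts)
```

In this toy version everything runs in one process and one thread. The parts a real cluster has to supply, such as spreading workers across machines and moving messages between them, are exactly what the sketch leaves out.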

Storm has some relatively simple programming interfaces, but the actual management of a cluster of Storm servers and integration into data sources is a bit more complex. BrightContext hides that complexity, including the messaging middleware used to wire together the worker applications and the management system that spawns them and restarts them when they crash. All of this sits behind a Web dashboard, and the pieces themselves run within Amazon's cloud.

The Washington Post and AOL were early customers, and they are still using BrightContext to process audience feedback. The Post used the service to perform analysis on audience interactions during the presidential debates, according to Funge. "Their front-end developers, who are part of the newsroom team, did almost all the work," Funge said. "They were so self-sufficient and grabbed our software developer kit and APIs with such a light amount of support from us that we were wondering if there was an issue with the software. But it was because it was so straightforward to them."

The Washington Post's sentiment tracker application, powered by BrightContext's stream processing, for the first Obama/Romney presidential debate.

A series of funnels

BrightContext's software-as-a-service versions of the Storm bolts are called "QuantChannels," which apply some filtering or transformation to data passing through them, and "ThroughChannels," which simply pipe data from a stream unprocessed into other applications. Both can be set up through the Web dashboard by defining the data elements that will come in through the stream. When they're turned on, their output can be directed back to an analytics application or other worker apps in the BrightContext cloud for further filtering and processing.
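The difference between the two channel types can be pictured with a short Python sketch. It borrows only the concepts from the description above; the `Channel` class, its `fields` and `step` parameters, and the vote messages are assumptions invented for illustration and say nothing about how BrightContext's dashboard or APIs actually work.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Message = Dict[str, Any]


@dataclass
class Channel:
    """Hypothetical channel: declared data elements plus an optional step."""

    fields: Dict[str, type]  # the data elements expected on the stream
    step: Optional[Callable[[Message], Optional[Message]]] = None  # None = pass through

    def accepts(self, msg: Message) -> bool:
        """True if the message has exactly the declared fields and types."""
        return set(msg) == set(self.fields) and all(
            isinstance(msg[name], kind) for name, kind in self.fields.items()
        )

    def run(self, stream: Iterable[Message]) -> Iterator[Message]:
        for msg in stream:
            if not self.accepts(msg):
                continue  # drop messages that don't match the declaration
            if self.step is None:
                yield msg  # "through" behavior: forward unprocessed
            else:
                out = self.step(msg)  # "quant" behavior: filter or transform
                if out is not None:
                    yield out


vote_fields = {"user": str, "score": int}
votes = [
    {"user": "a", "score": 4},
    {"user": "b", "score": "bad"},  # wrong type, so it is dropped
    {"user": "c", "score": 1},
]

through = Channel(vote_fields)
quant = Channel(
    vote_fields,
    step=lambda m: {"user": m["user"], "positive": True} if m["score"] >= 3 else None,
)

passed = list(through.run(votes))
positives = list(quant.run(votes))
print(passed, positives)
```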

There are two kinds of QuantChannels, Funge said. "There's the more straightforward sort of stream processing where the servers running channels can act alone, like filtering." Those sorts of tasks can be set up with very little code through BrightContext's Web dashboard. "Then there's the harder kind," he continued. "To get an answer, you need to know something about calculations going on across all the other servers, such as aggregation where you're combining all the results with math."
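A small Python sketch shows why the second kind is harder. Filtering needs nothing beyond the events a server already holds, but an average is only right if every server's partial result is combined. The shard data and function names here are assumptions for illustration; this is neither FunnelCake nor BrightContext's actual method.

```python
from typing import List, Optional, Tuple


def filter_on_worker(events: List[int], threshold: int) -> List[int]:
    """Easy kind: each server decides alone, with no knowledge of the others."""
    return [e for e in events if e >= threshold]


def partial_aggregate(events: List[int]) -> Tuple[int, int]:
    """Each server reports a (sum, count) pair for the events it saw."""
    return sum(events), len(events)


def merge_mean(partials: List[Tuple[int, int]]) -> Optional[float]:
    """Harder kind: the answer depends on combining every server's partial."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Three hypothetical servers, each seeing a different slice of the stream.
shards = [[1, 5, 3], [4, 4], [2]]

filtered = [filter_on_worker(shard, 3) for shard in shards]
global_mean = merge_mean([partial_aggregate(shard) for shard in shards])

# Averaging each server's own mean gives a different, wrong answer
# because the servers saw different numbers of events.
naive_mean = sum(sum(s) / len(s) for s in shards) / len(shards)

print(filtered, global_mean, naive_mean)
```

Here the true mean over all six events is 19/6, about 3.17, while the average of the per-server means is 3.0. That gap is the coordination cost Funge is describing.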

Documentation of two of the five methods used in FunnelCake to process data streams.

This is where FunnelCake comes in. It can be used to construct complex sets of calculations and aggregations, doing in a handful of lines of script what would normally be days of coding. "We are able to take what would otherwise be thousands of lines of Java code or some other language in Storm and boil it down to a few lines of code in FunnelCake," Funge said.

BrightContext isn't the only player in the cloud market for handling data streams. Axeda, for example, offers a cloud-based platform for handling machine-to-machine data streams and tying them into enterprise information systems. But BrightContext is one of the first companies to build this sort of general-purpose stream-processing-as-a-service platform in a public cloud. And it has given anyone with a data stream the ability to mine it with NSA-like capability on demand.

Listing image by Lorax
