Network graphs can be used as primary visual objects, with conventional charts used to supply detailed views
[A version of this post appears on the O’Reilly Data blog.]
With Network Science well on its way to being an established academic discipline, we’re beginning to see tools that leverage it. Applications that draw on this discipline rely heavily on visual representations and come with interfaces aimed at business users. For business analysts used to consuming bar and line charts, network visualizations take some getting used to. But with enough practice, and for the right set of problems, they are an effective visualization model.
In many domains, network graphs can be the primary visual objects, with conventional charts used to supply detailed views. I recently got a preview of some dashboards built using Financial Network Analytics (FNA). In the example below, the primary visualization represents correlations among assets across different asset classes (1), while the accompanying charts provide detailed information for individual nodes:
Using the network graph as the centerpiece of a dashboard works well in this instance. And with FNA’s tools already being used by a variety of organizations and companies in the financial sector, I think “Network Science dashboards” will become more commonplace in financial services.
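To make the idea of a correlation network concrete, here is a minimal sketch in plain Python. It is not FNA’s method: the asset names, the return figures, and the 0.8 threshold are all made up for illustration. Each asset becomes a node, and an edge is kept only when the absolute correlation between two return series clears the threshold.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_network(returns, threshold=0.5):
    """Return edges (a, b, corr) for asset pairs with |corr| >= threshold."""
    edges = []
    for a, b in combinations(sorted(returns), 2):
        corr = pearson(returns[a], returns[b])
        if abs(corr) >= threshold:
            edges.append((a, b, round(corr, 3)))
    return edges


# Hypothetical daily returns, invented purely for illustration.
returns = {
    "equity_index": [0.01, -0.02, 0.015, 0.003, -0.01],
    "tech_stock": [0.012, -0.025, 0.02, 0.001, -0.012],
    "gov_bond": [-0.002, 0.004, -0.003, 0.0, 0.002],
    "gold": [0.001, 0.003, -0.004, 0.002, 0.0],
}

edges = correlation_network(returns, threshold=0.8)
for a, b, corr in edges:
    print(f"{a} -- {b}: {corr:+.3f}")
```

In a dashboard, the resulting edge list would feed the network view, while the per-asset series would back the conventional charts shown for a selected node.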
Network Science dashboards only work to the extent that network graphs are effective: network graphs tend to get harder to navigate and interpret as the number of nodes and edges grows (2). One workaround is to aggregate nodes and visualize communities rather than individual objects. New ideas may also come to the rescue: the rise of networks and graphs is leading to better techniques for visualizing large networks.
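The aggregation workaround can be sketched as follows. This assumes community labels are already available (from whatever detection method you prefer); the graph and the labels here are hypothetical. Node-level edges are collapsed into counts between and within communities, which is a much smaller object to draw.

```python
from collections import Counter


def aggregate_by_community(edges, community_of):
    """Collapse node-level edges into community-level edge counts.

    edges: iterable of (node_a, node_b) pairs
    community_of: dict mapping each node to its community label
    Returns (inter, intra): edge counts between community pairs and
    edge counts inside each community.
    """
    inter, intra = Counter(), Counter()
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        if ca == cb:
            intra[ca] += 1
        else:
            # Sort the pair so (X, Y) and (Y, X) count as the same edge.
            inter[tuple(sorted((ca, cb)))] += 1
    return inter, intra


# Hypothetical graph: two triangles joined by a single bridge edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
community_of = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y", "f": "Y"}

inter, intra = aggregate_by_community(edges, community_of)
print("between communities:", dict(inter))
print("within communities:", dict(intra))
```

Seven edges over six nodes reduce to two community nodes and one weighted link between them; the same collapse is what keeps a large graph from turning into a hairball.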
This fits one of the themes we’re seeing in Strata: cognitive augmentation. The right combination of data/algorithm(s)/interface allows analysts to make smarter decisions much more efficiently. While much of the focus has been on data and algorithms, it’s good to see more attention paid to effective interfaces and visualizations.
(0) This post is based on a recent conversation with Kimmo Soramäki, founder of Financial Network Analytics.
(1) Kimmo is an experienced researcher and policy-maker who has consulted and worked for several central banks. Thus FNA’s first applications are aimed at financial services.
(2) Traditional visual representations of large networks are pejoratively referred to as “hairballs”.
General-purpose platforms can come across as hammers in search of nails
[A version of this post appears on the O’Reilly Data blog and Forbes.]
As much as I love talking about general-purpose big data platforms and data science frameworks, I’m the first to admit that many of the interesting startups I talk to are focused on specific verticals. At their core, big data applications merge large amounts of real-time and static data to improve decision-making:
This simple idea can be hard to execute in practice (think volume, variety, velocity). Unlocking value from disparate data sources entails some familiarity with domain-specific (1) data sources, requirements, and business problems.
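Stripped of scale, the core pattern of merging real-time and static data looks something like the sketch below. Everything in it is hypothetical: the field names (`customer_id`, `used_mb`, `plan_limit_mb`), the reference table, and the 90% rule are invented to illustrate the join, not taken from any real system.

```python
def enrich(events, reference):
    """Join each streaming event with static attributes keyed by customer_id."""
    for event in events:
        profile = reference.get(event["customer_id"], {})
        yield {**event, **profile}


def flag_heavy_usage(enriched, ratio=0.9):
    """Yield customer ids whose usage has reached `ratio` of their plan limit."""
    for row in enriched:
        limit = row.get("plan_limit_mb")
        # Events with no matching reference record are skipped, not guessed at.
        if limit and row["used_mb"] >= ratio * limit:
            yield row["customer_id"]


# Static reference data (hypothetical).
reference = {
    1: {"plan_limit_mb": 1000},
    2: {"plan_limit_mb": 5000},
}

# Real-time events (hypothetical); customer 3 has no reference record.
events = [
    {"customer_id": 1, "used_mb": 950},
    {"customer_id": 2, "used_mb": 1200},
    {"customer_id": 3, "used_mb": 10},
]

flagged = list(flag_heavy_usage(enrich(events, reference)))
print(flagged)
```

The join itself is trivial. Knowing which reference data matters, what a missing record means, and which threshold drives a useful decision is where the domain familiarity comes in.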
It’s difficult enough to solve a specific problem, let alone a generic one. Consider the case of Guavus – a successful startup that builds big data solutions for the telecom industry (“communication service providers”). Its founder was very familiar with the data sources in telecom, and knew the types of applications that would resonate within that industry. Once they solve one set of problems for a telecom company (network optimization), they quickly leverage the same systems to solve others (marketing analytics).
This ability to address a variety of problems stems from Guavus’ deep familiarity with data and problems in telecom. In contrast, a typical general-purpose platform can come across as a hammer in search of a nail. So while I remain a fan (and user) of general-purpose platforms, the less well-known verticalized solutions are definitely on my radar.
Better tools can’t overcome poor analysis
I’m not suggesting that the criticisms raised against big data don’t apply to verticalized solutions. But many problems are due to poor analysis and not the underlying tools. A few of the more common criticisms arise from analyzing correlations: correlation is not causation, correlations are dynamic and can sometimes change drastically (2), and data dredging produces spurious relationships (3).
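The point about dynamic correlations is easy to demonstrate with a rolling window. The two series below are synthetic and chosen so that they move together in the first half and in opposite directions in the second; a single full-sample correlation hides that shift entirely.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rolling_correlation(xs, ys, window):
    """Correlation of xs and ys over each consecutive window of the given size."""
    return [
        pearson(xs[i:i + window], ys[i:i + window])
        for i in range(len(xs) - window + 1)
    ]


# Synthetic series: y tracks x for five steps, then reverses direction.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 9, 8, 7, 6, 5]

full_sample = pearson(x, y)
rolling = rolling_correlation(x, y, window=5)

print(f"full sample: {full_sample:+.2f}")
print("rolling:", [f"{c:+.2f}" for c in rolling])
```

The full-sample figure comes out moderately positive, while the first window is perfectly positive and the last perfectly negative. An analysis that reports only the former is a problem of method, not of tooling.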
(0) This post grew out of a recent conversation with Guavus founder, Anukool Lakhina.
(1) General-purpose platforms and components are helpful, but they usually need to be “tweaked” or “optimized” to solve problems in a variety of domains.
(2) When I started working as a quant at a hedge fund, traders always warned me that correlations jump to 1 during market panics.
(3) The best example comes from finance and involves the S&P 500 and butter production in Bangladesh.
I’ll be hosting a webcast on Spark SQL featuring Michael Armbrust of Databricks:
In this webcast, we’ll examine Spark SQL, a new Alpha component that is part of the Apache Spark 1.0 release. Spark SQL lets developers natively query data stored in both existing RDDs and external sources such as Apache Hive. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query external data with complex analytics. In addition to Spark SQL, we’ll explore the Catalyst optimizer framework, which allows Spark SQL to automatically rewrite query plans to execute more efficiently.
It’s scheduled for Tuesday, April 29, 2014 at 1PM (San Francisco time). I’ll introduce Michael and moderate a Q&A following his presentation. Spark SQL is generating a lot of interest within the Apache Spark community. This is a great opportunity to learn about it from its lead developer. I hope to see you online on the 29th.
HBase has made inroads in companies across many industries and countries
[A version of this post appears on the O’Reilly Data blog.]
With HBaseCon right around the corner, I wanted to take stock of one of the more popular (1) components in the Hadoop ecosystem. Over the last few years, many more companies have come to rely on HBase to run key products and services. The conference will showcase a wide variety of such examples, and highlight some of the new features that HBase developers have added over the past year. In the meantime, here are some things (2) you may not have known about HBase:
Many companies have had HBase in production for 3+ years: Large technology companies including Trend Micro, eBay, Yahoo! and Facebook, and analytics companies RocketFuel and Flurry depend on HBase for many mission-critical services.
There are many use cases beyond advertising: Examples include communications (Facebook messages, Xiaomi), security (Trend Micro), measurement (Nielsen), enterprise collaboration (Jive Software), digital media (OCLC), DNA matching (Ancestry.com), and machine data analysis (Box.com). In particular, Nielsen uses HBase to track media consumption patterns and trends, mobile handset company Xiaomi uses HBase for messaging and other consumer mobile services, and OCLC runs the world’s largest online database of library resources on HBase.
Flurry has the largest contiguous HBase cluster: Mobile analytics company Flurry has an HBase cluster with 1,200 nodes (replicating into another 1,200-node cluster). Flurry is planning to significantly expand its large HBase cluster in the near future.