Turning big data into actionable insights

[A version of this article appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: Evangelos Simoudis on data mining, investing in data startups, and corporate innovation.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Can developments in data science and big data infrastructure drive corporate innovation? To be fair, many companies are still in the early stages of incorporating these ideas and tools into their organizations.

Evangelos Simoudis has spent many years interacting with entrepreneurs and executives at major global corporations. Most recently, he’s been advising companies interested in developing long-term strategies pertaining to big data, data science, cloud computing, and innovation. He began his career as a data mining researcher and practitioner, and is counted among the pioneers who helped data mining technologies get adopted in industry.

In this episode of the O’Reilly Data Show, I sat down with Simoudis and we talked about his thoughts on investing, data applications and products, and corporate innovation:

Open source software companies

I very much appreciate open source. I encourage my portfolio companies to use open source components as appropriate, but I’ve never seen the business model as being one that is particularly easy to really build the companies around them. Everybody points to Red Hat, and that may be the exception, but I have not seen companies that have, on the one hand, remained true to the open source principles and become big and successful companies that do not require constant investment. … The revenue streams never prove to be sufficient for building big companies. I think the companies that get started from open source in order to become big and successful … [are] ones that, at some point, decided to become far more proprietary in their model and in the services that they deliver. Or they become pure professional services companies as opposed to support services companies. Then they reach the necessary levels of success.

Continue reading

How intelligent data platforms are powering smart cities

[A version of this post appears on the O’Reilly Radar.]

Smart cities and smart nations run on data.

According to a 2014 U.N. report, 54% of the world’s population resides in urban areas, with further urbanization projected to push that share up to 66% by the year 2050. This projected surge in population has encouraged local and national leaders throughout the world to rally around “smart cities” — a collection of digital and information technology initiatives designed to make urban areas more livable, agile, and sustainable.

Smart cities depend on a collection of enabling technologies that we’ve been highlighting at Strata + Hadoop World and in our publications: sensors, mobile computing, social media, high-speed communication networks, and intelligent data platforms. Early applications of smart city technologies are seen in transportation and logistics, local government services, utilities, health care, and education. Previous Strata + Hadoop World sessions have outlined the use of machine learning and big data technologies to understand and predict vehicular traffic and congestion patterns, as well as the use of wearables in large-scale health care data platforms.

As we put together the program for the upcoming Strata + Hadoop World in Singapore, we have been cognizant of the growing interest in our host country’s Smart Nation program. And more generally, we are mindful that large infrastructure investments throughout the Asia-Pacific region have engaged local leaders in smart city initiatives. For readers comfortable with large-scale streaming platforms, many of the key technologies for enabling smart cities will already be familiar:

Data collection and transport

In smart city, Internet of Things, and industrial Internet applications, proper instrumentation and data collection depend on sensors, mobile devices, and high-speed communication networks. Much of the private infrastructure belongs to and is operated by large telecommunication companies, and many of the interesting early applications and platforms are originating from telcos and network equipment providers.
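
As an illustration, here is a minimal sketch of a sensor-data collector, assuming an MQTT broker (a common transport for sensor networks) and the paho-mqtt client library; the broker host, topic, and message format are hypothetical:

```python
# A minimal sketch of sensor-data collection over MQTT.
# Broker host, topic, and JSON payload format are hypothetical.
import json
import paho.mqtt.client as mqtt

def on_message(client, userdata, message):
    # Each payload is assumed to be a JSON-encoded sensor reading.
    reading = json.loads(message.payload.decode("utf-8"))
    print(reading["sensor_id"], reading["value"])

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.org", 1883)  # hypothetical broker
client.subscribe("city/sensors/#")          # hypothetical topic hierarchy
client.loop_forever()                       # process readings as they arrive
```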

Data processing, storage, and real-time reports

As I noted in an earlier article, recent advances in distributed computing and hardware have produced high-throughput engines capable of handling bounded and unbounded data processing workloads. Examples of this include cloud computing platforms (e.g., AWS, Google, Microsoft) and homegrown data platforms composed of popular open source components. At the most basic level, these data platforms provide near real-time reports (business intelligence) on massive data streams:

[Figure: Real-time data fusion.]
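
As a minimal sketch of such a near real-time report, the snippet below counts events per type over five-second micro-batches with Spark Streaming (one popular open source component); the socket source and the comma-separated event format are assumptions:

```python
# A minimal sketch of near real-time counts over an unbounded stream.
# The socket source and "<event_type>,<payload>" format are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="RealTimeCounts")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

events = ssc.socketTextStream("localhost", 9999)
counts = (events
          .map(lambda line: (line.split(",")[0], 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()  # emit per-batch counts, i.e., a simple rolling report

ssc.start()
ssc.awaitTermination()
```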

Intelligent data applications

Beyond simple counts and anomaly detection, the use of advanced techniques in machine learning and statistics opens up novel real-time applications (machine-to-machine) with no humans in the loop. Popular examples of such applications include systems that power environments like data centers, buildings and public spaces, and manufacturing (industrial Internet). Recognizing that future smart city applications will rely on disparate data — including event data (metrics from logs and time-series), unstructured data (images, audio, text), and geospatial data sources — we have planned sessions at Strata + Hadoop World Singapore that will cover advanced analytic techniques targeting these data types.
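
As a toy illustration of such a no-humans-in-the-loop application, here is a minimal sketch of an automated anomaly check on a stream of sensor readings; the window size, threshold, and control action are all hypothetical:

```python
# A toy sketch of a machine-to-machine loop: flag readings more than
# THRESHOLD rolling standard deviations from the rolling mean, then
# trigger an automated action. All parameters are hypothetical.
from collections import deque
import statistics

WINDOW = 100   # number of recent readings to keep (hypothetical)
THRESHOLD = 3  # z-score cutoff (hypothetical)

recent = deque(maxlen=WINDOW)

def actuate(value):
    # Stand-in for an automated control action (e.g., throttling a pump).
    print(f"anomaly: {value}; issuing automated control action")

def on_reading(value):
    if len(recent) >= 10:  # wait for a minimal history
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent)
        if stdev > 0 and abs(value - mean) / stdev > THRESHOLD:
            actuate(value)
    recent.append(value)

# Example: a stream of ordinary readings with one obvious outlier.
for v in [10.0, 10.4, 9.8, 10.1] * 15 + [55.0]:
    on_reading(v)
```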

Smart city platforms represent some of the more exciting and impactful applications of real-time, intelligent big data systems. These platforms will confront many of the same challenges faced by applications in the commercial sector, including security, ethics, and governance. At Strata + Hadoop World Singapore, we’re looking forward to highlighting the intersection of communities and technologies that power our future cities.

Resolving transactional access and analytic performance trade-offs

[A version of this article appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: Todd Lipcon on hybrid and specialized tools in distributed systems.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

In recent months, I’ve been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera’s Todd Lipcon unveiled an open source storage layer — Kudu — that’s good at both table scans (analytics) and random access (updates and inserts).

While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems — to eke out extra performance — will be harder to justify.

During the latest episode of the O’Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released. Here are a few snippets from our conversation:

HDFS and HBase

[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can’t edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by-row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you’re putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you’re kind of out of luck on HDFS.

Continue reading
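
To make the random-access pattern Lipcon describes concrete, here is a minimal sketch using happybase, a popular Python client for HBase; the host, table, and column names are hypothetical:

```python
# A minimal sketch of HBase-style random access: update a single
# customer record in place and read it back by key. Host, table,
# and column family names are hypothetical.
import happybase

connection = happybase.Connection("hbase.example.org")
table = connection.table("customers")

# Update one row in place -- something append-only HDFS files cannot do.
table.put(b"customer:1001", {b"profile:email": b"new@example.com"})

# Randomly access that single row by key, typically in milliseconds.
row = table.row(b"customer:1001")
print(row[b"profile:email"])
```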

Specialized and hybrid data management and processing engines

A new crop of interesting solutions for the complexity of operating multiple systems in a distributed computing setting

The 2004 holiday shopping season marked the start of Amazon’s investigation into alternative database technologies that led to the creation of Dynamo — a key-value storage system that went on to inspire several NoSQL projects. A new group of startups began shifting away from the general-purpose systems favored by companies just a few years earlier. In recent years, we’ve seen a diverse set of DBMS technologies that specialize in handling particular workloads and data models such as OLTP, OLAP, search, RDF, XML, scientific applications, etc. The success and popularity of such systems reinforced the belief that in order to scale and “go fast,” specialized systems are preferable.

In distributed computing, the complexity of maintaining and operating multiple specialized systems has recently led to systems that bridge multiple workloads and data models. Aside from multi-model databases, a growing number of storage and compute engines adept at handling different workloads and problems are emerging. At this week’s Strata + Hadoop World conference in NYC, I had a chance to interact with the creators of some of these new solutions.

OLTP (transactions) and OLAP (analytics)

One of the key announcements at Strata + Hadoop World this week was Project Kudu — an open source storage engine that’s good at both table scans (analytics) and random access (updates and inserts). Its creators are quick to point out that they aren’t out to beat specialized OLTP and OLAP systems. Rather, they’re shooting to build a system that’s “70-80% of the way there on both axes.” The project is very young and lacks enterprise features, but judging from the reaction at the conference, it’s something the big data community will be watching. Leading technology research firms have created a category for systems with related capabilities: HTAP (Gartner) and Trans-analytics (Forrester).
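
For a rough feel of serving both workloads from one engine, the sketch below uses the Kudu Python client (kudu-python); the master address, table name, and schema are hypothetical, the table is assumed to already exist, and the exact client calls should be checked against the project’s documentation:

```python
# A rough sketch of Kudu's dual workloads: single-row upserts
# (OLTP-style random access) and a full-table scan with an aggregate
# (OLAP-style analytics). Master address, table name, and schema are
# hypothetical; the table is assumed to already exist.
import kudu

client = kudu.connect(host="kudu-master.example.org", port=7051)
table = client.table("metrics")

# Random access: write one row at a time.
session = client.new_session()
session.apply(table.new_upsert({"key": 1001, "value": 42.0}))
session.flush()

# Analytics: scan the whole table and aggregate the "value" column.
scanner = table.scanner().open()
total = sum(row[1] for row in scanner.read_all_tuples())
print("sum of value column:", total)
```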

Search and interactive analytics (SQL)

If you had a chance to walk around the large Strata + Hadoop World expo hall, you probably noticed many companies positioning themselves to handle large-scale, real-time, machine-generated data. Many of these companies specifically target log files — given the success of companies like Splunk, there is a proven market for such tools. Moreover, it turns out that search and interactive analytics (SQL) are used by analysts wanting to make sense of massive amounts of log files. A few startups have attempted to build on open source ecosystem components by combining a search tool (Lucene) and some SQL-on-Hadoop engine.

A while back, I played around with SenseiDB — an open source project that adds a query language (and faceted search) to a search engine — and that experience made me appreciate the power of combining search and SQL. More recently, a new San Francisco Bay Area startup called X15 Software built an engine that combines search and SQL capabilities, and aimed it specifically at analysts who work with log files (and other machine-generated data).
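
Neither SenseiDB nor X15’s engine is shown here, but as a stand-in, here is a minimal sketch of combining full-text search with a SQL-style GROUP BY aggregation using Elasticsearch’s Python client; the index and field names are hypothetical:

```python
# A stand-in sketch of search plus SQL-style aggregation over log data,
# using Elasticsearch (not the tools named above). Index and field
# names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Full-text search ("error timeout") combined with a GROUP BY-style
# aggregation: count matching log lines per host.
response = es.search(index="logs", body={
    "query": {"match": {"message": "error timeout"}},
    "aggs": {"per_host": {"terms": {"field": "host"}}},
    "size": 0,  # skip raw hits; we only want the aggregation
})
for bucket in response["aggregations"]["per_host"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```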

Bounded and unbounded data processing and analytics

One of the takeaways from Tyler Akidau’s extremely popular article on streaming is that our labels — batch and streaming — are fast becoming outdated. Batch and streaming traditionally have been used to describe compute engines, but with the rise of engines that can do both, we’re better off describing the type of data in question: bounded and unbounded/continuous. These “unified” engines come in two flavors: batch engines that can handle streaming problems (e.g., Spark Streaming), and streaming engines that can also be used for batch computations (e.g., Google Dataflow).
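
As a small illustration of the first flavor, the sketch below applies identical counting logic to bounded data (a static file) and unbounded data (a socket stream) on Spark; the file path and socket source are hypothetical:

```python
# A minimal sketch of one engine handling both bounded and unbounded
# data: the same counting logic runs over a static file (batch) and a
# socket stream (streaming). File path and socket source are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def count_by_key(pairs):
    # Shared logic: works on an RDD and, via transform, on a DStream.
    return pairs.map(lambda k: (k, 1)).reduceByKey(lambda a, b: a + b)

sc = SparkContext(appName="UnifiedCounts")

# Bounded: a static file processed as a batch job.
batch_counts = count_by_key(sc.textFile("hdfs:///data/events.txt"))
print(batch_counts.take(10))

# Unbounded: the same logic applied to each micro-batch of a stream.
ssc = StreamingContext(sc, batchDuration=5)
stream_counts = ssc.socketTextStream("localhost", 9999).transform(count_by_key)
stream_counts.pprint()
ssc.start()
ssc.awaitTermination()
```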

Streaming, of course, arises in the context of real-time processing and analytics (a major focus at this year’s conference). One side note: whenever someone tells you that few companies use or need real-time, they’re likely referring to settings where human decision-makers are in the loop (“human real-time”). That misses the mark because the true impact of these technologies will be in applications with no humans in the loop. As UC Berkeley Professor Joe Hellerstein noted a while back, “real-time is for robots.”