Best Practices for Optimizing Infrastructure Performance and Budget

I’ll be hosting a webcast next week – featuring Alex Bordei – on a topic that should be of interest to anyone building data applications and data products:

When harnessed correctly, hardware can generate performance improvements in software of up to 60% in an existing setup, with zero or minimal investment.

In this webcast Alex Bordei will look at how Impala, Elasticsearch, and Couchbase perform when scaled vertically and horizontally across a number of different bare metal setups. He’ll discuss benchmarks that spanned configurations ranging from one hex-core CPU to two deca-core CPUs, from 32 to 192 GB of RAM, from local to distributed storage, and from 2 to 14 instances.

Bits from the Data Store

Semi-regular field notes from the world of data:

  • Graphistry: Tucked away in the community room at the recent GraphLab conference, I took a few people to a demo by Graphistry, a startup that lets users visually interact with and analyze massive amounts of data. In particular, their technology can handle and draw many more points than d3.js, making it possible for users to visually examine much larger data sets. Based on the feedback I received, many attendees were impressed with Graphistry’s technology and direction. (Full disclosure: I’m an advisor to Graphistry.)
  • GraphLab Create version 0.9: Not only are there many more “toolkits” to choose from (including Gradient Boosting Trees), the new version includes tools for managing and monitoring analytic models and pipelines. More importantly, CEO Carlos Guestrin announced at the recent GraphLab conference that many components will be open-sourced in time for Strata NYC. While the company name (inherited from the original open source project) highlights graphs, GraphLab Create is actually more about tabular data than graphs. It’s no surprise the company diversified its offerings so quickly: it would be tough to build a standalone company focused entirely on graph analytics.
  • Lab41: I ran into friends from Lab41, an In-Q-Tel-funded software lab focused on big data. They have some interesting open source projects that data scientists and data engineers may like, including: (1) Dendrite, a software stack for analyzing large graphs that leverages the open source projects GraphLab, TitanDB, and AngularJS. (2) Redwood, which uses metadata to assign reputation scores and identify anomalous files in a trove of media or documents. These are initial offerings, and the good news is that Lab41 has many other open source, big data projects in the works.
  • Hardcore Data Science day at Strata NYC: We have a great lineup of speakers, and I’m particularly looking forward to my co-host Ben Recht’s talk. Register soon as the “best price” ends this Thursday (July 31st).
  • Here’s a chart I created, inspired by Bill Howe’s recent talk at MMDS. Bill’s chart poked fun at machine learning papers; I think this practice is even more common among big data vendors:

    [Chart: Big Data Vendors]

  • Upcoming Webcasts:

    Deep Learning for Hackers

    How do you get started using Deep Learning? In a previous post, I noted how many of the tools and best practices are locked away in “oral traditions” shared among practitioners. But recently, open source tools have made Deep Learning somewhat more accessible to hackers. In an upcoming webcast, I’m hosting noted hacker and startup founder Pete Warden as he gives an overview of some of the more popular tools in computer vision:

    There have been big improvements in image analysis over the last few years thanks to the adoption of deep learning neural networks to solve vision problems, but figuring out how to get started with them isn’t easy.

    In this webcast Pete Warden will walk through some popular open-source tools from the academic world, and show you step-by-step how to process images with them. Starting right from downloading the source and data, setting up the dependencies and environment, compiling, and then executing the libraries as part of a program, you’ll be shown how to solve your own computer vision problems.

    PredictionIO: an open source machine learning server

    PredictionIO, a startup that produces an open source machine learning server, has raised a seed round of $2.5M. The company’s engine allows developers to quickly integrate machine learning into products and services. The company’s machine learning server is open source and is available on Amazon Web Services. By going open source, the company hopes to attract developers who are interested in “Machine Learning As A Service” but are wary of proprietary solutions.

    Machine learning solution providers have traditionally highlighted their suite of algorithms. As I noted in an earlier post, there are different criteria for choosing machine learning algorithms (simplicity, interpretability, speed, scalability, and accuracy). Recently some companies are beginning to highlight tools for managing the analytic lifecycle (deploy/monitor/maintain models).

    PredictionIO joins a group of startups (including BigML, Skytree, and GraphLab) that develop tools that make it easier for companies to build and deploy (scalable) analytic models. The company is betting that an open source server will be much more attractive to developers and companies. I personally love open source tools, but I think the jury is still out on this matter: particularly in analytics, many large companies are willing to pay for proprietary solutions as long as they meet their needs and are easy to use and deploy.

    Analytics and machine learning are important components of most data applications. But data applications also require piecing many other tools together into a coherent pipeline (e.g., visualization & interactive analytics, ML & analytics, data wrangling & (realtime) data processing). The recently announced Databricks Cloud has garnered attention precisely because it pulls together many important components into an accessible and massively scalable (distributed computing) platform.

    [Full disclosure: I’m an advisor to Databricks.]

    Related content:
  • Gaining access to the best machine-learning methods
  • Data scientists tackle the analytic lifecycle
  • Databricks Cloud makes it easier to build Data Products

    Here is a link to Ali Ghodsi’s talk and demo that took the Spark Summit by storm. The demo really captures the power of Databricks Cloud: complex, high-performance, big data analytics at massive scale, accessible to anyone who can write simple scripts (it currently supports SQL, Python, and Scala).

    The demo culminates when Ali shows how easy it is to build a dashboard that ingests streaming data (from Twitter) and filters it via machine learning. Databricks Cloud confirms that Apache Spark is a great platform for building data products. Moreover, Databricks Cloud makes it much, much easier to build interesting data products. As Ali notes at the start of his demo, the philosophy behind Databricks Cloud is encapsulated in the following quote:

    “Simple things should be simple, complex things should be possible” (Alan Kay)

    If you’re interested in data science or data engineering, the demo is well worth watching.
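    The stream-plus-classifier pattern behind that dashboard can be sketched in a few lines. This is a toy Python illustration only, not Databricks Cloud code: in the demo the heavy lifting is done by Spark Streaming and a real model, whereas here a crude word-count scorer (all data and names are made up) stands in for the classifier that filters the feed:

```python
# Toy sketch of the demo's pattern: train a model, then keep only the
# stream items it flags as relevant before they reach the dashboard.
from collections import Counter

def train_keyword_model(labeled_tweets):
    """Count word frequencies per class -- a bare-bones stand-in for a classifier."""
    counts = {True: Counter(), False: Counter()}
    for text, label in labeled_tweets:
        counts[label].update(text.lower().split())
    return counts

def is_relevant(model, text):
    """Score a message by which class its words appear in more often."""
    words = text.lower().split()
    pos = sum(model[True][w] for w in words)
    neg = sum(model[False][w] for w in words)
    return pos > neg

# Tiny hand-labeled training set (illustrative only).
training = [
    ("spark summit demo was great", True),
    ("loving the new spark release", True),
    ("stuck in traffic again", False),
    ("traffic is terrible today", False),
]
model = train_keyword_model(training)

# Filter an incoming "stream" down to what the dashboard should chart.
stream = ["spark demo at the summit", "so much traffic downtown"]
dashboard_feed = [t for t in stream if is_relevant(model, t)]
print(dashboard_feed)  # only the Spark-related message survives
```

    In Databricks Cloud the same shape of pipeline runs distributed: the stream is a Twitter feed, and the filter is a trained MLlib model.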

    [Full disclosure: I’m an advisor to Databricks.]

    There are many use cases for graph databases and analytics

    Business users are becoming more comfortable with graph analytics

    [A version of this post appears on the O’Reilly Radar blog.]

    The rise of sensors and connected devices will lead to applications that draw on network/graph data management and analytics. As the number of devices surpasses the number of people — Cisco estimates 50 billion connected devices by 2020 — one can imagine applications that depend on data stored in graphs with many more nodes and edges than the ones currently maintained by social media companies.

    This means that researchers and companies will need to produce real-time tools and techniques that scale to much larger graphs (measured in terms of nodes & edges). I previously listed tools for tapping into graph data, and I continue to track improvements in accessibility, scalability, and performance. For example, at the just-concluded Spark Summit, it was apparent that GraphX remains a high-priority project within the Spark ecosystem.
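    To make “graph analytics at scale” concrete, here is a pure-Python sketch of PageRank by power iteration, the canonical computation that systems like GraphX distribute across graphs with billions of edges. The toy edge list and node names are invented for illustration; a real deployment would use a distributed framework, not this loop:

```python
# Minimal PageRank by power iteration over an edge list.
def pagerank(edges, damping=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    # Start with a uniform distribution over nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            if out[src]:
                # Each node splits its rank among its out-links.
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    nxt[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    nxt[n] += damping * rank[src] / len(nodes)
        rank = nxt
    return rank

# Toy graph: "c" is linked to by both "a" and "b".
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))
```

    The algorithm itself is simple; the engineering challenge GraphX tackles is running these iterations when the edge list no longer fits on one machine.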


    Super Simple Real-Time Big Data Backend

    I recently had a great conversation with Jodok Batlogg, co-founder and CEO of Crate Data. We talked about how his experience as CTO of StudiVZ and CEO of Lovely Systems informed how they designed and built CrateDB. A few months ago Crate ended up as the top story on Hacker News, which caught the founders by surprise! I’m looking forward to hosting a free webcast featuring Jodok on July 8th:

    Crate Data is a shared-nothing, fully searchable, document-oriented cluster data store. Today, developers need to “glue” several technologies together to store documents and blobs, and to support searches and queries over big data in near real time. This isn’t always simple or scalable, and it requires a lot of manual tuning, sharding, etc. Crate Data is an open source project that attempts to provide a super simple developers’ nirvana – a real-time SQL data store for big data – built on elasticsearch, Lucene, Netty and Presto. In this webcast we will demonstrate, with a step-by-step example, how a web service can be deployed with the full stack (data and application) on a single node, and how nodes can then be added as needed just by starting them. Crate is self-configuring and self-healing, and can be deployed on one device, many devices, or the cloud.