When harnessed correctly, hardware can yield software performance improvements of up to 60% in an existing setup, with zero or minimal additional investment.
In this webcast, Alex Bordei will look at how Impala, Elasticsearch, and Couchbase perform when scaled vertically and horizontally across a number of different bare-metal setups. He’ll discuss tests that scaled from one six-core CPU to two ten-core CPUs, from 32 to 192 GB of RAM, from local to distributed storage, and from 2 to 14 instances.
Semi-regular field notes from the world of data:
- Alex Bordei, Getting the Most Out of Your NoSQL DB: Best Practices for Optimizing Infrastructure Performance and Budget (2014-08-07)
- Olivier Grisel, What’s New in Scikit-learn 0.15 and What’s Cooking in the Development Branch? (2014-08-13)
How do you get started using Deep Learning? In a previous post, I noted how many of the tools and best practices are locked away in “oral traditions” shared among practitioners. But recently, open source tools have made Deep Learning somewhat more accessible to hackers. In an upcoming webcast, I’m hosting noted hacker and startup founder Pete Warden as he gives an overview of some of the more popular tools in computer vision:
There have been big improvements in image analysis over the last few years thanks to the adoption of deep learning neural networks to solve vision problems, but figuring out how to get started with them isn’t easy.
In this webcast Pete Warden will walk through some popular open-source tools from the academic world, and show you step-by-step how to process images with them. Starting right from downloading the source and data, setting up the dependencies and environment, compiling, and then executing the libraries as part of a program, you’ll be shown how to solve your own computer vision problems.
PredictionIO, a startup that produces an open source machine learning server, has raised a $2.5M seed round. The company’s engine allows developers to quickly integrate machine learning into products and services. Its machine learning server is open source and is available on Amazon Web Services. By releasing it as an open source package, the company hopes to attract developers who are interested in “machine learning as a service” but are wary of proprietary solutions.
Machine learning solution providers have traditionally highlighted their suite of algorithms. As I noted in an earlier post, there are different criteria for choosing machine learning algorithms (simplicity, interpretability, speed, scalability, and accuracy). Recently some companies are beginning to highlight tools for managing the analytic lifecycle (deploy/monitor/maintain models).
PredictionIO joins a group of startups (including Wise.io, BigML, Skytree, GraphLab) who develop tools that make it easier for companies to build and deploy (scalable) analytic models. The company is hoping that an open source server is much more attractive to developers and companies. I personally love open source tools, but I think the jury is out on this matter. Particularly for analytics, many large companies are willing to pay for proprietary solutions as long as they meet their needs, and are easy to use and deploy.
Analytics and machine learning are important components of most data applications. But data applications require piecing together many other tools into a coherent pipeline (e.g., visualization & interactive analytics, ML & analytics, data wrangling & (realtime) data processing). The recently announced Databricks Cloud has garnered attention precisely because it pulls together many important components into an accessible and massively scalable (distributed computing) platform.
[Full disclosure: I’m an advisor to Databricks.]
Here is a link to Ali Ghodsi’s talk and demo that took the Spark Summit by storm. The demo really captures the power of Databricks Cloud: complex, high-performance, big data analytics at massive scale, accessible to anyone who can write simple scripts (currently supports SQL, Python, Scala).
The demo culminates when Ali shows how easy it is to build a dashboard that consumes streaming data (Twitter) and filters it via machine learning. Databricks Cloud confirms that Apache Spark is a great platform for building data products. Moreover, Databricks Cloud makes it much, much easier to build interesting data products. As Ali notes at the start of his demo, the philosophy behind Databricks Cloud is encapsulated in the following quote:
“Simple things should be simple, complex things should be possible” (Alan Kay)
If you’re interested in data science or data engineering, the demo is well worth watching.
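The demo’s core pattern — classify items from a live stream and keep running counts behind a dashboard — is simple to sketch in plain Python. This is an illustration of the pattern only, not the Databricks Cloud API or Spark Streaming; the keyword “model” stands in for a real trained classifier:

```python
# Sketch of the stream -> filter (ML) -> dashboard-counts pattern.
# Plain Python stand-in; a real pipeline would use Spark Streaming
# and a trained model instead of keyword matching.
from collections import Counter

KEYWORDS = {"spark", "data"}  # stand-in for a trained classifier

def is_relevant(message: str) -> bool:
    """Toy 'model': keep messages that mention any tracked keyword."""
    return bool(set(message.lower().split()) & KEYWORDS)

def update_dashboard(stream, counts=None):
    """Fold a stream of messages into per-keyword running counts."""
    counts = counts if counts is not None else Counter()
    for msg in stream:
        if is_relevant(msg):
            for word in set(msg.lower().split()) & KEYWORDS:
                counts[word] += 1
    return counts

tweets = ["Spark Summit demo", "lunch time", "big data with Spark"]
print(update_dashboard(tweets))
```

In a streaming setting, `update_dashboard` would be called on each micro-batch with the same `counts` object, so the dashboard state accumulates over time.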
Business users are becoming more comfortable with graph analytics
[A version of this post appears on the O’Reilly Radar blog.]
The rise of sensors and connected devices will lead to applications that draw from network/graph data management and analytics. As the number of devices surpasses the number of people — Cisco estimates 50 billion connected devices by 2020 — one can imagine applications that depend on data stored in graphs with many more nodes and edges than the ones currently maintained by social media companies.
This means that researchers and companies will need to produce real-time tools and techniques that scale to much larger graphs (measured in terms of nodes & edges). I previously listed tools for tapping into graph data, and I continue to track improvements in accessibility, scalability, and performance. For example, at the just-concluded Spark Summit, it was apparent that GraphX remains a high-priority project within the Spark ecosystem.
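The scale argument is easy to make concrete: a graph of connected devices is just nodes and edges, and the size metrics mentioned above (node count, edge count, per-node degree) fall out of a simple adjacency structure. A minimal Python sketch with hypothetical device names (illustrative only — systems like GraphX operate on distributed datasets, not in-memory dictionaries):

```python
# Toy device graph as an adjacency list, with the node/edge counts
# that determine how hard the graph is to store and query at scale.
from collections import defaultdict

edges = [("sensor_a", "gateway_1"), ("sensor_b", "gateway_1"),
         ("gateway_1", "cloud"), ("sensor_c", "gateway_2"),
         ("gateway_2", "cloud")]

adjacency = defaultdict(set)
for src, dst in edges:          # undirected: store both directions
    adjacency[src].add(dst)
    adjacency[dst].add(src)

num_nodes = len(adjacency)
num_edges = sum(len(nbrs) for nbrs in adjacency.values()) // 2
degree = {node: len(nbrs) for node, nbrs in adjacency.items()}

print(num_nodes, num_edges)     # 6 5
print(degree["gateway_1"])      # 3
```

With 50 billion devices, the adjacency structure alone outgrows any single machine, which is why distributed graph processing is the interesting problem here.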
I recently had a great conversation with Jodok Batlogg, Co-Founder and CEO, Crate Data. We talked about how his experience as CTO of StudiVZ and CEO of Lovely Systems informed how they designed and built CrateDB. A few months ago Crate ended up as the top story on Hacker News, which caught the founders by surprise! I’m looking forward to hosting a free webcast featuring Jodok on July 8th:
Crate Data is a shared-nothing, fully searchable, document-oriented cluster data store. Today, developers need to “glue” several technologies together to store documents and blobs and to support searches and queries in near real time over big data. This isn’t always simple or scalable, and it requires a lot of manual tuning, sharding, etc. Crate Data is an open source project that attempts to provide a super simple developers’ nirvana – a real-time SQL data store for big data – using Elasticsearch, Lucene, Netty and Presto. In this webcast we will demonstrate, with a step-by-step example, how a web service can be deployed with the full service stack (data and application) on a single node, and then how nodes can be added as needed just by starting them. Crate is self-configuring and self-healing and can be deployed on one device, many devices or the cloud.
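The “SQL data store” framing is the interesting part: Crate accepts SQL statements as JSON over HTTP, so a client needs nothing beyond an HTTP library. A minimal sketch of what a query might look like — the endpoint path, default port, and payload shape are assumptions based on the project’s documentation at the time, so treat them as illustrative:

```python
# Sketch of querying Crate Data over HTTP: SQL goes in a JSON body
# to the node's _sql endpoint (default port 4200 is an assumption
# from the project's docs; adjust for your deployment).
import json
from urllib import request

def make_sql_payload(stmt, args=None):
    """Build the JSON body the HTTP endpoint expects."""
    payload = {"stmt": stmt}
    if args is not None:
        payload["args"] = args   # parameter values for ? placeholders
    return json.dumps(payload).encode("utf-8")

def query_crate(stmt, args=None, host="http://localhost:4200"):
    req = request.Request(
        host + "/_sql",
        data=make_sql_payload(stmt, args),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:    # requires a running node
        return json.load(resp)

# e.g. query_crate("SELECT name FROM tweets WHERE id = ?", [42])
print(make_sql_payload("SELECT 1"))
```

Because every node speaks the same protocol, “adding capacity” is just pointing the same client at a cluster with more nodes started.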