Introduction to Tachyon and a deep dive into Baidu’s production use case

I’m pleased to announce a webcast I’ll be hosting featuring the co-creator of Tachyon (full disclosure: I’m an advisor to Tachyon Nexus) alongside one of the architects behind Baidu’s big data platform. I hope to see you online on Sept 14th!

Tachyon is a memory-centric, fault-tolerant distributed storage system that enables reliable data sharing at memory speed. It was born in the UC Berkeley AMPLab and is completely open source. Multiple companies have deployed Tachyon; Baidu, for example, runs a production Tachyon cluster of 150 nodes managing over 2 PB of storage. Tachyon has more than 100 contributors from over 30 institutions, including Baidu, IBM, Intel, and Yahoo.
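
For readers who haven’t tried it, here is a minimal sketch (my own, not taken from the webcast) of how a Spark job can share data through Tachyon. It assumes a Tachyon master reachable at tachyon://master:19998 and the Tachyon client jar on Spark’s classpath; the paths and application name are made up for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="tachyon-sketch")

# Write a small RDD into Tachyon-managed storage via the Hadoop FileSystem API.
sc.parallelize(["a", "b", "c"]).saveAsTextFile("tachyon://master:19998/tmp/letters")

# Later jobs (possibly from other frameworks) can re-read the same data at
# memory speed, since Tachyon keeps hot files in memory.
letters = sc.textFile("tachyon://master:19998/tmp/letters")
print(letters.count())
```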

In this webcast, Haoyuan Li from Tachyon Nexus will present an overview of Tachyon, as well as recent developments and use cases. After that, Shaoshan Liu from Baidu will present Baidu’s experience with Tachyon. He will describe how they achieved a 30x end-to-end performance improvement using Tachyon, the problems they encountered when they started using Tachyon and how they addressed them, what new features they would like to see, and their plans to scale further.

Bringing Apache Spark closer to bare metal

Fans and users of Apache Spark will want to attend a webcast I’ll be hosting next week (Sept 3rd), featuring Josh Rosen – one of the early developers behind PySpark:

Deep dive into Project Tungsten: Bring Spark closer to bare metal

Project Tungsten focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware. This effort includes three initiatives:

  • Memory Management and Binary Processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
  • Cache-aware computation: algorithms and data structures that exploit the memory hierarchy
  • Code generation: using code generation to exploit modern compilers and CPUs

Project Tungsten will be the largest change to Spark’s execution engine since the project’s inception. In this talk, we will give an update on its progress and dive into some of the technical challenges we are solving.
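
As a rough illustration of where these optimizations apply, the sketch below (my own, not from the talk) contrasts an RDD aggregation, which operates on generic objects, with the equivalent DataFrame aggregation, which runs through Spark SQL’s physical planner, the layer where Tungsten’s binary processing and code generation take effect. The SQLContext setup and column names are assumptions for the example.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="tungsten-sketch")
sqlContext = SQLContext(sc)

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# RDD version: works on generic objects, with per-object and GC overhead.
rdd_sums = pairs.reduceByKey(lambda x, y: x + y).collect()

# DataFrame version: the aggregation is planned and executed by Spark SQL,
# which is where Tungsten's managed memory and generated code apply.
df = sqlContext.createDataFrame(pairs.map(lambda kv: Row(key=kv[0], value=kv[1])))
df_sums = df.groupBy("key").sum("value").collect()

print(rdd_sums, df_sums)
```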

Scikit-Learn 0.16

I’ll be hosting a webcast featuring two of the key contributors to what is arguably one of the most popular machine learning tools today – scikit-learn:

News from Scikit-Learn 0.16 and Soon-To-Be Gems for the Next Release
presented by: Olivier Grisel, Andreas Mueller

This webcast will review Scikit-learn, a widely used open source machine learning library in Python, and discuss some of the new features of the recent 0.16 release. Highlights of the release include new algorithms such as approximate nearest neighbors search, Birch clustering, and a path algorithm for logistic regression, probability calibration, as well as improved ease of use and interoperability with the Pandas library. We will also highlight some up-and-coming contributions, such as Latent Dirichlet Allocation, supervised neural networks, and a complete revamping of the Gaussian Process module.
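
To make a couple of those features concrete, here is a small, self-contained sketch (synthetic data, illustrative parameters, 0.16-era import paths) exercising Birch clustering and the new probability-calibration wrapper:

```python
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.cross_validation import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Birch clustering, new in 0.16.
labels = Birch(n_clusters=3).fit_predict(X_train)

# Probability calibration, also new in 0.16: wrap a margin-based classifier
# to obtain calibrated class probabilities.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)
print(calibrated.predict_proba(X_test)[:5])
```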

In addition, Olivier will be leading what promises to be a popular tutorial at Strata+Hadoop World in London in early May.

scikit-learn webcast and tutorial

Apache Spark 1.3, the new Dataframe API, and Spark performance

Over the course of one week, I’ll be hosting two webcasts featuring Spark release manager Patrick Wendell and Spark committer Kay Ousterhout. Register now!

  • Patrick Wendell: Spark 1.3 and Spark’s New Dataframe API (March 25th at 9 a.m. California time)

    In this webcast, Patrick Wendell from Databricks will be speaking about Spark’s new 1.3 release. Spark 1.3 brings extensions to all of Spark’s major components (SQL, MLlib, Streaming) along with a new cross-cutting DataFrame API. The talk will outline what’s new in Spark 1.3 and provide a deep dive on the DataFrame feature (see the sketch after this list). We’ll leave plenty of time for Q and A about the release or about Spark in general.

  • Kay Ousterhout: Making Sense of Spark Performance (April 1st at 9 a.m. California time)

    There has been significant work dedicated to improving the performance of big-data systems like Spark, but comparatively little effort has been spent systematically analyzing the performance bottlenecks of these systems. In this talk, I’ll take a deep dive into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark from UC Berkeley) and one production workload, and demonstrate that many commonly held beliefs about performance bottlenecks do not hold. In particular, I’ll demonstrate that CPU (and not I/O) is often the bottleneck, that network performance can improve job completion time by a median of at most 4%, and that the causes of most stragglers can be identified and fixed. I’ll also demo how the open-source tools I developed can be used to understand the performance of other Spark jobs.
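
As a taste of the API covered in the first webcast, here is a minimal DataFrame sketch (my own toy example; the SQLContext setup and column names are invented for illustration):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="dataframe-sketch")
sqlContext = SQLContext(sc)

rows = sc.parallelize([
    Row(dept="eng", salary=100),
    Row(dept="eng", salary=120),
    Row(dept="sales", salary=90),
])
df = sqlContext.createDataFrame(rows)

# Declarative, optimizable operations instead of hand-written RDD code.
df.filter(df.salary > 95).groupBy("dept").avg("salary").show()
```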

“Humans-in-the-loop” machine learning systems

Next week I’ll be hosting a webcast featuring Adam Marcus, one of the foremost experts on the topic of “humans-in-the-loop” machine learning systems. It’s a subject many data scientists have heard about, but very few have had the experience of building production systems that leverage humans:

Crowdsourcing marketplaces like Elance-oDesk or CrowdFlower give us access to people all over the world who can take on a variety of tasks: acting as virtual personal assistants, labeling images, or cleaning up gnarly datasets. Humans can solve tasks that artificial intelligence is not yet able to solve, or needs help in solving, without having to resort to complex machine learning or statistics. But humans are quirky: give them bad instructions, allow them to get bored, or make them do too repetitive a task, and they will start making mistakes. In this webcast, I’ll explain how to effectively benefit from crowd workers on your most challenging tasks, using examples from the wild and from our work at GoDaddy.

Machine learning and crowdsourcing are at the core of most of the problems we solve on the Locu team at GoDaddy. When possible, we automate tasks with the help of trained regressions and classifiers. However, it’s not always possible to build machine-only decision-making tools, and we often need to marry machines and crowds. During the webcast, I will highlight how we build human-machine hybrids and benefit from active learning workflows. I’ll also discuss lessons from 17 conversations that Aditya Parameswaran and I have had with companies that make heavy use of crowd work, collected for our upcoming book.
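
To make the “marry machines and crowds” idea concrete, here is a generic sketch of an uncertainty-sampling loop. It is not GoDaddy’s actual pipeline, and the ask_crowd helper is a hypothetical stand-in for whatever crowdsourcing-marketplace API you use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_crowd(item):
    """Hypothetical stand-in: post `item` to a crowd marketplace and
    return an adjudicated label."""
    raise NotImplementedError

def active_learning_round(model, X_labeled, y_labeled, X_unlabeled, budget=10):
    """One round: train, find the least-confident predictions, buy labels
    for them from the crowd, and grow the labeled set."""
    model.fit(X_labeled, y_labeled)
    confidence = model.predict_proba(X_unlabeled).max(axis=1)

    # Spend the crowd budget on the items the model is least sure about.
    uncertain = np.argsort(confidence)[:budget]
    new_labels = [ask_crowd(X_unlabeled[i]) for i in uncertain]

    X_labeled = np.vstack([X_labeled, X_unlabeled[uncertain]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    return model, X_labeled, y_labeled

# Typical use: repeat active_learning_round(LogisticRegression(), ...) until
# accuracy on a held-out set stops improving.
```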

A recent article in the NYTimes Magazine mentioned a machine-learning system built by some neuroscience researchers that is an excellent example of having “humans-in-the-loop”:

In 2012, Seung started EyeWire, an online game that challenges the public to trace neuronal wiring — now using computers, not pens — in the retina of a mouse’s eye. Seung’s artificial-intelligence algorithms process the raw images, then players earn points as they mark, paint-by-numbers style, the branches of a neuron through a three-dimensional cube. The game has attracted 165,000 players in 164 countries. In effect, Seung is employing artificial intelligence as a force multiplier for a global, all-volunteer army that has included Lorinda, a Missouri grandmother who also paints watercolors, and Iliyan (a.k.a. @crazyman4865), a high-school student in Bulgaria who once played for nearly 24 hours straight. Computers do what they can and then leave the rest to what remains the most potent pattern-recognition technology ever discovered: the human brain.

For more on this important topic, join me and Adam on January 22nd!

Spark 1.2 and Beyond

Next week I’ll be hosting a webcast with Spark’s release manager – and Databricks co-founder – Patrick Wendell. (Full disclosure: I’m an advisor to Databricks.)

In this webcast, Patrick Wendell from Databricks will be speaking about Spark’s new 1.2 release. Spark 1.2 brings performance and usability improvements in Spark’s core engine, a major new API for MLlib, expanded ML support in Python, and a full high-availability (H/A) mode in Spark Streaming, along with several other features. The talk will outline what’s new in Spark 1.2 and leave plenty of time for Q and A about the release or about Spark in general.
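
As a small taste of the streaming side of the release, the sketch below uses the Python Streaming API that arrived alongside 1.2, with checkpointing enabled as the building block for driver recovery. The host, port, and checkpoint path are placeholders, and the H/A write-ahead-log switch itself is a SparkConf setting not shown here.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  # needed for recovery

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```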

We’ll be debuting a new Spark track at the forthcoming Strata+Hadoop World in San Jose. We’re also offering a 3-day Advanced Spark training course, a one-day hands-on tutorial (Spark Camp), and the Databricks/O’Reilly Spark developer certification exam.

Bitcoin and the Future of Money

I’ll be hosting a free webcast featuring Andreas Antonopoulos this Wednesday. Author of the new book Mastering Bitcoin, Andreas has emerged as one of the most popular & eloquent proponents of cryptocurrencies and related technologies:

Bitcoin technology is taking the world of finance by storm. Bitcoin, and the blockchain technology at its core, can be used to quickly build secure global financial services on an open and decentralized platform. Join this webcast to learn what bitcoin is, what makes it special, how to get it, and how to use it.
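
For readers who want a feel for what “secure” means here, below is a toy Python illustration of the proof-of-work idea underlying the blockchain. It uses a deliberately tiny difficulty and is not real Bitcoin mining code.

```python
import hashlib

def mine(header, difficulty_zero_bytes=2):
    """Find a nonce whose double-SHA-256 digest starts with the required
    number of zero bytes (a toy version of Bitcoin's difficulty target)."""
    target = b"\x00" * difficulty_zero_bytes
    nonce = 0
    while True:
        digest = hashlib.sha256(
            hashlib.sha256(header + nonce.to_bytes(8, "little")).digest()
        ).digest()
        if digest[:difficulty_zero_bytes] == target:
            return nonce, digest.hex()
        nonce += 1

nonce, digest = mine(b"example block header")
print(nonce, digest)
```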

For more, come to Bitcoin & the Blockchain: An O’Reilly Radar Summit, January 27, 2015, at Fort Mason in San Francisco.