Kuala Lumpur and Malacca City

The week before Strata+Hadoop World in Singapore, we snuck up to Malaysia for a quick vacation. The food in Malaysia, particularly at hawker (food) courts, was outstanding. We stayed in Petaling Jaya (“PJ”), an area that young professionals and families seem to favor. Below are some highlights:

Selera Malam SS2 Food Court

Wong Ah Wah (or W.A.W.) on Jalan Alor: the chicken wings are simply amazing.

Other things worth noting: we enjoyed the KL Bird Park and the Islamic Arts Museum Malaysia.

Malacca City

Restoran Asam Pedas Selera Kampung

Restoran Ole Sayang Sdn. Bhd.

Kopi Luwak (we found a distributor who carried the wild/“free range” variety)

Singapore Eats

This year marks the debut of Strata+Hadoop World in Singapore, and I decided to spend most of this week checking out the city and visiting with friends. Below are some of the places that I’ll be recommending to friends and Strata attendees:

Ah-Tai Hainanese Chicken Rice (two stalls from Tian Tian, where its chef of 20 years set up shop).


MTR, in Little India. Walking through Serangoon on a Sunday is something I highly recommend.

Usman Restaurant (Pakistani food in Little India)

Chomp Chomp Food Centre (via Eugene Teo)

Tiong Bahru – one of the oldest housing estates in Singapore – is a great place to relax and kick back. Here are two Tiong Bahru places we enjoyed:

Ting Heng Seafood Restaurant

Architecting big data applications in the cloud

The O’Reilly Data Show podcast: Jai Ranganathan on the Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.


[This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Given the quick pace of innovation in the data ecosystem, we like to step back from the details of individual components, architectures, and applications in order to take a wider view of the big data landscape. This allows us to gauge the progress of technology and infrastructure, shifting our attention from individual components like Spark and Kafka to larger trends.

Some of the larger trends we’ve been exploring include the capabilities of distributed machine learning and the tradeoffs and design decisions involved in cloud architecture and stream processing.

In this episode of the O’Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and developments in hardware:

Large-scale machine learning

This sounds like something that should already exist in really good form, but one of the things that I’m really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do this, relative to what you can experience in a single-machine environment like scikit-learn or R, the set of things you can do in a distributed fashion is limited. … It’s not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. … And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that’s the final answer, but we do need systems to provide the kind of depth that you’re typically used to in the single-machine environment. That’s just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.
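To make the gap concrete, here is a minimal sketch (mine, not from the episode) of distributed model fitting with Spark MLlib’s DataFrame-based API; the data, app name, and parameters are illustrative:

```python
# Minimal sketch: fitting a model with Spark MLlib so the work is
# distributed across a cluster rather than confined to one machine.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data; in practice this would be a large, partitioned DataFrame.
train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0])),
     (1.0, Vectors.dense([0.0, 1.3, 1.0]))],
    ["label", "features"])

# The fit is distributed by Spark; compare scikit-learn's fit(),
# which runs on a single machine.
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients)
```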

Architecting data applications in the cloud

There are some fundamental design principles behind the original HDFS implementation that don’t actually work in the cloud. For example, the notion that data locality is fundamental to the system design starts to change in the cloud: the large cloud providers are doing all these software-defined networking tricks and can deliver bisection bandwidth, like 40 gigs per second, across their data centers … suddenly, you’re talking about moving hundreds of terabytes of data back and forth between a storage layer and a compute layer without any huge performance penalties. There is a performance disadvantage to this, but it’s not as bad as you’d think. Some of the core design principles in Hadoop have to change when you think about this kind of new data center design. … The cloud part is really interesting, but what’s really interesting to me is that there’s a fundamental shift in the way data centers are being designed, and we have to make sure Hadoop is designed to capitalize on it.

… A lot of the work we do on the cloud is to optimize working with these object stores effectively. Obviously, you still need some local storage for things like spill, but that’s not really the same as a distributed file system. Then, it’s really a question of getting all the frameworks to run really effectively against an object store.
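As a rough sketch of what running a framework against an object store looks like (the bucket and path below are made up, and the s3a connector and credentials are assumed to be configured):

```python
# Sketch: pointing Spark directly at an object store instead of a
# co-located HDFS. The bucket and path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-store-sketch").getOrCreate()

# Reads stream from the storage layer over the network; there is no
# HDFS-style data locality for the scheduler to exploit.
df = spark.read.parquet("s3a://example-bucket/events/2015/09/")
df.groupBy("event_type").count().show()
```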

Paying attention to hardware trends

When I joined Cloudera, a customer going all out and buying the most expensive hardware was buying 64 gigabytes of RAM. Alongside that 64 gigabytes of RAM, they also had 12 disk spindles of two terabytes each, for 24 terabytes of disk. Today, many of my customers buy 256 gigabytes of RAM, or even potentially 384 to 512 gigabytes of RAM. The amount of disk is still exactly the same: because disks don’t spin faster and you still want a certain level of throughput, you’re still looking at 24 terabytes of disk in your machine. In just two years, we have seen it go from 64 to potentially 512. I don’t think this trend is going to stop, and within three years we are suddenly going to be looking at one-terabyte RAM machines.

… What we’re finding in a lot of the things we do at Cloudera, like Kudu or Impala, is that fundamentally, we really care about wringing performance out of the CPU. A lot of this is, ‘can I do vectorized operations?’ and ‘can I take advantage of my L2 cache more effectively?’ because that allows my CPU cycles to be spent more efficiently. It really changes the bottleneck from the I/O subsystem to the CPU subsystem, and everything you can do to eke out performance there really matters.
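As a toy illustration of what vectorized operations buy you once the CPU is the bottleneck (this is NumPy, not Kudu or Impala internals; the sizes are arbitrary):

```python
# Toy comparison of scalar vs. vectorized summation. The vectorized call
# keeps data flowing through the CPU's caches and SIMD units instead of
# paying interpreter overhead per element.
import time
import numpy as np

values = np.random.rand(1_000_000)

start = time.time()
total = 0.0
for v in values:          # one interpreted operation per element
    total += v
scalar_seconds = time.time() - start

start = time.time()
total_vec = values.sum()  # a single vectorized call into optimized C
vector_seconds = time.time() - start

print(f"scalar: {scalar_seconds:.3f}s, vectorized: {vector_seconds:.4f}s")
```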

… Project Tungsten is basically an effort in the Spark community to do things in a more CPU-efficient way, whether that’s vectorizing operations or moving away from managed memory to managing byte buffers directly, so you can have much more efficient handling of memory and get better CPU efficiency as well.

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

[This piece was co-written by Shannon Cutt. A version of this post appears on the O’Reilly Radar.]

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau, one of the lead engineers on Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also, we really wanted a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

The Dataflow model

There are two projects that we say Dataflow came out of. The FlumeJava project, which, for anybody who is not familiar, is a higher-level language for describing large-scale, massive-scale data processing systems and then running them through an optimizer and coming up with an execution plan. … We had all sorts of use cases at Google where people were stringing together these series of MapReduce [jobs]. It was complex and difficult to deal with, and you had to try to manually optimize them for performance. If you do what the database folks have done, [you] run it through an optimizer. … Flume is the primary data processing system, so as part of that, for the last few years we’ve been moving MillWheel to be essentially a secondary execution engine for FlumeJava. You can either run it in batch mode on MapReduce or you can execute it on MillWheel. … FlumeJava plus MillWheel — it’s this evolution that’s happened internally, and now we’ve externalized it.
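That externalized form of FlumeJava plus MillWheel became Cloud Dataflow and, later, the open source Apache Beam SDKs. As a hedged sketch of the programming model (using today’s Beam Python SDK; the paths are illustrative), the same pipeline can run on a batch or a streaming runner:

```python
# Sketch: a word-count pipeline in the Dataflow/Beam model. The logical
# pipeline is separate from the execution engine (runner) that runs it.
import apache_beam as beam

with beam.Pipeline() as p:  # defaults to the local DirectRunner
    (p
     | "Read" >> beam.io.ReadFromText("gs://example-bucket/input.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
     | "Write" >> beam.io.WriteToText("gs://example-bucket/counts"))
```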

Balancing correctness, latency, and cost

There’s a wide variety of use cases out there. Sometimes you need high correctness; sometimes you don’t; sometimes you need low latency; sometimes higher latency is okay. Sometimes you’re willing to pay a lot for those other two features; sometimes you don’t want to pay as much. The real key, at least as far as having a system that is broadly applicable, is being flexible and giving people the choices to make the trade-offs they have to make. … There is a single knob, which is: which runner am I going to use, batch or streaming? Aside from that, the other level at which you get to make these choices is when you’re deciding exactly when you materialize your results within the pipeline. … Once you have a streaming system or streaming execution engine that gives you this automatic scaling, like Dataflow does, and gives you consistency and strong tools for working with your data, then people start to build really complicated services on them. It may not just be data processing. It actually becomes a nice platform for orchestrating events or orchestrating distributed state machines and things like that. We have a lot of users internally doing this stuff.
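In Beam/Dataflow terms (a sketch of my own, assuming the current Beam Python SDK, not code from the episode), that choice of when to materialize results shows up as windowing, trigger, and accumulation-mode settings:

```python
# Sketch: the correctness/latency/cost trade-off expressed as windowing
# and trigger choices in the Beam model.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

def windowed_counts(events):
    """events: a PCollection of (key, count) pairs with event timestamps."""
    return (events
            | "Window" >> beam.WindowInto(
                window.FixedWindows(60),  # one-minute event-time windows
                # Early panes every 10s of processing time favor latency;
                # the on-time pane at the watermark favors correctness.
                trigger=AfterWatermark(early=AfterProcessingTime(10)),
                # Accumulating panes recompute more (higher cost) but are
                # simpler for downstream consumers to interpret.
                accumulation_mode=AccumulationMode.ACCUMULATING)
            | "Count" >> beam.CombinePerKey(sum))
```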

Subscribe to the O’Reilly Data Show Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS
