Time-turner: Strata San Jose 2016, day 2

There are so many good talks happening at the same time that it’s impossible not to miss out on some good sessions. But imagine I had a time-turner necklace and could actually “attend” 3 (maybe 5) sessions happening at the same time. Taking into account my current personal interests and tastes, here’s how my day would look:

11:00 a.m.

11:50 a.m.

1:50 p.m.

2:40 p.m.

4:20 p.m.

Time-turner: Strata San Jose 2016, day 1

There are so many good talks happening at the same time that it’s impossible not to miss out on some good sessions. But imagine I had a time-turner necklace and could actually “attend” 3 (maybe 5) sessions happening at the same time. Taking into account my current personal interests and tastes, here’s how my day would look:

11:00 a.m.

11:50 a.m.

1:50 p.m.

2:40 p.m.

4:20 p.m.

5:10 p.m.

Hardcore Data Science, California 2016

Ben Recht and I organized another great edition of Hardcore Data Science in San Jose today. As I was preparing to host the track, I had an inkling we had another outstanding sequence of presentations. The day covered hot topics like deep neural networks, practical advice on how to do data science & machine learning at scale, feature engineering, graphs, anomaly detection, structured data extraction, and many other areas at the heart of A.I. From the very first talk, sessions were well attended, the audience was attentive, and the energy in the room was high – and it remained that way throughout the day. A summary can be found below.


Democratizing business analytics

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Duncan Ross on the evolution of analytics, data mining, and data philanthropy.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with one of Strata + Hadoop World’s most popular teachers—Duncan Ross, data and analytics director at TES Global. In his long career in data, Ross has seen several stages of the evolution of tools, techniques, and training programs, and along the way he has interacted with business managers in many countries and regions. In keeping with his wide-ranging interests, we discussed many topics, including business analytics, data science training programs, data philanthropy and data for good, and university rankings.

Here are some highlights from our conversation:

Democratizing big data and data science

If you look over the last 30 years, in some ways things have moved on a lot. There is more flexibility and choice around the software that’s available now. However, in terms of strict usability, R clearly isn’t as elegant or usable as Clementine, which is what we had back in the late ’90s. I mean, Clementine was specifically designed to allow non-technical people to do data analysis.

As we’ve moved into the world of Hadoop, and R, and Python, et cetera, in some ways we’ve taken a step back, because now, in order to use those tools effectively, or at the most detailed level, you need people who have the ability to do some level of programming. You may say, ‘well, that’s actually quite a good thing, because those are good skills to have.’ Then there is the counter-argument that says, ‘if we want this truly to be democratized, we want someone who has a marketing focus to be able to pick up these technologies and use them effectively.’ Then either we need to simplify the software, or we need to find other mechanisms of giving them control.

Data for evil and data for good

Using Data for Evil is our annual roundup of examples of how people have done things badly over the year. We use this as a way of highlighting what you shouldn’t do, and hopefully inspiring people instead to do things for good. We will be updating our Necromantic Quadrant of Evil to show which organizations have improved their evilness from previous years. We will be looking at particularly great examples of malfeasance with data. The ability for organizations to do just plain evil stuff with data grows every year.

… We definitely have had people who have come to these events, come to the presentations, and actually used this as a springboard into data philanthropy, giving back their time and their commitment to using data for good. I hope we can have a wider impact, and that maybe we can help turn around that oil tanker of evil heading for the coast. That’s a horrible metaphor, but you get the idea.

… The positive spin is that it’s getting easier because people are more aware of when data is being misapplied, and therefore, it’s reported more, so we have more cases. Actually, I think there is a whole new category of evil that is coming up this year, and I will be talking about that at Using Data for EVIL IV – The Journey Home.

Ranking the world’s universities

As you might imagine, we have to use data that is directly comparable. There are many other university missions, for example the teaching mission, which is really important but a nightmare to try to measure as soon as you go across an international boundary. I’ll give you a really clear example: imagine we wanted to rank or evaluate universities by graduate employment rates. The challenge is that the graduate employment rate is affected by the natural unemployment rate where you are. If you have a university in New York, what’s the New York unemployment rate? … Then you have Singapore, which has an official unemployment rate of 0%. As soon as you look across international boundaries, you hit these challenges, so we have to have metrics that are consistent and that have some meaning. We look at some measures around teaching, but they are mostly focused on the input, so how much resource a university has to put into the teaching mission. We look at some metrics around research, both input to research and also a measure of the output. Then we look at some measures around internationalization and industry links. … We effectively use a purchasing power parity measure, which allows us to say how many units of local currency it would take to buy $1 worth of stuff. … It gives us a way of saying, yes, you’re a university in Singapore, but your cost basis isn’t the same as a university in California.
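The purchasing power parity adjustment Ross describes is straightforward to make concrete. Below is a minimal sketch of normalizing a per-student funding figure across countries; the PPP factors, institutions, and amounts are invented for illustration, and this is not TES Global’s actual methodology:

```python
# Minimal sketch of a purchasing-power-parity (PPP) adjustment for a
# cross-country metric. A PPP factor is the number of units of local
# currency needed to buy what $1 buys in the U.S. All figures and
# institutions below are invented for illustration; this is not the
# actual ranking methodology.

ppp_factor = {
    "USA": 1.00,
    "Singapore": 0.85,  # hypothetical
    "India": 21.0,      # hypothetical
}

def ppp_adjusted(amount_local: float, country: str) -> float:
    """Convert a local-currency amount into comparable 'international dollars'."""
    return amount_local / ppp_factor[country]

# Per-student institutional income in local currency (hypothetical).
funding = {
    ("NUS", "Singapore"): 60_000,      # SGD
    ("IIT Delhi", "India"): 500_000,   # INR
    ("UC Berkeley", "USA"): 55_000,    # USD
}

for (university, country), amount in funding.items():
    print(f"{university}: {ppp_adjusted(amount, country):,.0f} intl. $ per student")
```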

Of those factors, I think one of the interesting ones that stands out, and we keep coming back to, is this idea of internationalization. All of the work that’s been done suggests that universities that have more of an international outlook are more successful. They have a better learning environment, a better teaching environment, and a better research environment. There is some evidence around the citations as well; if you do international research collaborations, they tend to have a better record when you look at the bibliometric data.

Editor’s note: Duncan Ross and Francine Bennett will co-present two sessions at Strata + Hadoop World London: Using Data for EVIL IV – The Journey Home and The best university in the world.


Stream processing and messaging systems for the IoT age

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: M.C. Srivas on streaming, enterprise grade systems, the Internet of Things, and data for social good.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with M.C. Srivas, co-founder of MapR and currently Chief Architect for Data at Uber. We discussed his long career in data management and his experience building a variety of distributed systems. In the course of his career, Srivas has architected key components that now comprise many data platforms (distributed file system, database, query engine, messaging system, etc.).

As Srivas is quick to point out, for these systems to be widely deployed in the enterprise, they require features like security, disaster recovery, and support for multiple data centers. We covered many topics in big data, with particular emphasis on real-time systems and applications. Below are some highlights from our conversation:

Applications and systems that run on multiple data centers

Ad serving has a limit of about 70 to 80 milliseconds in which you have to reply with an advertisement. When you click on a Web page, the banner ads and the ads on the side and the bottom have to be served within 80 milliseconds. People place data centers across the world near each of the major population centers, where there’s a lot of activity. Many of our customers have data centers in Japan, in China, in Singapore, in Hong Kong, in India, in Russia, in Germany, across the United States, and worldwide. However, the billing is typically consolidated, so they bring data from all these data centers into central data centers, where they process the entire clickstream and understand how to bill it back to their customers.

… Then they need a clean way to bring these clickstreams back into the central data centers, maybe running in the U.S. or in Japan or Germany, or wherever the consolidated, overall view of the customer is created. Typically, this has been done by running completely independent Kafka systems in each place. As soon as that happens, the producers and the consumers are not coordinated across data centers. Think about a data center in Japan that has a Kafka cluster running. It cannot fail over to the Kafka cluster in Hong Kong, because that’s a completely independent cluster that doesn’t understand what has been consumed and what has been produced in Japan. If a consumer who was consuming things from the Japanese Kafka moved to the Hong Kong Kafka, they would get garbage. This is the main problem that a lot of customers asked us to solve.
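The failover problem Srivas describes comes down to consumer offsets: each independent cluster assigns its own log positions, so an offset committed against one cluster is meaningless against another. Here is a toy sketch of the mismatch (plain Python, no Kafka client; the log contents are invented):

```python
# Toy illustration of why consumer offsets don't transfer between
# independent Kafka clusters: each cluster orders and numbers its log
# on its own, so "offset 2" names different events in each one.
# Log contents are invented for illustration.

tokyo_log = ["click-17", "click-42", "click-99"]      # offsets 0, 1, 2
hongkong_log = ["click-42", "click-99", "click-17"]   # same events, own order

committed = 2  # the consumer in Tokyo has processed offsets 0 and 1

# Failover: the consumer moves to Hong Kong and resumes at its old offset.
print("Resumes with:", hongkong_log[committed:])  # ['click-17'] -- a duplicate
print("Skips over:", hongkong_log[:committed])    # ['click-42', 'click-99']

# click-42 was already processed in Tokyo, but click-99 was not: the
# consumer both re-reads old data and silently loses new data, because
# a committed offset has no meaning in an independent cluster.
```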

… The data sources have now gone not into a few data centers, but into millions of data centers. Think about every self-driving car. Every self-driving car is a data center in itself. It generates so much data. Think about a plane flying. A plane flying is a full data center. There are 400 people on the plane, it’s a captive audience, and there’s enough data generated just for the preventative maintenance on the plane anyway. That’s the thinking behind MapR Streams: what do we need at Internet of Things scale?

Streaming and messaging systems for IoT

A file system is very passive. You write some files, read some files, and how interesting could that get? With a streaming system, what we’re looking for is completely real time. That is, if a publisher publishes something, then all listeners who want to hear what the publisher is saying will get notified within five milliseconds inside the same data center. Five milliseconds to get a notification saying, “Hey, this was published.” Almost instantaneous. If I cross data centers, let’s say to a data center halfway across the world, and you publish something in Japan, then a person in South Africa or somewhere can get that information in under a second. They’ll be notified of that. There’s a push that we do so they get notified of it in under a second, at a scale that’s billions of messages per second.
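What Srivas is describing is a push-based publish/subscribe model: listeners register interest in a topic, and the system delivers each message to them at publish time rather than having them poll. Below is a minimal in-process sketch of that pattern (generic Python, not the MapR Streams API; real systems add partitioning, persistence, and cross-data-center replication on top):

```python
# Minimal in-process publish/subscribe sketch: subscribers register a
# callback and are pushed each message at publish time, instead of
# polling for it. Real systems layer partitioning, persistence, and
# cross-data-center replication on top of this basic pattern.
from collections import defaultdict
from typing import Callable

class MiniBroker:
    def __init__(self) -> None:
        self._subscribers: defaultdict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register a listener to be notified of every message on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> None:
        # Push to every listener immediately -- the "notified within
        # milliseconds" behavior, minus the network and durability.
        for callback in self._subscribers[topic]:
            callback(message)

broker = MiniBroker()
broker.subscribe("sensors", lambda m: print("listener A got:", m))
broker.subscribe("sensors", lambda m: print("listener B got:", m))
broker.publish("sensors", "temperature=21.5")
```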

… We have learned from Kafka, we have learned from Tibco, we have learned from RabbitMQ and so many other technologies that preceded us. We learned a lot from watching all those things, and they have paved the way for us. I think what we’ve done now is take it to the next level, which is what we really need for IoT.

Powering the world’s largest biometric identity system

We implemented this thing in Aadhaar, India’s biometric identity project. It links you to your banking, your hospital admissions, all your records—whether it’s school admissions, hospital admissions, even airport entry, passports, pension payments.

… There are about a billion people online right now. There are another 300 million to go, but what I wanted to point out is that it’s completely digitized. If you want to withdraw money from an ATM, you put in your fingerprint and take the money out. You don’t need a card.

… There was a flood in Chennai last November/December. Massive floods. It rained like it had never rained before. It rained continuously for two months, and houses were submerged in 10 feet of water. People lost everything—across the entire state of Tamil Nadu in India, people lost everything. But when they were rescued, they still had their fingerprints, and they could access everything: their bank accounts, their records, everything, because the Aadhaar project was biometrics-based. Really, they lost everything, but they still had it. They could get to everything right away. Think about what happens here if you lose your wallet: all your credit cards, your driver’s license, everything. You don’t have that kind of an issue anymore. That problem was solved.

Editor’s note: This interview took place in mid-January 2016; at that time, M.C. Srivas served as CTO of MapR. Srivas currently serves as Chief Architect for Data at Uber and is a member of the Board of Directors of MapR.
