[A version of this post appears on the O’Reilly Radar blog.]
Back in 2008, when we were working on what became one of the first papers on big data technologies, one of our first visits was to LinkedIn’s new “data” team. Many of the members of that team went on to build interesting tools and products, and team manager DJ Patil emerged as one of the best-known data scientists. I recently sat down with Patil to talk about his new ebook (written with Hilary Mason) and other topics in data science and big data.
Here are a few of the topics we touched on:
Proliferation of programs for training and certifying data scientists
Patil and I are both ex-academics who learned “data science” in industry. In fact, up until a few years ago, one acquired data science skills via “on-the-job training.” But a new job title that catches on usually leads to an explosion of programs (I was around when master’s programs in financial engineering took off). Are these programs the right way to acquire the necessary skills? Patil isn’t sure:
“We should call a spade a spade, which is [how] you and I both saw that master’s of financial engineering. The MIS degree, the information sciences degree. Many of these became effectively, in the perception of people’s minds at this stage, as second-rate degrees to computer science, or math, or physics. My fear is that the data science degree will become that. That would suck. That would be terrible. I think it’s very reasonable to say, “Hey, that data science can bloom into something much more organic.” Informatics and biophysics are good examples of areas that have done that. What is a right curriculum and the right things that are in there? My fear right now is that it’s overly geared toward consumer Internet products versus all the great things that can be done in social sciences and government, enterprise technology, medicine, health, hospital. All these areas I think are wide open, and it’s unclear in early stages on how to do that.”
Patil’s previous ebook covered some of his experiences building data products at LinkedIn. We talked about how the ideas he laid out are playing out beyond Silicon Valley:
“Yeah, I think it’s starting to emerge a little bit more, but I think it’s still very Silicon Valley centric. I think the thing we’re starting to see is when people say “data product,” they’re no longer restricted in how they think of it; it can be a whole company.
… I think it absolutely can be the government, and I think we’re going to see a lot more of that, the president signing an executive order that says, “Hey, everything has to be machine readable.” One of the big things we’re going to see over the next decade is how do we start really unlocking the value proposition from the genome, the genome to the phenome, the phenotyping to the medical records, and the outcomes of all these things. How does that all start to come together; that’s the data problem at the end of the day. One of the things that we’ll start realizing is that part of this is a numbers game, the more people who have access to their genome — what are the great things we might be able to unlock in terms of new pharmaceuticals and new treatments and understanding who we are?”
One of the highlights of my conversation with Patil was our discussion on ethics — a subject that we’ve both been thinking about a lot. In particular, one of the things we’re following closely is the growing number of data scientists willing to take into account the (cultural) impact of models and data collection:
“Yes, and I think the thing that I’m happy about is many of those in the data science communities are the first to raise their hands about calling this an important item. I think what we’re going to start seeing as a critical component for the chief data officer is data ethics — just because you can it doesn’t mean you should. There has been a number of times where I’ve worked with data and people, where people asked, “What is the implication of us doing this?” Implication a lot of times is this perception, how’s this going to make someone feel? Is it going to be good? Is it going to be bad? What are the long-term aspects that we have to think through in putting this out there?
… Another issue I think that will be public debate is, “Should we be allowing these things to happen?” I think a lot of times people are most often concerned about the consumer Internet companies; I think people often forget about all these data brokers and other people who have been collecting this stuff, and the data is not even transparent to us. I’m not trying to give us a pass — to start to redirect the conversation away from Silicon Valley. It’s more a way of saying that we need to have a conversation where we talk about where our data is, how do we have control of it?
… I don’t [think many of the data science training programs extensively cover ethics]. I was very fortunate in my training to be required to go through ethics, an ethics class in very traditional style. I can’t tell you how many times that class has come to aid. Just simple questions — whether at LinkedIn, RelateIQ, the government, whatever — they always come to me because they give me a formal way to think about it and to have a conversation, because you hold incredible power when you have access to the data; you have to be able to ask yourself, “Should we be doing this?” Or, “How should we go about doing it?”
Make sure you read the newly released ebook Data Driven: Creating a Data Culture by DJ Patil and Hilary Mason. We also recommend DJ’s previous ebook: Data Jujitsu: The art of turning data into product.