Using Apache Spark to predict attack vectors among billions of users and trillions of events

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with Fang Yu, co-founder and CTO of DataVisor. We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain.

DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft, the startup has developed large-scale unsupervised algorithms on top of Apache Spark to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.”

Several years ago, I found myself immersed in the security space, and at that time tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m coming across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices.


Tricycle Diaries

On a recent visit to the Philippines, I found myself gawking at two iconic modes of (public) transportation: the tricycle and the jeepney. They remain major sources of gridlock, chaos, and pollution, and many local residents would love to see them banned from the streets of Metro Manila. I doubt that will happen anytime soon, as they remain a cheap mode of public transportation in a city lacking alternatives – save for a train system that leaves large swaths of the metropolis uncovered. (For a good overview of traffic in Metro Manila, see this recent Economist article.)


Hanging on to jeepneys in the manner above is technically illegal (citation + fine), but it doesn’t seem to deter passengers much. Apparently people indulge in an even riskier practice with tricycles (also illegal). The photos below were taken while I was an Uber passenger in cars trailing these tricycles:


While most tricycles are motorized, many in the Intramuros and “Old Manila” area were pedal-powered:


Jeepney (“the king of the road”)

The Jeepney as a “school bus”:

Packed Jeepney, and it’s not even rush hour: