Building machine learning solutions that can withstand adversarial attacks

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show Podcast: Parvez Ahammad on minimal supervision, and the importance of explainability, interpretability, and security.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data, data science, and AI. Find us on Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the Data Show, I spoke with Parvez Ahammad, who leads the data science and machine learning efforts at Instart Logic. He has applied machine learning in a variety of domains, most recently to computational neuroscience and security. Along the way, he has assembled and managed teams of data scientists and has had to grapple with issues like explainability and interpretability, ethics, insufficient amount of labeled data, and adversaries who target machine learning models. As more companies deploy machine learning models into products, it’s important to remember there are many other factors that come into play aside from raw performance metrics.

Here are some highlights from our conversation:
Continue reading “Building machine learning solutions that can withstand adversarial attacks”

Using Apache Spark to predict attack vectors among billions of users and trillions of events

[A version of this post appears on the O’Reilly Radar.]

The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher, TuneIn, iTunes, SoundCloud, RSS.

In this episode of the O’Reilly Data Show, I spoke with Fang Yu, co-founder and CTO of DataVisor. We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain.

DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft, the startup has developed large-scale unsupervised algorithms on top of Apache Spark, to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.”

Several years ago, I found myself immersed in the security space and at that time tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m starting to come across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices.

Continue reading “Using Apache Spark to predict attack vectors among billions of users and trillions of events”

Analytic engines that factor in security labels

[A version of this post appears on the O’Reilly Strata blog.]

Originated by the NSA, Apache Accumulo is a BigTable inspired data store known for being highly scalable and for its interesting security model. Federal agencies and Defense contractors have deployed Accumulo on clusters of a thousand or more servers. It also uses “cell-level” security to control access to values stored in individual cells1.

What Accumulo was lacking were easy-to-use, standard analytic engines that allow users to interact with data. The release of Sqrrl Enterprise this past week fills that gap. Sqrrl Enterprise provides an initial set of analytic engines for the Accumulo ecosystem2. It includes support for interactive SQL, fulltext search, and queries over graph data. Each of these engines takes into account security labels placed on data: since every data object ingested into Sqrrl has a security label, (query & analytic) results incorporate those access levels. Analysts interact with data as they normally would. For example Sqrrl’s indexing technology accounts for security labels, and search queries are written in standard Lucene syntax. Reminiscent of the Phoenix project for HBase3, SQL queries4 in Sqrrl are converted into optimized Accumulo iterators.

Continue reading “Analytic engines that factor in security labels”