Bits from the Data Store

Semi-regular field notes from the world of data:

Michael Jordan (“ask me anything”): The distinguished machine learning and Bayesian researcher from UC Berkeley’s AMPLab shares his perspective on machine learning and statistics, including how rarely he reaches for neural networks when consulting in industry:

… while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on? (5) How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken? (6) How do I deal with non-stationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?

knitr: An R package for dynamic report generation. Among other things, it lets you embed R code within Markdown and LaTeX documents.

Computer Security and Data Science: A nicely curated collection of papers on topics such as intrusion detection, anomaly detection, Internet-scale data collection, malware analysis, and intrusion/breach reports.

SociaLite: A distributed query language for large-scale graph analysis that targets Python users. Data is stored in (in-memory) tables and programming logic is expressed in rules, which, judging from the Quick Start code, aren’t as easy to grok as SQL.

Upcoming Webcasts:

Chuck Yarbrough, Building a Data Refinery (2014-09-23); this webcast is sponsored by Pentaho
Patrick Wendell, Apache Spark 1.1 and Beyond! (2014-10-02)
