[A version of this article appears on the O’Reilly Radar.]
Having started my career in industry, working on problems in finance, I’ve always appreciated how challenging it is to build consistently profitable systems in this extremely competitive domain. When I served as a quant at a hedge fund in the late 1990s and early 2000s, I worked primarily with price data (time series). I quickly found that it was difficult to find and sustain profitable trading strategies using data sources that everyone else in the industry examined exhaustively. In the early-to-mid 2000s the hedge fund industry began incorporating many more data sources, and today you’re likely to find many finance industry professionals at big data and data science events like Strata + Hadoop World.
During the latest episode of the O’Reilly Data Show Podcast, I had a great conversation with one of the leading data scientists in finance: Gary Kazantsev runs the R&D Machine Learning group at Bloomberg LP. As a former quant, I wanted to know the types of problems Kazantsev and his group work on, and the tools and techniques they’ve found useful. We also talked about data science, data engineering, and recruiting data professionals for Wall Street.
Text mining in finance
One thing I did not do “back in the day” was work with unstructured text. Kazantsev’s group is considered one of the leading text mining outfits in finance, and he has given many presentations on the topic over the last few years. One thing remains unchanged: in finance, tools ultimately have to influence how users make decisions (trades). Kazantsev described some of their projects:
Text analysis is one of them … we do a number of things that essentially amount to producing financial indicators from unstructured text. We take in news stories and produce time series. Sentiment analysis is one product … Other more general market impact indicators are also in this area — topic clustering, topic classification, novelty detection; those are all projects that we work on in this area. … Our view of news is actually very broad. Yes, publicly available news are news, but we also generate an enormous amount of our own content, we take in a number of different third-party contributor feeds, and we collect information from social media. Roughly speaking, at this point, we ingest something like 1.2 million stories per day, give or take. … There’s actually a long pipeline that the stories go through. It ranges from language detection, to named entity recognition, disambiguation, topic classification, and then more complicated things like sentiment analysis.
If you think about [sentiment analysis] from a machine learning perspective, it’s a fairly standard text classification problem. There has been a lot of work in this area, starting with the original papers. … The important part, really, in this case is not necessarily the set of techniques that work best, but how to pose the problem so that it is actually useful for finance professionals. As far as techniques, if you’re interested in large margin methods, support vector machines, you will still come out on top for most of these things, at least in our experience for the domain. The actual challenge is asking the right question, then furthermore, doing enough in feature engineering and also enough statistics to convince ourselves and our clients that what is being produced actually makes sense, that it impacts financial markets in some way.
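To make the "standard text classification problem" framing concrete, here is a minimal sketch of a large margin classifier: a linear SVM trained on bag-of-words features with Pegasos-style stochastic sub-gradient descent. This is not Bloomberg's pipeline. The eight headlines, their labels, and every function name and hyperparameter below are made up for illustration, and a production system would sit behind the language detection, entity recognition, and feature engineering steps described above.

```python
import random
from collections import Counter

# Hypothetical toy training set: +1 = positive tone, -1 = negative tone.
TRAIN = [
    ("shares surge after strong earnings beat", 1),
    ("company raises guidance on record profit", 1),
    ("stock rallies as sales growth accelerates", 1),
    ("analysts upgrade shares on strong demand", 1),
    ("shares plunge after profit warning", -1),
    ("company cuts guidance amid weak demand", -1),
    ("stock falls as losses widen", -1),
    ("analysts downgrade shares on weak sales", -1),
]


def featurize(text):
    """Bag-of-words counts plus a constant bias feature."""
    feats = Counter(text.lower().split())
    feats["<bias>"] = 1
    return feats


def score(weights, feats):
    """Signed distance-like score: positive means positive tone."""
    return sum(weights.get(tok, 0.0) * val for tok, val in feats.items())


def train_linear_svm(examples, lam=0.01, epochs=500, seed=0):
    """Minimize L2-regularized hinge loss with stochastic sub-gradient steps."""
    rng = random.Random(seed)
    data = [(featurize(text), label) for text, label in examples]
    weights = {}
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, label in data:
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = label * score(weights, feats)
            # Regularization: shrink every weight toward zero.
            for tok in weights:
                weights[tok] *= 1.0 - eta * lam
            # Hinge loss: only examples inside the margin push the weights.
            if margin < 1.0:
                for tok, val in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + eta * label * val
    return weights


def predict(weights, text):
    """Return +1 or -1 for a new headline."""
    return 1 if score(weights, featurize(text)) >= 0 else -1


weights = train_linear_svm(TRAIN)
```

Calling `predict(weights, "shares surge on strong demand")` classifies a headline the model has not seen. As the quote stresses, the classifier is the easy part; deciding what "positive" should mean for a trader, and checking that the resulting indicator relates to market moves, is where most of the work goes.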
Pricing financial products
How do you price a financial product that is illiquid (in that it doesn’t trade often)? In particular, suppose pricing isn’t set on a regular basis, so you don’t have much price history to work with. Having worked on problems of this nature (in the context of derivatives) back when I was a quant, I can appreciate how difficult they can be. In fact, the difficulty of properly pricing and assessing the risk of complex mortgage derivatives was at the heart of the financial crisis of 2008. Kazantsev described how they currently approach this problem:
If you have instruments that are illiquid, which trade infrequently, it is a fairly nontrivial problem to value them appropriately, and a lot of work. In fact, the whole industry on Wall Street is dedicated to actually pricing those in one way or another. If you connect a number of different estimates of prices for these securities, you can combine them using an ensemble model, and you will get the consensus recommended [value]. … If you use this consensus value, no matter how good it is, and you actually trade one of these instruments, there is inevitably going to be discrepancies between the traded price and the consensus price. There are some variables that clearly are not being captured in this consensus. To me, this looks basically like a machine learning problem. There are variables that describe the instrument, there are variables that describe the trade, there is the consensus value and then there is the actual traded value. Build a model to explain the discrepancies. … Unlike individual contributors in this marketplace, we see basically more trades. We tend to have better basis for inference.
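The two-stage idea in this quote (combine contributor estimates into a consensus, then model the gap between consensus and traded price) can be sketched in a few lines. The sketch below is an assumption-laden toy, not the group's actual model: the median stands in for the ensemble, a single trade variable (size, in millions) stands in for the instrument and trade descriptors, ordinary least squares stands in for whatever learner is used in practice, and the four trades are invented so that the discount grows with trade size.

```python
from statistics import median

# Hypothetical trades: contributor estimates, trade size, and traded price.
TRADES = [
    {"estimates": [99.8, 100.0, 100.4], "size_mm": 1.0, "traded": 99.9},
    {"estimates": [101.5, 102.0, 102.1], "size_mm": 2.0, "traded": 101.8},
    {"estimates": [97.0, 97.5, 98.5], "size_mm": 5.0, "traded": 97.0},
    {"estimates": [100.9, 101.0, 101.3], "size_mm": 10.0, "traded": 100.0},
]


def consensus_value(estimates):
    """Stage 1: combine contributor estimates (median as a simple ensemble)."""
    return median(estimates)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_residual_model(trades):
    """Stage 2: explain (traded price - consensus) using a trade variable."""
    sizes = [t["size_mm"] for t in trades]
    gaps = [t["traded"] - consensus_value(t["estimates"]) for t in trades]
    return fit_line(sizes, gaps)


def adjusted_estimate(estimates, size_mm, intercept, slope):
    """Consensus value plus the discrepancy predicted for this trade."""
    return consensus_value(estimates) + intercept + slope * size_mm


intercept, slope = fit_residual_model(TRADES)
```

With the fitted `intercept` and `slope`, `adjusted_estimate([100.0, 100.2, 100.6], 4.0, intercept, slope)` shifts the consensus by the discrepancy the model expects for a trade of that size. The more trades the model sees, the better the basis for that correction, which is the advantage Kazantsev points to.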
Recruiting data professionals
Among the Wall Street firms I’ve interacted with, Bloomberg ranks among the most active in evaluating new technologies and in recruiting data scientists and data engineers experienced with the latest tools. I asked Kazantsev if he still actively recruits PhDs in Math, Physics, Computer Science, and other quantitative disciplines (the answer: yes). He also explained that they are beginning to help shape data science curricula at several institutions and to recruit graduates from many such programs:
Physicists already tend to make for fairly good software engineers, especially people who do Monte Carlo simulations or do particle physics because you have to work with a lot of data. The field itself teaches you a certain empiricist attitude to understand that.
Generally speaking, when we recruit, we look for people who, again, have an empiricist attitude toward data. We look for people who have a certain amount of mathematical intuition, a certain amount of curiosity, definitely. Preferably, people who have worked on data problems. Not necessarily big data problems, medium data problems, even small data problems, but using these kinds of methods. It’s not enough to use the methods in question, say, regression or classification or what have you. … We look for people who tend to try to understand why these things work, how they work. What assumptions do they make? That’s one side. The other side is software engineering. My group is structured somewhat differently than many data science groups elsewhere. We deliver products to clients, so we do everything. Even a project that is posed in my group can be taken from a blank sheet of paper to delivery to clients, which involves [delivering actual products and features to Bloomberg’s users].