Statistical Strategies for the Analysis of Large and Complex Data

This talk will focus on challenges that arise when analyzing datasets that are too large for standard statistical methods to work properly. While one can always opt for the expensive solution of acquiring a more powerful computer or cluster, it turns out that there are some simple statistical strategies that can be used instead. In particular, we’ll discuss the use of so-called “Divide and Recombine” strategies, which delegate part of the work to be carried out in a distributed fashion, for example via Hadoop. Combining these strategies with clever subsampling and data coarsening ideas can result in datasets that are small enough to manage on a standard desktop machine, with only minimal loss of statistical efficiency. The ideas are illustrated with data from the Australian Red Cross.
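
To make the “Divide and Recombine” idea concrete, the sketch below fits an ordinary least-squares regression block by block and then averages the per-block coefficient estimates. The simulated data, the block count, and the simple averaging recombination are illustrative assumptions for this sketch, not necessarily the specific approach taken in the talk.

```python
# A minimal Divide and Recombine sketch: divide the rows into blocks,
# fit the same model on each block, and recombine by averaging.
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for a dataset too large to analyze in one pass.
n, p = 200_000, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

# Divide: split the rows into disjoint blocks. In practice each block
# could live on a separate node and be processed in parallel, e.g. via Hadoop.
n_blocks = 100
block_estimates = []
for X_b, y_b in zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks)):
    # Fit the model independently on each (small, manageable) block.
    beta_b, *_ = np.linalg.lstsq(X_b, y_b, rcond=None)
    block_estimates.append(beta_b)

# Recombine: average the per-block estimates into a single estimator.
beta_dr = np.mean(block_estimates, axis=0)
print(beta_dr)  # close to beta_true, with only minimal efficiency loss
```

Averaging is just one simple recombination rule; because each block is analyzed independently, the per-block fits can run in parallel, and only the small vector of estimates from each block needs to be brought back to a single machine.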