Matteo Riondato

Contact info

Hypothesis Testing and Statistically-sound Pattern Mining

Time and Location    Slides    2-pager    Introduction    Outline    Meet the Tutors    Bibliography   

Time and Location

SDM'21, May 1, 2021, 10.30am EDT

Slides

Slides

Slides from the KDD'19 edition of the tutorial

^top

2-page summary

A 2-page summary of the tutorial, including the outline.

Introduction

The availability of massive datasets has highlighted the need of computationally efficient and statistically-sound methods to extracts patterns while providing rigorous guarantees on the quality of the results, in particular with respect to false discoveries.

In this tutorial we survey recent methods that properly combine computational and statistical considerations to efficiently mine statistically reliable patterns from large datasets. We start by introducing the fundamental concepts in statistical hypothesis testing, which may not be familiar to everyone in the data mining community. We then explain how the computational and statistical challenges in pattern mining have been tackled in different ways. Finally, we describe the application of these methods in areas such as market basket analysis, subgraph mining, social networks analysis, and cancer genomics.

The purpose of this tutorial is to introduce the audience to statistical hypothesis testing, to emphasize the importance of properly balancing the computational and statistical aspects of pattern mining, highlighting the usefulness of techniques that do so for the data mining researcher, and also to encourage further research in this area.

^top

Outline

  1. Introduction and Theoretical Foundations (1 hour)
    1. Testing a single hypothesis: setting, basic concepts, and applications to social networks and computational biology
    2. Fundamental tests: Fisher's test, χ2 test, Barnard's test, A/B testing
    3. The challenge of testing multiple hypotheses: Family-Wise Error Rate and False Discovery Rate
    4. Taming the challenge: the Bonferroni-Holm procedure and the Benjamini-Yekutieli correction
  2. Mining Statistically-Sound Patterns (1 hour and 30 minutes)
    1. Computational and statistical challenges in pattern mining
    2. Computational aspects: LAMP, permutation testing, TopKWY
    3. Statistical aspects: hold-out approach and layered critical values, a threshold for significant pattern mining, true frequent itemsets
    4. Other statistical measures: emerging patterns, discriminative patterns, significant association rules, significant subgroups
    5. Applications: subgraph mining, cancer genomics, computational biology, and survival analysis
  3. Recent developments and advanced topics (30 minutes)
    1. Removing assumptions on the data generating process
    2. Data-dependent hypothesis weighting
    3. Conclusions, future directions, and discussion

^top

Organizers

Leonardo Pellegrina is a Postdoctoral Researcher at the Department of Information Engineering of the University of Padova, and a Visiting Research Fellow at Brown University. His research activities focus on efficient and statistically sound algorithms for pattern discovery from large data, with applications to computational biology. He has presented a tutorial a ACM KDD'19.

Matteo Riondato is an assistant professor of computer science at Amherst College, and a visiting faculty at Brown University. His research focuses on randomized approximation algorithms for knowledge discovery, data mining, and machine learning. His works received best-of-conference awards at SIAM SDM'14, ACM KDD'16, IEEE ICDM'18, and WSDM'21. He has presented tutorials at ACM KDD'15, ACM CIKM'15, ECML PKDD'15, WWW'16, and ACM KDD'19.

Fabio Vandin is a professor at the Department of Information Engineering at the University of Padova (Italy). Prior to joining the University of Padova he was an assistant professor of research at Brown University and an assistant professor at the University of Southern Denmark, where he holds an adjunct associate professor of computer science position. His research interests are in algorithms for statistically significant pattern mining and machine learning, and in applications to large datasets from molecular biology. He received the best paper award at RECOMB'13. His work has been funded in part by the National Science Foundation (NSF), the National Institute of Health (NIH), and the Italian Ministry for Education, University and Research (MIUR). He has presented tutorials at the Computational Genomics Summer Institute (CGSI) of UCLA in 2017, and in 2019.

^top

Bibliography

BibTeX collection

^top