Skip to content

Publications with awards

Return to full publication list

exclamation Clarivate Web of Science Highly Cited Paper 2020
DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites.
Li, F., Chen, J., Leier, A., Marquez-Lago, T., Liu, Q., Wang, Y., Revote, J., Smith, I. A., Akutsu, T., Webb, G. I., Kurgan, L., & Song, J.
Bioinformatics, 36(4), 1057-1065, 2020.
[DOI] [Bibtex] [Abstract]

@Article{Li2020a,
author = {Li, Fuyi and Chen, Jinxiang and Leier, Andre and Marquez-Lago, Tatiana and Liu, Quanzhong and Wang, Yanze and Revote, Jerico and Smith, A Ian and Akutsu, Tatsuya and Webb, Geoffrey I and Kurgan, Lukasz and Song, Jiangning},
journal = {Bioinformatics},
title = {{DeepCleave}: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites},
year = {2020},
issn = {1367-4803},
number = {4},
pages = {1057--1065},
volume = {36},
abstract = {Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the ``life and death'' process of proteins, many of the corresponding substrates and their cleavage sites were not found yet. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases' functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. Supplementary data are available at Bioinformatics online.},
comment = {Clarivate Web of Science Highly Cited Paper 2020},
doi = {10.1093/bioinformatics/btz721},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT {Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the "life and death" process of proteins, many of the corresponding substrates and their cleavage sites were not found yet. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases’ functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events.We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites.The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/.Supplementary data are available at Bioinformatics online.}

exclamation Clarivate Web of Science Highly Cited Paper 2020, 2021
iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data.
Chen, Z., Zhao, P., Li, F., Marquez-Lago, T. T., Leier, A., Revote, J., Zhu, Y., Powell, D. R., Akutsu, T., Webb, G. I., Chou, K., Smith, I. A., Daly, R. J., Li, J., & Song, J.
Briefings in Bioinformatics, 21(3), 1047-1057, 2020.
[DOI] [Bibtex] [Abstract]

@Article{10.1093/bib/bbz041,
author = {Chen, Zhen and Zhao, Pei and Li, Fuyi and Marquez-Lago, Tatiana T and Leier, Andre and Revote, Jerico and Zhu, Yan and Powell, David R and Akutsu, Tatsuya and Webb, Geoffrey I and Chou, Kuo-Chen and Smith, A Ian and Daly, Roger J and Li, Jian and Song, Jiangning},
journal = {Briefings in Bioinformatics},
title = {{iLearn}: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of {DNA}, {RNA} and protein sequence data},
year = {2020},
issn = {1477-4054},
number = {3},
pages = {1047--1057},
volume = {21},
abstract = {With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.},
comment = {Clarivate Web of Science Highly Cited Paper 2020, 2021},
doi = {10.1093/bib/bbz041},
keywords = {Bioinformatics and DP140100087},
related = {computational-biology},
}
ABSTRACT With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.

exclamation Clarivate Web of Science Hot Paper and Highly Cited Paper 2019, 2020, 2021
iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites.
Song, J., Wang, Y., Li, F., Akutsu, T., Rawlings, N. D., Webb, G. I., & Chou, K.
Briefings in Bioinformatics, 20(2), 638-658, 2019.
[DOI] [Bibtex] [Abstract]

@Article{doi:10.1093/bib/bby028,
author = {Song, Jiangning and Wang, Yanan and Li, Fuyi and Akutsu, Tatsuya and Rawlings, Neil D and Webb, Geoffrey I and Chou, Kuo-Chen},
journal = {Briefings in Bioinformatics},
title = {{iProt-Sub}: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites},
year = {2019},
number = {2},
pages = {638--658},
volume = {20},
abstract = {Regulation of proteolysis plays a critical role in a myriad of important cellular processes. The key to better understanding the mechanisms that control this process is to identify the specific substrates that each protease targets. To address this, we have developed iProt-Sub, a powerful bioinformatics tool for the accurate prediction of protease-specific substrates and their cleavage sites. Importantly, iProt-Sub represents a significantly advanced version of its successful predecessor, PROSPER. It provides optimized cleavage site prediction models with better prediction performance and coverage for more species-specific proteases (4 major protease families and 38 different proteases). iProt-Sub integrates heterogeneous sequence and structural features and uses a two-step feature selection procedure to further remove redundant and irrelevant features in an effort to improve the cleavage site prediction accuracy. Features used by iProt-Sub are encoded by 11 different sequence encoding schemes, including local amino acid sequence profile, secondary structure, solvent accessibility and native disorder, which will allow a more accurate representation of the protease specificity of approximately 38 proteases and training of the prediction models. Benchmarking experiments using cross-validation and independent tests showed that iProt-Sub is able to achieve a better performance than several existing generic tools. We anticipate that iProt-Sub will be a powerful tool for proteome-wide prediction of protease-specific substrates and their cleavage sites, and will facilitate hypothesis-driven functional interrogation of protease-specific substrate cleavage and proteolytic events.},
comment = {Clarivate Web of Science Hot Paper and Highly Cited Paper 2019, 2020, 2021},
doi = {10.1093/bib/bby028},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT Regulation of proteolysis plays a critical role in a myriad of important cellular processes. The key to better understanding the mechanisms that control this process is to identify the specific substrates that each protease targets. To address this, we have developed iProt-Sub, a powerful bioinformatics tool for the accurate prediction of protease-specific substrates and their cleavage sites. Importantly, iProt-Sub represents a significantly advanced version of its successful predecessor, PROSPER. It provides optimized cleavage site prediction models with better prediction performance and coverage for more species-specific proteases (4 major protease families and 38 different proteases). iProt-Sub integrates heterogeneous sequence and structural features and uses a two-step feature selection procedure to further remove redundant and irrelevant features in an effort to improve the cleavage site prediction accuracy. Features used by iProt-Sub are encoded by 11 different sequence encoding schemes, including local amino acid sequence profile, secondary structure, solvent accessibility and native disorder, which will allow a more accurate representation of the protease specificity of approximately 38 proteases and training of the prediction models. Benchmarking experiments using cross-validation and independent tests showed that iProt-Sub is able to achieve a better performance than several existing generic tools. We anticipate that iProt-Sub will be a powerful tool for proteome-wide prediction of protease-specific substrates and their cleavage sites, and will facilitate hypothesis-driven functional interrogation of protease-specific substrate cleavage and proteolytic events.

exclamation Clarivate Web of Science Highly Cited Paper 2021
Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series.
Pelletier, C., Webb, G. I., & Petitjean, F.
Remote Sensing, 11(5), Art. no. 523, 2019.
[DOI] [Bibtex] [Abstract]

@Article{PelletierEtAl19,
author = {Pelletier, Charlotte and Webb, Geoffrey I. and Petitjean, Francois},
journal = {Remote Sensing},
title = {Temporal Convolutional Neural Network for the Classification of {Satellite Image Time Series}},
year = {2019},
issn = {2072-4292},
number = {5},
volume = {11},
abstract = {Latest remote sensing sensors are capable of acquiring high spatial and spectral Satellite Image Time Series (SITS) of the world. These image series are a key component of classification systems that aim at obtaining up-to-date and accurate land cover maps of the Earth's surfaces. More specifically, current SITS combine high temporal, spectral and spatial resolutions, which makes it possible to closely monitor vegetation dynamics. Although traditional classification algorithms, such as Random Forest (RF), have been successfully applied to create land cover maps from SITS, these algorithms do not make the most of the temporal domain. This paper proposes a comprehensive study of Temporal Convolutional Neural Networks (TempCNNs), a deep learning approach which applies convolutions in the temporal dimension in order to automatically learn temporal (and spectral) features. The goal of this paper is to quantitatively and qualitatively evaluate the contribution of TempCNNs for SITS classification, as compared to RF and Recurrent Neural Networks (RNNs)---a standard deep learning approach that is particularly suited to temporal data. We carry out experiments on Formosat-2 scene with 46 images and one million labelled time series. The experimental results show that TempCNNs are more accurate than the current state of the art for SITS classification. We provide some general guidelines on the network architecture, common regularization mechanisms, and hyper-parameter values such as batch size; we also draw out some differences with standard results in computer vision (e.g., about pooling layers). Finally, we assess the visual quality of the land cover maps produced by TempCNNs.},
articlenumber = {523},
comment = {Clarivate Web of Science Highly Cited Paper 2021},
doi = {10.3390/rs11050523},
keywords = {time series, earth observation analytics},
related = {earth-observation-analytics},
}
ABSTRACT Latest remote sensing sensors are capable of acquiring high spatial and spectral Satellite Image Time Series (SITS) of the world. These image series are a key component of classification systems that aim at obtaining up-to-date and accurate land cover maps of the Earth’s surfaces. More specifically, current SITS combine high temporal, spectral and spatial resolutions, which makes it possible to closely monitor vegetation dynamics. Although traditional classification algorithms, such as Random Forest (RF), have been successfully applied to create land cover maps from SITS, these algorithms do not make the most of the temporal domain. This paper proposes a comprehensive study of Temporal Convolutional Neural Networks (TempCNNs), a deep learning approach which applies convolutions in the temporal dimension in order to automatically learn temporal (and spectral) features. The goal of this paper is to quantitatively and qualitatively evaluate the contribution of TempCNNs for SITS classification, as compared to RF and Recurrent Neural Networks (RNNs) —a standard deep learning approach that is particularly suited to temporal data. We carry out experiments on Formosat-2 scene with 46 images and one million labelled time series. The experimental results show that TempCNNs are more accurate than the current state of the art for SITS classification. We provide some general guidelines on the network architecture, common regularization mechanisms, and hyper-parameter values such as batch size; we also draw out some differences with standard results in computer vision (e.g., about pooling layers). Finally, we assess the visual quality of the land cover maps produced by TempCNNs.

exclamation Best Research Paper Award
Efficient search of the best warping window for Dynamic Time Warping.
Tan, C. W., Herrmann, M., Forestier, G., Webb, G. I., & Petitjean, F.
Proceedings of the 2018 SIAM International Conference on Data Mining, pp. 459-467, 2018.
[PDF] [Bibtex] [Abstract]

@InProceedings{TanEtAl18,
author = {Tan, Chang Wei and Herrmann, Matthieu and Forestier, Germain and Webb, Geoffrey I. and Petitjean, Francois},
booktitle = {Proceedings of the 2018 {SIAM} International Conference on Data Mining},
title = {Efficient search of the best warping window for {Dynamic Time Warping}},
year = {2018},
pages = {459--467},
abstract = {Time series classification maps time series to labels. The nearest neighbour algorithm (NN) using the Dynamic Time Warping (DTW) similarity measure is a leading algorithm for this task and a component of the current best ensemble classifiers for time series. However, NN-DTW is only a winning combination when its meta-parameter---its warping window---is learned from the training data. The warping window (WW) intuitively controls the amount of distortion allowed when comparing a pair of time series. With a training database of N time series of lengths L, a naive approach to learning the WW requires {$\Omega(N^2 L^3)$} operations. This often translates in NN-DTW requiring days for training on datasets containing a few thousand time series only. In this paper, we introduce FastWWSearch: an efficient and exact method to learn WW. We show on 86 datasets that our method is always faster than the state of the art, with at least one order of magnitude and up to 1000x speed-up.},
comment = {Best Research Paper Award},
keywords = {time series},
related = {scalable-time-series-classifiers},
}
ABSTRACT Time series classification maps time series to labels. The nearest neighbour algorithm (NN) using the Dynamic Time Warping (DTW) similarity measure is a leading algorithm for this task and a component of the current best ensemble classifiers for time series. However, NN-DTW is only a winning combination when its meta-parameter — its warping window — is learned from the training data. The warping window (WW) intuitively controls the amount of distortion allowed when comparing a pair of time series. With a training database of N time series of lengths L, a naive approach to learning the WW requires Omega(N^2 L^3) operations. This often translates in NN-DTW requiring days for training on datasets containing a few thousand time series only. In this paper, we introduce FastWWSearch: an efficient and exact method to learn WW. We show on 86 datasets that our method is always faster than the state of the art, with at least one order of magnitude and up to 1000x speed-up.

exclamation Clarivate Web of Science Hot Paper and Highly Cited Paper 2019, 2020
PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework.
Song, J., Li, F., Takemoto, K., Haffari, G., Akutsu, T., Chou, K. C., & Webb, G. I.
Journal of Theoretical Biology, 443, 125-137, 2018.
[DOI] [Bibtex]

@Article{SongEtAl18,
author = {Song, J. and Li, F. and Takemoto, K. and Haffari, G. and Akutsu, T. and Chou, K. C. and Webb, G. I.},
journal = {Journal of Theoretical Biology},
title = {{PREvaIL}, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework},
year = {2018},
pages = {125--137},
volume = {443},
comment = {Clarivate Web of Science Hot Paper and Highly Cited Paper 2019, 2020},
doi = {10.1016/j.jtbi.2018.01.023},
keywords = {Bioinformatics},
related = {computational-biology},
url = {https://authors.elsevier.com/c/1WWQY57ilzyRc},
}
ABSTRACT 

exclamation Clarivate Web of Science Highly Cited Paper 2019, 2020, 2021
iFeature: a python package and web server for features extraction and selection from protein and peptide sequences.
Chen, Z., Zhao, P., Li, F., Leier, A., Marquez-Lago, T. T., Wang, Y., Webb, G. I., Smith, I. A., Daly, R. J., Chou, K., & Song, J.
Bioinformatics, 2499-2502, 2018.
[DOI] [Bibtex] [Abstract]

@Article{ChenEtAl18,
author = {Chen, Zhen and Zhao, Pei and Li, Fuyi and Leier, Andre and Marquez-Lago, Tatiana T and Wang, Yanan and Webb, Geoffrey I and Smith, A Ian and Daly, Roger J and Chou, Kuo-Chen and Song, Jiangning},
journal = {Bioinformatics},
title = {{iFeature}: a {Python} package and web server for features extraction and selection from protein and peptide sequences},
year = {2018},
pages = {2499--2502},
abstract = {Structural and physiochemical descriptors extracted from sequence data have been widely used to represent sequences and predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs/RNAs. Here, we present iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors. It also allows users to extract specific amino acid properties from the AAindex database. Furthermore, iFeature integrates 12 different types of commonly used feature clustering, selection and dimensionality reduction algorithms, greatly facilitating training, analysis and benchmarking of machine-learning models. The functionality of iFeature is made freely available via an online web server and a stand-alone toolkit.},
comment = {Clarivate Web of Science Highly Cited Paper 2019, 2020, 2021},
doi = {10.1093/bioinformatics/bty140},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT Structural and physiochemical descriptors extracted from sequence data have been widely used to represent sequences and predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs/RNAs. Here, we present iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors. It also allows users to extract specific amino acid properties from the AAindex database. Furthermore, iFeature integrates 12 different types of commonly used feature clustering, selection and dimensionality reduction algorithms, greatly facilitating training, analysis and benchmarking of machine-learning models. The functionality of iFeature is made freely available via an online web server and a stand-alone toolkit.

exclamation Clarivate Web of Science Highly Cited Paper 2019, 2020, 2021
PROSPERous: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy.
Song, J., Li, F., Leier, A., Marquez-Lago, T. T., Akutsu, T., Haffari, G., Chou, K., Webb, G. I., & Pike, R. N.
Bioinformatics, 34(4), 684-687, 2017.
[DOI] [Bibtex]

@Article{Song2017a,
author = {Song, Jiangning and Li, Fuyi and Leier, Andre and Marquez-Lago, Tatiana T and Akutsu, Tatsuya and Haffari, Gholamreza and Chou, Kuo-Chen and Webb, Geoffrey I and Pike, Robert N},
journal = {Bioinformatics},
title = {{PROSPERous}: high-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy},
year = {2017},
number = {4},
pages = {684--687},
volume = {34},
comment = {Clarivate Web of Science Highly Cited Paper 2019, 2020, 2021},
doi = {10.1093/bioinformatics/btx670},
keywords = {Bioinformatics},
related = {computational-biology},
}
ABSTRACT 

exclamation Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue
A multiple test correction for streams and cascades of statistical hypothesis tests.
Webb, G. I., & Petitjean, F.
Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-16, pp. 1255-1264, 2016.
[PDF] [DOI] [Bibtex] [Abstract]

@InProceedings{WebbPetitjean16,
author = {Webb, Geoffrey I. and Petitjean, Francois},
booktitle = {Proceedings of the {ACM} {SIGKDD} Conference on Knowledge Discovery and Data Mining, {KDD}-16},
title = {A multiple test correction for streams and cascades of statistical hypothesis tests},
year = {2016},
pages = {1255--1264},
publisher = {ACM Press},
abstract = {Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance.
This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed.
To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models.
We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.},
comment = {Top reviewer score (4.75/5.0), shortlisted for best paper award and invited to ACM TKDE journal KDD-16 special issue},
doi = {10.1145/2939672.2939775},
keywords = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
related = {statistically-sound-association-discovery},
url = {http://dl.acm.org/authorize?N19100},
}
ABSTRACT Statistical hypothesis testing is a popular and powerful tool for inferring knowledge from data. For every such test performed, there is always a non-zero probability of making a false discovery, i.e.~rejecting a null hypothesis in error. Familywise error rate (FWER) is the probability of making at least one false discovery during an inference process. The expected FWER grows exponentially with the number of hypothesis tests that are performed, almost guaranteeing that an error will be committed if the number of tests is big enough and the risk is not managed; a problem known as the multiple testing problem. State-of-the-art methods for controlling FWER in multiple comparison settings require that the set of hypotheses be pre-determined. This greatly hinders statistical testing for many modern applications of statistical inference, such as model selection, because neither the set of hypotheses that will be tested, nor even the number of hypotheses, can be known in advance. This paper introduces Subfamilywise Multiple Testing, a multiple-testing correction that can be used in applications for which there are repeated pools of null hypotheses from each of which a single null hypothesis is to be rejected and neither the specific hypotheses nor their number are known until the final rejection decision is completed. To demonstrate the importance and relevance of this work to current machine learning problems, we further refine the theory to the problem of model selection and show how to use Subfamilywise Multiple Testing for learning graphical models. We assess its ability to discover graphical models on more than 7,000 datasets, studying the ability of Subfamilywise Multiple Testing to outperform the state of the art on data with varying size and dimensionality, as well as with varying density and power of the present correlations. 
Subfamilywise Multiple Testing provides a significant improvement in statistical efficiency, often requiring only half as much data to discover the same model, while strictly controlling FWER.

exclamation Best Research Paper Honorable Mention Award
Scaling log-linear analysis to datasets with thousands of variables.
Petitjean, F., & Webb, G. I.
Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 469-477, 2015.
[URL] [Bibtex] [Abstract]

@InProceedings{PetitjeanWebb15,
  author    = {Petitjean, F. and Webb, G. I.},
  booktitle = {Proceedings of the 2015 {SIAM} International Conference on Data Mining},
  title     = {Scaling log-linear analysis to datasets with thousands of variables},
  year      = {2015},
  pages     = {469--477},
  abstract  = {Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.},
  comment   = {Best Research Paper Honorable Mention Award},
  keywords  = {Association Rule Discovery and statistically sound discovery and scalable graphical models and Learning from large datasets and DP140100087},
  related   = {scalable-graphical-modeling},
  doi       = {10.1137/1.9781611974010.53},
  url       = {http://epubs.siam.org/doi/pdf/10.1137/1.9781611974010.53},
}
ABSTRACT Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. We have recently shown that, if we ensure that the graph supporting the log-linear model is chordal, log-linear analysis can be applied to datasets with hundreds of variables without sacrificing the statistical soundness [21]. However, further scalability remained limited, because state-of-the-art techniques have to examine every edge at every step of the search. This paper makes the following contributions: 1) we prove that only a very small subset of edges has to be considered at each step of the search; 2) we demonstrate how to efficiently find this subset of edges and 3) we show how to efficiently keep track of the best edges to be subsequently added to the initial model. Our experiments, carried out on real datasets with up to 2000 variables, show that our contributions make it possible to gain about 4 orders of magnitude, making log-linear analysis of datasets with thousands of variables possible in seconds instead of days.

exclamation One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue
Dynamic Time Warping Averaging of Time Series Allows Faster and More Accurate Classification.
Petitjean, F., Forestier, G., Webb, G. I., Nicholson, A., Chen, Y., & Keogh, E.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 470-479, 2014.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{PetitjeanEtAl14b,
  author    = {Petitjean, F. and Forestier, G. and Webb, G. I. and Nicholson, A. and Chen, Y. and Keogh, E.},
  booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
  title     = {{Dynamic Time Warping} Averaging of Time Series Allows Faster and More Accurate Classification},
  year      = {2014},
  pages     = {470--479},
  abstract  = {Recent years have seen significant progress in improving both the efficiency and effectiveness of time series classification. However, because the best solution is typically the Nearest Neighbor algorithm with the relatively expensive Dynamic Time Warping as the distance measure, successful deployments on resource constrained devices remain elusive. Moreover, the recent explosion of interest in wearable devices, which typically have limited computational resources, has created a growing need for very efficient classification algorithms. A commonly used technique to glean the benefits of the Nearest Neighbor algorithm, without inheriting its undesirable time complexity, is to use the Nearest Centroid algorithm. However, because of the unique properties of (most) time series data, the centroid typically does not resemble any of the instances, an unintuitive and underappreciated fact. In this work we show that we can exploit a recent result to allow meaningful averaging of 'warped' times series, and that this result allows us to create ultra-efficient Nearest 'Centroid' classifiers that are at least as accurate as their more lethargic Nearest Neighbor cousins.},
  comment   = {One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue},
  keywords  = {time series},
  related   = {scalable-time-series-classifiers},
  doi       = {10.1109/ICDM.2014.27},
  url       = {http://dx.doi.org/10.1109/ICDM.2014.27},
}
ABSTRACT Recent years have seen significant progress in improving both the efficiency and effectiveness of time series classification. However, because the best solution is typically the Nearest Neighbor algorithm with the relatively expensive Dynamic Time Warping as the distance measure, successful deployments on resource constrained devices remain elusive. Moreover, the recent explosion of interest in wearable devices, which typically have limited computational resources, has created a growing need for very efficient classification algorithms. A commonly used technique to glean the benefits of the Nearest Neighbor algorithm, without inheriting its undesirable time complexity, is to use the Nearest Centroid algorithm. However, because of the unique properties of (most) time series data, the centroid typically does not resemble any of the instances, an unintuitive and underappreciated fact. In this work we show that we can exploit a recent result to allow meaningful averaging of 'warped' times series, and that this result allows us to create ultra-efficient Nearest 'Centroid' classifiers that are at least as accurate as their more lethargic Nearest Neighbor cousins.

exclamation One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue
A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data.
Petitjean, F., Allison, L., & Webb, G. I.
Proceedings of the 14th IEEE International Conference on Data Mining, pp. 480-489, 2014.
[PDF] [URL] [Bibtex] [Abstract]

@InProceedings{PetitjeanEtAl14a,
  author    = {Petitjean, F. and Allison, L. and Webb, G. I.},
  booktitle = {Proceedings of the 14th {IEEE} International Conference on Data Mining},
  title     = {A Statistically Efficient and Scalable Method for Log-Linear Analysis of High-Dimensional Data},
  year      = {2014},
  pages     = {480--489},
  abstract  = {Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of {$\chi^2$} tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches -- statistical efficiency -- 2) controls for the risk of false discoveries as well as statistical approaches -- high precision -- and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer -- computational efficiency.},
  comment   = {One of nine papers invited to Knowledge and Information Systems journal ICDM-14 special issue},
  keywords  = {Association Rule Discovery and statistically sound discovery and scalable graphical models and DP140100087},
  related   = {scalable-graphical-modeling},
  doi       = {10.1109/ICDM.2014.23},
  url       = {http://dx.doi.org/10.1109/ICDM.2014.23},
}
ABSTRACT Log-linear analysis is the primary statistical approach to discovering conditional dependencies between the variables of a dataset. A good log-linear analysis method requires both high precision and statistical efficiency. High precision means that the risk of false discoveries should be kept very low. Statistical efficiency means that the method should discover actual associations with as few samples as possible. Classical approaches to log-linear analysis make use of χ2 tests to control this balance between quality and complexity. We present an information-theoretic approach to log-linear analysis. We show that our approach 1) requires significantly fewer samples to discover the true associations than statistical approaches – statistical efficiency – 2) controls for the risk of false discoveries as well as statistical approaches – high precision – and 3) can perform the discovery on datasets with hundreds of variables on a standard desktop computer – computational efficiency.