New binary regression models using symmetric and asymmetric link functions
Imbalanced data, double Lindley distribution, power and reversal power link functions, maximum likelihood method, predictive performance
Regression models for binary response variables (1 for occurrence of the event of interest, or "success"; 0 for non-occurrence, or "failure") are widely applied in areas such as health, finance, and industry. The most commonly used binary regression model is logistic regression. However, it relies on the logit link, which is symmetric and may be unsuitable in certain situations, for example when one response class is much rarer than the other (an imbalanced data set). The main aim of this work is to present new binary regression models using symmetric and asymmetric link functions. The parameters of the proposed models (namely, the double Lindley, asymmetric double Lindley, power double Lindley, and reversal power double Lindley binary regression models) are estimated by the classical maximum likelihood method. To compare the candidate models and select the "best" one, we use information criteria (AIC and BIC) and measures of predictive performance (AUC, balanced accuracy, sensitivity, specificity, positive and negative predictive values, F1-score, and the Matthews correlation coefficient, among others). Through the analysis of two real data sets, one on breast cancer, obtained from the University of California, Irvine (UCI) Machine Learning Repository, and another from a competition promoted by Santander Bank for the Kaggle community, we show that models using the proposed link functions can provide a better fit and predictive ability than models using standard links, such as the logit.
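As a rough illustration of the comparison criteria named above, the following Python sketch computes several of the predictive measures from confusion-matrix counts, together with AIC and BIC from a maximized log-likelihood. The function names and the counts in the usage example are hypothetical and are not taken from the data sets analysed in this work. The formulas are the standard textbook definitions, and the sketch assumes that every denominator is non-zero.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Summary measures from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives,
    false negatives. Assumes every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "mcc": mcc,
    }


def aic(log_lik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


if __name__ == "__main__":
    # Hypothetical imbalanced example: 50 events among 1000 observations.
    for name, value in classification_metrics(tp=30, fp=10, tn=940, fn=20).items():
        print(f"{name}: {value:.4f}")
```

With these hypothetical counts the plain accuracy is 0.97 even though only 60% of the events are detected. This is why measures such as balanced accuracy and the Matthews correlation coefficient tend to be more informative for imbalanced data.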