Abstract:
Statistical learning (machine learning) models and frameworks are seeing ever-increasing development and adoption. At the same time, the vast amounts of data used for training can have unintended effects on model fitting time. In particular, Support Vector Machines (SVMs), which exhibit strong predictive performance, can become computationally intensive, or even infeasible, when applied to large datasets. This dissertation proposes a method to reduce the training time of an SVM classifier by combining two partitioning methods with two sampling approaches. The partitioning methods separate distinct subsets of the feature space, handling both numerical and categorical variables, while the sampling approaches reduce the size of the training set while preserving as much of its representative power as possible. The results obtained in applications to both simulated and real data are quite satisfactory, showing shorter training times and, in some cases, improved predictive performance compared to the traditional approach of training on all observations in a dataset. An important additional finding was that the proposed method mitigates the effects of the "curse of dimensionality".
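The general idea of partitioning the feature space and then sampling from each region can be sketched as follows. This is a minimal, hypothetical illustration only, not the dissertation's actual method: it partitions a single numerical feature into equal-width bins (the dissertation's partitioning methods are more elaborate and also cover categorical variables) and draws the same fraction from each bin, so every region of the space stays represented in the reduced training set handed to the SVM.

```python
import random

def partition_then_sample(points, labels, n_bins=4, frac=0.25, seed=0):
    """Illustrative sketch: equal-width binning of one numerical feature,
    then per-bin subsampling. Returns a reduced list of (x, y) pairs."""
    rng = random.Random(seed)
    lo, hi = min(points), max(points)
    width = (hi - lo) / n_bins or 1.0  # guard against zero-width range
    bins = {}
    for x, y in zip(points, labels):
        # Assign each observation to a bin; clamp the maximum into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append((x, y))
    sample = []
    for members in bins.values():
        # Keep roughly the same fraction of every bin (at least one point),
        # so no region of the feature space is dropped entirely.
        k = max(1, int(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample

# Toy data: 200 points on [0, 10] with a simple threshold label.
random.seed(1)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1 if x > 5 else 0 for x in xs]
reduced = partition_then_sample(xs, ys)
print(len(reduced))  # roughly a quarter of the original 200 observations
```

The reduced set would then be used to fit the SVM in place of the full data, which is where the training-time savings come from.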