Osman, Osama A.; Wu, Dalei
College of Engineering and Computer Science
University of Tennessee at Chattanooga
Rare events occur so infrequently that even a large data set can leave researchers short of information. Rare event data also involve a trade-off: data sets with more entries often have fewer explanatory variables, and vice versa. In rare event probability prediction, several data sampling methods exist to remedy the central problem of rare event data: a lack of data to collect and learn from. The most effective methods often involve altering the distribution of the training samples in a data set. The least utilized of these methods is negative sampling, in which positive entries in a data set are used to generate negative entries. To outline the utility of negative sampling, this work discusses the application of five types of negative sampling to a vehicular accident prediction project, where non-accident records are generated by manipulating the temporal and spatial attributes of existing accident records. Different methods of data manipulation, including feature selection and varying negative-to-positive data ratios, are used to explore which types of explanatory variables are most important when predicting vehicular accidents. Two types of predictive models, a Multilayer Perceptron and a Logistic Regression model, are also created and directly compared in terms of predictive capability. Ultimately, the best model for predictive performance depends heavily on the specific implementation and the desired results.
M. S.; A thesis submitted to the faculty of the University of Tennessee at Chattanooga in partial fulfillment of the requirements of the degree of Master of Science.
Machine learning; Sampling (Statistics); Traffic accidents--Mathematical models
vii, 47 leaves
Roland, Jeremy, "How negative sampling provides class balance to rare event case data using a vehicular accident prediction project as a use case scenario" (2020). Masters Theses and Doctoral Dissertations.