A NEW GENETIC ALGORITHM FOR CLUSTERING BINARY DATA WITH APPLICATION TO TRAFFIC ROAD ACCIDENTS IN CHRISTCHURCH
The analysis of road traffic accidents is increasingly important because of the cost of accidents and concerns for public road safety. The availability of large data sets makes it viable to study the factors that may affect the frequency and severity of accidents. We deal with a binary data set of the road traffic accidents recorded in Christchurch, New Zealand, from 2000 to 2009, comprising 50 factors for 26440 records classified into 4 severity levels. We used cluster analysis to measure the similarity of the factors, both on the whole data set and separately for each severity level, to outline the association between accident type and the factors involved. Several algorithms based on the well-known k-means algorithm and its variants have been designed specifically for binary data. However, they are known to depend on their initial values and to tend to deliver a local optimum as a solution. A novel genetic algorithm is proposed to improve the performance of the incremental k-means algorithm (C. Ordonez, Clustering binary data streams with K-means, ACM SIGMOD Workshop on DMKD, 2003, San Diego, CA [11]). The objective function is based on a few sufficient statistics that can be calculated easily and quickly on binary numbers. The results may provide an interesting insight into the similarity or dissimilarity between factors and accident severity levels. They suggest that the factors recorded in concurrence with fatal and serious accidents are few and distant from each other, whereas a large number of similar factors are recorded in concurrence with accidents classified as either minor or non-injury.
binary data, cluster analysis, genetic algorithms, k-means algorithm, road traffic accidents.
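
As an illustration of why sufficient statistics are cheap to compute on binary data, the following minimal Python sketch evaluates a within-cluster sum of squared errors from per-cluster counts and per-dimension sums alone. It is a sketch under stated assumptions, not the objective function of the proposed genetic algorithm: the choice of statistics (counts and sums) and the function names `sufficient_stats` and `within_cluster_sse` are hypothetical. It relies only on the fact that x^2 = x for a binary value, so the sum of squares of a cluster equals its sum.

```python
def sufficient_stats(rows, labels, k):
    """Return per-cluster counts and per-dimension sums for binary rows.

    Assumption for illustration: these two statistics are the ones used;
    rows are lists of 0/1 values and labels are cluster indices in [0, k).
    """
    d = len(rows[0])
    counts = [0] * k
    sums = [[0] * d for _ in range(k)]
    for row, j in zip(rows, labels):
        counts[j] += 1
        for l, bit in enumerate(row):
            sums[j][l] += bit
    return counts, sums


def within_cluster_sse(counts, sums):
    """Total within-cluster squared error computed from counts and sums only.

    For binary data, cluster j contributes sum_l M_jl * (1 - M_jl / N_j),
    where N_j is the count and M_jl the sum in dimension l.
    """
    total = 0.0
    for n, m in zip(counts, sums):
        if n == 0:
            continue  # an empty cluster contributes nothing
        total += sum(v * (1.0 - v / n) for v in m)
    return total


# Toy example: four binary records assigned to two clusters.
rows = [[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]]
labels = [0, 0, 1, 1]
counts, sums = sufficient_stats(rows, labels, 2)
sse = within_cluster_sse(counts, sums)
```

Because the error depends only on the counts and sums, a candidate partition can be scored without revisiting the individual records, which is what would make such an objective inexpensive to evaluate repeatedly inside a search procedure.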