FEATURE ENGINEERING FOR TOPICAL CLUSTERING BASED ON NAMED ENTITY
Conventional clustering research focuses on extracting keywords and grouping them by word similarity. However, this approach incurs high complexity, low speed, and low accuracy because too many candidate keywords must be computed. To overcome these weaknesses, this paper presents a topical web document clustering model that uses not only keywords but also named entities such as a person’s name, an organization, or a location. We compare the proposed model experimentally with traditional models and analyze how the effect of named entities varies with the characteristics of the document collection. For feature engineering, we adopt word embedding, a collective name for a set of language modeling techniques in natural language processing. In particular, we examine the correlation among the topic words of the clustered sets according to the concept level of the named entities.
web document clustering, named entity, feature engineering, word embedding.