Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features

Authors:

Highlights:

Abstract

Automatically organizing documents into a hierarchical tree is needed in many real applications. In this work, we focus on content-based document organization through a hierarchical tree, which can be viewed as a classification problem. We propose a new document representation to improve classification accuracy, and we develop a new hybrid neural network model to handle it. In our representation, a document is represented by a tree structure that has a superior capability for encoding document characteristics. Compared with traditional feature representations, which encode only the global characteristics of a document, the proposed approach encodes both global and local characteristics through a hierarchical tree. Unlike traditional representations, the tree representation reflects the spatial organization of words across the pages and paragraphs of a document, which helps to encode its semantics better. Processing hierarchical trees is itself challenging in terms of computational complexity. We develop a hybrid neural network model, composed of a self-organizing map (SOM) and a multi-layer perceptron (MLP), for this task. Experimental results corroborate that our approach is efficient and effective in registering documents into an organized tree compared with other approaches.
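
To make the tree-structured representation more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a three-level tree (document, pages, paragraphs) in which every node carries a normalized word histogram over a fixed vocabulary, so the root holds global characteristics and the lower nodes hold local ones. The names `Node`, `histogram`, and `build_tree`, the form-feed page separator, and the blank-line paragraph separator are all hypothetical choices made for illustration; the abstract does not specify the node features or how pages are delimited.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the (hypothetical) document tree."""
    level: str                      # "document", "page", or "paragraph"
    features: List[float]           # normalized word histogram over the vocabulary
    children: List["Node"] = field(default_factory=list)


def histogram(text: str, vocab: List[str]) -> List[float]:
    """Relative frequency of each vocabulary word in the text."""
    counts = Counter(text.lower().split())
    raw = [counts[word] for word in vocab]
    total = sum(raw)
    return [c / total for c in raw] if total else [0.0] * len(vocab)


def build_tree(document: str, vocab: List[str]) -> Node:
    """Build a document -> pages -> paragraphs tree.

    Assumption: pages are separated by form feeds and paragraphs by
    blank lines. The root encodes global features; children encode
    local features of their own span of text.
    """
    root = Node("document", histogram(document, vocab))
    for page in document.split("\f"):
        if not page.strip():
            continue
        page_node = Node("page", histogram(page, vocab))
        for para in page.split("\n\n"):
            if para.strip():
                page_node.children.append(
                    Node("paragraph", histogram(para, vocab))
                )
        root.children.append(page_node)
    return root


# Toy input: two pages, the first containing two paragraphs.
doc = "neural network model\n\nmap training\fdocument tree"
vocab = ["neural", "map", "tree"]
tree = build_tree(doc, vocab)
```

In a pipeline like the one the abstract describes, node features of this kind would be fed to the SOM and MLP layers; that part is not sketched here because the abstract gives no details of the network configuration.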

Keywords: Document classification, Hierarchical organization, Tree-structured features, Self-organizing map, Multi-layer hybrid network

Publication history: Available online 11 September 2009.

Paper URL: https://doi.org/10.1016/j.eswa.2009.09.002