Learning Semi-Structured Document Categorization Using Bounded-Length Spectrum Sub-Sequence Kernels

Author: Olivier de Vel

Abstract

In this paper we investigate learning to categorize semi-structured documents. We automatically discover low-level, short-range byte patterns in a document data stream by extracting all byte sub-sequences within a sliding window, which yields an augmented (or bounded-length) string spectrum feature map. A modified suffix trie data structure, called the coloured generalized suffix tree (CGST), stores and manipulates this feature map efficiently and lets us compute the stream's bounded-length sequence spectrum kernel efficiently. We compare two classifiers for categorizing the data streams: the support vector machine (SVM) and Naive Bayes (NB). Experiments show good classification performance on a variety of document byte streams, particularly with the NB classifier under certain parameter settings. The results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
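
To make the feature map concrete, here is a minimal sketch of a bounded-length spectrum and its kernel. It rests on several assumptions that the abstract does not confirm: sub-sequences are taken to be contiguous, the kernel is taken to be the plain inner product of occurrence counts, and a Python `Counter` stands in for the CGST, so none of the paper's efficiency properties apply. The names `bounded_spectrum`, `spectrum_kernel`, and `max_len` are hypothetical.

```python
from collections import Counter


def bounded_spectrum(data: bytes, max_len: int) -> Counter:
    """Count every contiguous byte sub-sequence of length 1..max_len.

    Assumption: 'bounded-length' means all lengths up to the window size,
    as opposed to a fixed-length spectrum that keeps one length only.
    """
    counts = Counter()
    for start in range(len(data)):
        # The window starting at `start` covers at most max_len bytes.
        longest = min(max_len, len(data) - start)
        for length in range(1, longest + 1):
            counts[data[start:start + length]] += 1
    return counts


def spectrum_kernel(x: bytes, y: bytes, max_len: int) -> int:
    """Inner product of the two bounded-length spectrum count vectors."""
    fx = bounded_spectrum(x, max_len)
    fy = bounded_spectrum(y, max_len)
    # Iterate over the smaller map; missing keys count as zero.
    if len(fx) > len(fy):
        fx, fy = fy, fx
    return sum(count * fy[sub] for sub, count in fx.items())
```

For example, `bounded_spectrum(b"abab", 2)` counts `a` and `b` twice each, `ab` twice, and `ba` once. A fixed-length spectrum with length 2 would keep only the last two entries, which is the distinction the abstract draws between the two kernels.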

Keywords: document categorization, suffix tree, bounded-length spectrum, kernel, support vector machines, Naive Bayes classifier, computer forensics, digital forensics

Paper URL: https://doi.org/10.1007/s10618-005-0037-z