Learning heterogeneous graph embedding for Chinese legal document similarity

Abstract

Measuring the similarity between legal documents, so that prior documents similar to a current one can be retrieved from a massive collection, is an essential component of legal assistant systems. Such systems can automatically link related legal documents to help ensure that the same situations are treated identically in judicial practice. Most existing approaches rely on text- and citation-based methods to calculate the similarity between legal documents. However, those methods have difficulty capturing the semantics of many legal entities and therefore struggle to produce accurate similarity estimates. The main reason is the lack of legal domain knowledge and of citation relations between legal documents. To address these challenges, we introduce a practical, generic heterogeneous graph representation learning approach based on a legal heterogeneous knowledge graph. Specifically, we construct a heterogeneous knowledge graph containing legal entities and documents and develop a graph-based embedding model called L-HetGRL. A legal entity can simply be a law-related encyclopedia entry that contains legal domain knowledge, which is used to enhance document representation. L-HetGRL learns legal document information and external legal domain knowledge in a unified manner by jointly considering heterogeneous content. In addition, we design a legal case-aware semantic alignment module that effectively combines legal entities with their semantics in documents, thus improving the representation of entities. We conduct comprehensive experiments on two real-world datasets, covering similar case matching and charge prediction, to evaluate the performance of L-HetGRL. The experimental results demonstrate that L-HetGRL outperforms other competitive baselines. Finally, we present a series of suggestions for document representation in the legal domain, which provide guidelines for follow-up studies.