Generating realistic cyber data for training and evaluating machine learning classifiers for network intrusion detection systems

Authors:

Highlights:

• Machine learning classification algorithms require large training sets.

• The lack of labeled cyber data precludes detection of malicious behavior.

• Generative methods are used to synthesize labeled cyber data.

• Classifiers fit with only synthetic data underperform those fit with only real data.

• Offsetting 85% of real training data with synthetic data did not reduce performance.
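
The last highlight describes a mixed training set in which most real records are replaced by synthetic ones. The sketch below illustrates one way such a mix could be assembled. It assumes that "offsetting" means holding the training-set size fixed while swapping a share of real rows for synthetic rows. The function name `mix_training_set`, the toy records, and the sampling scheme are hypothetical illustrations, not the authors' procedure.

```python
import random


def mix_training_set(real, synthetic, synthetic_fraction, seed=0):
    """Build a training set the same size as `real`, with the given
    share of rows drawn from `synthetic` (hypothetical helper)."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")

    n_total = len(real)
    n_synth = round(n_total * synthetic_fraction)
    n_real = n_total - n_synth
    if n_synth > len(synthetic):
        raise ValueError("not enough synthetic rows for the requested share")

    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the union.
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed


# Toy records of the form (features, label, source); the values are made up.
real = [((i, i % 3), i % 2, "real") for i in range(100)]
synthetic = [((i, i % 5), i % 2, "synthetic") for i in range(200)]

train = mix_training_set(real, synthetic, synthetic_fraction=0.85)
n_synth = sum(1 for _, _, source in train if source == "synthetic")
print(len(train), n_synth)
```

A classifier fit on `train` could then be compared against one fit on `real` alone, using the same held-out real test set for both.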

Keywords: Generative machine learning, Generative adversarial network, Variational autoencoder, Synthetic data generation, Network intrusion detection

Article history: Received 3 January 2022, Revised 28 May 2022, Accepted 19 June 2022, Available online 27 June 2022, Version of Record 2 July 2022.

Official paper URL: https://doi.org/10.1016/j.eswa.2022.117936