# README: Anomaly Detection in Graph Data Using Isolation Forest
This README documents the Python implementation of a machine-learning anomaly detection system for graph streams. The approach targets Advanced Persistent Threats (APTs) by using both the structural and the temporal characteristics of graph data, as described in our publication in Applied Intelligence (Megherbi et al., 2024). It relies on hashing for a compact data representation and on a dynamic learning model, which together allow incremental anomaly detection with low memory usage.
## Overview
The algorithm combines the Isolation Forest technique with hash-based vectorization of graph data to detect anomalies. Graph streams often capture complex and evolving interactions, such as those found in telecommunications or network traffic. The method is designed to cope with the volume of streaming data and to capture subtle changes over time that may indicate malicious activity. The empirical results reported in our publication support its use for real-time detection of APTs at an early stage.
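To illustrate how hashing can turn a graph of arbitrary size into a fixed-length vector, here is a minimal Python sketch. It is not the implementation from the publication: the token built from each edge, the use of SHA-256, and the function names `shingles` and `hash_vector` are assumptions chosen for illustration. The `vector_size` and `string_size` arguments mirror the `--vector_size` and `--string_size` parameters of the script.

```python
import hashlib


def shingles(text, size):
    """Return every contiguous substring of length `size` (the whole text if it is shorter)."""
    if len(text) <= size:
        return [text]
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def hash_vector(edges, vector_size=128, string_size=4):
    """Project (src_type, e_type, dst_type) edges onto a fixed-size count vector.

    Assumption: each edge is flattened into one string and cut into chunks of
    `string_size` characters; each chunk increments one hashed bucket.
    """
    vector = [0] * vector_size
    for src_type, e_type, dst_type in edges:
        token = f"{src_type}{e_type}{dst_type}"
        for chunk in shingles(token, string_size):
            digest = hashlib.sha256(chunk.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:8], "big") % vector_size
            vector[bucket] += 1
    return vector


# Hypothetical edges, for illustration only.
edges = [("process", "read", "file"), ("process", "write", "socket")]
vector = hash_vector(edges)
print(len(vector), sum(vector))  # 128 27
```

Because each edge only increments counters, a vector of this kind can be updated as new edges arrive without storing the graph itself. Vectors built this way could then be passed to scikit-learn's `IsolationForest` for scoring.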
## Dependencies
The script requires Python 3.6 or newer and the following external libraries:
- **Pandas**: For data manipulation and ingestion.
- **NumPy**: For numerical operations.
- **scikit-learn**: For implementing the Isolation Forest algorithm and various metrics.
## Installation
Before running the script, install the required Python libraries with pip:
```bash
pip install numpy pandas scikit-learn
```
## Usage Instructions
The script is run from the command line. The parameters below customize the model's training and testing processes.
### Parameters:
- `--data_path`: Specifies the path to the dataset file.
- `--train_ids`: Path to the file containing identifiers for training graphs.
- `--test_ids`: Path to the file containing identifiers for testing graphs.
- `--vector_size`: Specifies the size of the feature vectors (default 128).
- `--string_size`: Determines the length of the strings used for hashing (default 4).
- `--output`: Designates the path for saving the output file containing detection results.
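As a rough guide to how these parameters fit together, the sketch below declares them with `argparse`. It is an assumption about the command-line interface, not a copy of the script: the integer types, the choice of which arguments are required, and the file names passed in the example are all hypothetical.

```python
import argparse


def build_parser():
    """Declare the documented parameters (types and `required` flags are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Anomaly detection in graph streams with Isolation Forest."
    )
    parser.add_argument("--data_path", required=True, help="Path to the dataset file.")
    parser.add_argument("--train_ids", required=True, help="File listing training graph identifiers.")
    parser.add_argument("--test_ids", required=True, help="File listing testing graph identifiers.")
    parser.add_argument("--vector_size", type=int, default=128, help="Size of the feature vectors.")
    parser.add_argument("--string_size", type=int, default=4, help="Length of the strings used for hashing.")
    parser.add_argument("--output", required=True, help="Path of the results file.")
    return parser


# Placeholder file names, for illustration only.
args = build_parser().parse_args([
    "--data_path", "edges.tsv",
    "--train_ids", "train_ids.txt",
    "--test_ids", "test_ids.txt",
    "--output", "results.csv",
])
print(args.vector_size, args.string_size)  # 128 4
```

Omitting `--vector_size` and `--string_size`, as in the example, falls back to the documented defaults of 128 and 4.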
The input dataset is expected to be a tab-separated values (TSV) file with six columns, in this order: `src_id`, `src_type`, `dst_id`, `dst_type`, `e_type`, and `graph_id`. The file must not contain a header row.
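The following sketch shows one way to read a file in this layout and group its edges by `graph_id`, using only the standard library. The helper name `read_edges` and the sample rows are hypothetical; the script itself may load the data differently, for instance with Pandas.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["src_id", "src_type", "dst_id", "dst_type", "e_type", "graph_id"]


def read_edges(handle):
    """Group the rows of a headerless, six-column TSV stream by graph_id."""
    graphs = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"line {line_no}: expected 6 columns, got {len(row)}")
        edge = dict(zip(COLUMNS, row))
        graphs[edge["graph_id"]].append(edge)
    return dict(graphs)


# Made-up sample rows, for illustration only.
sample = (
    "1\tprocess\t2\tfile\tread\t0\n"
    "1\tprocess\t3\tsocket\twrite\t0\n"
    "4\tprocess\t5\tfile\twrite\t1\n"
)
graphs = read_edges(io.StringIO(sample))
print({graph_id: len(edges) for graph_id, edges in graphs.items()})  # {'0': 2, '1': 1}
```

With Pandas, the same layout can be loaded with `pandas.read_csv(path, sep="\t", header=None, names=COLUMNS)`.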
## Output Format
The output is a CSV file with semicolon separators, including the following metrics for each tested graph: AUC (Area Under the ROC Curve), Balanced Accuracy, Average Precision, False Positive Rate, and False Negative Rate.
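For readers who want to check the reported rates by hand, here is a small sketch that derives the False Positive Rate, False Negative Rate, and Balanced Accuracy from binary labels and writes them with a semicolon separator. The column names, the helper `confusion_rates`, and the labels are assumptions for illustration; the header written by the actual script may differ.

```python
import csv
import io


def confusion_rates(y_true, y_pred):
    """Compute FPR, FNR and balanced accuracy from binary labels (1 = anomaly)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {
        # Balanced accuracy is the mean of the true positive and true negative rates.
        "balanced_accuracy": ((1 - fnr) + (1 - fpr)) / 2,
        "false_positive_rate": fpr,
        "false_negative_rate": fnr,
    }


# Made-up labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
rates = confusion_rates(y_true, y_pred)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rates), delimiter=";")
writer.writeheader()
writer.writerow(rates)
print(buffer.getvalue())
# balanced_accuracy;false_positive_rate;false_negative_rate
# 0.625;0.25;0.5
```

AUC and Average Precision depend on continuous anomaly scores rather than binary predictions; scikit-learn provides `roc_auc_score` and `average_precision_score` in `sklearn.metrics` for that purpose.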
## Contribution Guidelines
Contributions to enhance or extend the functionality of this script are welcome. Interested contributors are encouraged to fork the repository and submit pull requests for review.
## Licensing
This project and its contents are provided under the MIT License, details of which can be found in the accompanying LICENSE file.