
HiGraph: A Large-Scale Hierarchical Graph Dataset

Hierarchical Graph Dataset for Malware Analysis with Function Call Graphs and Control Flow Graphs

Han Chen¹, Hanchen Wang¹, Hongmei Chen², Ying Zhang¹, Lu Qin¹, Wenjie Zhang³
¹University of Technology Sydney, ²Yunnan University, ³University of New South Wales

Interactive Graph Visualization

Explore the hierarchical structure of malware samples through our interactive visualization tool.

Abstract

Graph-based methods have shown great promise in malware analysis, yet the lack of large-scale, hierarchical graph datasets limits further advances in this field. To bridge this gap, we introduce HiGraph, a novel, large-scale dataset that models each application as a hierarchical graph: a local Control Flow Graph (CFG) capturing intra-function logic and a global Function Call Graph (FCG) capturing inter-function interactions.

This hierarchical design facilitates the development of robust detection models that are more resilient to obfuscation, model aging, and malware evolution. HiGraph contains over 200M control flow graphs and 595K function call graphs, preserving rich semantic and structural information crucial for analyzing sophisticated malware behaviors. We provide an in-depth analysis of HiGraph and highlight its potential as a benchmark dataset for advancing hierarchical graph learning in cybersecurity.

Dataset Overview

  • Function Call Graphs: 595K+
  • Control Flow Graphs: 200M+
  • Compressed Size: 6.17GB
  • Time Span: 11 years (2012-2022)

Hierarchical Graph Structure

HiGraph models each application as a hierarchical graph, preserving both local and global structural information.

HiGraph Hierarchical Structure Overview
  • Program Level: Function Call Graphs (FCGs) capture the global program structure and inter-function relationships.
  • Function Level: Control Flow Graphs (CFGs) represent detailed intra-function logic and control flow.
  • Malware Analysis: Rich semantic information enables advanced malware detection and classification.
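To make the two levels concrete, here is a minimal Python sketch of one way such a hierarchical graph could be held in memory. It is an illustration only, not the dataset's actual on-disk format, which this page does not describe. The class names, field names, and the toy application are all hypothetical; the only assumptions are the standard ones that FCG nodes are functions and CFG nodes are basic blocks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ControlFlowGraph:
    """Function level: nodes are basic blocks, edges are control transfers."""
    blocks: list[str]
    edges: list[tuple[int, int]] = field(default_factory=list)


@dataclass
class HierarchicalGraph:
    """Program level: nodes are functions, edges are calls.

    Each function index may map to its own CFG. In this sketch a function
    without a stored CFG (e.g. an external call target) is simply absent.
    """
    functions: list[str]
    calls: list[tuple[int, int]] = field(default_factory=list)
    cfgs: dict[int, ControlFlowGraph] = field(default_factory=dict)

    def callees(self, fn: int) -> list[str]:
        """Names of the functions called directly by function index `fn`."""
        return [self.functions[dst] for src, dst in self.calls if src == fn]

    def num_blocks(self) -> int:
        """Total number of basic blocks across all stored CFGs."""
        return sum(len(cfg.blocks) for cfg in self.cfgs.values())


# Toy application with three functions; all names here are invented.
app = HierarchicalGraph(
    functions=["main", "decrypt", "send"],
    calls=[(0, 1), (0, 2), (1, 2)],
    cfgs={
        0: ControlFlowGraph(["entry", "loop", "exit"], [(0, 1), (1, 1), (1, 2)]),
        1: ControlFlowGraph(["entry", "exit"], [(0, 1)]),
    },
)

print(app.callees(0))    # functions called directly by "main"
print(app.num_blocks())  # basic blocks summed over the stored CFGs
```

Keeping the CFGs in a dictionary keyed by function index lets a model process the call graph on its own or descend into individual functions only where needed.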

Download Dataset

Access the complete HiGraph dataset through Hugging Face.

  • Dataset Size: 6.17GB (compressed)
  • Time Span: 2012-2022 (11 years of samples)
  • License: CC-BY-NC-SA (Creative Commons)
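For readers who prefer a scripted download, the sketch below shows one possible approach using `snapshot_download` from the `huggingface_hub` package. The repository id is not stated in this text, so `HIGRAPH_REPO_ID` is a placeholder that must be replaced with the id shown on the Hugging Face page; the helper name `download_higraph` and the default target directory are likewise hypothetical.

```python
from pathlib import Path

# Placeholder only: the real repository id is not given in this text.
HIGRAPH_REPO_ID = "<owner>/<higraph-repo>"


def download_higraph(repo_id: str = HIGRAPH_REPO_ID, local_dir: str = "higraph") -> Path:
    """Download all files of a Hugging Face dataset repository into `local_dir`."""
    if "<" in repo_id:
        raise ValueError("Replace the placeholder with the real dataset repository id.")

    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",  # the files live in a dataset repo, not a model repo
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_higraph("owner/name")` with the real id should fetch roughly 6.17GB of compressed data, so check for sufficient disk space first.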

Updates

Changelog

Latest updates and improvements to the HiGraph dataset.

May 16, 2025

Initial release of the HiGraph dataset.

HiGraph, a novel, large-scale dataset that models each application as a hierarchical graph, is made publicly available. This initial version includes over 200 million Control Flow Graphs (CFGs) and over 595,000 Function Call Graphs (FCGs).

Future Plans

Continued development and expansion of the HiGraph dataset.

  • Regular updates with new samples and features.
  • Integration of more advanced graph analysis tools.
  • Community contributions and collaborations.