Learning Data Structure Behaviour from Executions of Pointer Programs

Jan H. Boockmann & Gerald Luettgen

This project concerns the development of techniques to automatically and reliably learn the data structure behaviour in a pointer program by analysing traces of its execution. For example, when considering a program that employs a linked list to implement a queue, we would like to learn the shape of the data structure (singly linked list), its behaviour (only operations consistent with a queue are permitted) and any relevant additional information (such as whether the list uses a header node).

Pointer programs, such as those written in C, are notoriously difficult to comprehend, and this is especially true for legacy or low-level code, e.g., that found in OSs or device drivers. In such situations it is not uncommon to see programmers employ complex usages of pointers, types and memory allocation to achieve the desired behavior or efficiency. These constructs are often used to implement the dynamic data structures of a program, and thus data structures can form a major obstacle in program analysis.

To partially alleviate this obstacle, we present the tool DSI (Data Structure Investigator) designed to automatically identify data structures.

Applications

We envisage the output of DSI to be of use across a number of domains, including the following:

  • Program comprehension, where we provide the user with detailed feedback about data structure usage via a GUI.
  • Heap visualization, where we use the relationships between data structures to produce visually intuitive layouts of the heap.
  • Formal verification, where we generate program annotations describing data structure use, such as function contracts, to support the verification process.
  • Reverse engineering, where we accept binaries as input for the purpose of analyzing complex software such as malware.

Data Structure Investigator (DSI)

Data Structure Investigator (DSI) is our most recent prototype for the automatic identification of dynamic data structures in C programs. It is based on a dynamic analysis that requires source code and is split into an online phase, where a trace is collected by executing an instrumented program, and an offline phase, where that trace is analyzed to identify the data structures. The identification of singly/doubly (cyclic) linked lists, trees and skip lists are all in scope, as well as relationships between data structures such as nesting, sharing and overlay (where one allocated chunk of memory contains the nodes of multiple distinct data structures).

DSI comprises two key contributions: the first is a rich abstraction of program memory that allows, e.g., interpretation of complex list implementations such as those employed by the Linux kernel and built-in support for the handling of custom memory allocators. Turning to the second contribution, a significant challenge posed by data structures is that their key structural characteristics may temporally disappear during manipulation operations, the clearest example of this being insertion to a doubly-linked list. We resolve this by employing an evidence-based approach where we build evidence for the various interpretations of a data structure over its lifetime, and then ultimately select the most viable interpretation.

We have shown that DSI can reliably detect data structures across a range of examples from programs taken from textbooks, to challenging programs taking from the literature and ultimately to real-world programs such as bash and libusb.

For more details on the approach employed by DSI, please consult the slides below:

Grants

  • G. Lüttgen. Learning Data Structure Behaviour from Executions of Pointer Programs. Deutsche Forschungsgemeinschaft (DFG, LU 1748/4-1), 2014. (Case for Support)
  • TODO: 2nd grant

Publications (Peer-reviewed)

  • J. H. Boockmann, G. Lüttgen, and J. T. Mühlberg: Generating Inductive Shape Predicates for Runtime Checking and Formal Verification. In Proceedings of the 8th Intl. Symposium on Leveraging Applications of Formal Methods, Verification and Validation (ISoLA 2018). Springer, 2018. (DOI: 10.1007/978-3-030-03421-4_5)
  • T. Rupprecht, X. Chen, D. H. White, J. H. Boockmann, G. Lüttgen, and H. Bos. DSIbin: Identifying Dynamic Data Structures in C/C++ Binaries. In 32nd Intl. Conf. on Automated Software Engineering (ASE 2017). IEEE/ACM, 2017. (DOI: 10.1109/ASE.2017.8115646)
  • D. White, T. Rupprecht and G. Lüttgen. DSI: An Evidence-Based Approach to Identify Dynamic Data Structures in C Programs. In The International Symposium on Software Testing and Analysis, (ISSTA 2016), pp. 259-269. ACM, 2016. (DOI: 10.1145/2931037.2931071) (Supplemental Material: Appendix & Examples)
  • J. T. Mühlberg, D. White, M. Dodds, G. Lüttgen and F. Piessens. Learning Assertions to Verify Linked-List Programs. In Intl. Conf. on Software Engineering and Formal Methods, (SEFM 2015), LNCS 9276, pp. 37-52. Springer, 2015. (DOI: 10.1007/978-3-319-22969-0_3)
  • D. White. dsOli: Data Structure Operation Location and Identification. In Intl. Conf. on Program Comprehension (ICPC 2014), pp. 48-52. ACM, 2014. (DOI: 10.1145/2597008.2597800)
  • D. White and G. Lüttgen. Identifying Dynamic Data Structures by Learning Evolving Patterns in Memory. In Intl. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2013), LNCS 7795, pp. 354-369. Springer, 2013. (DOI: 10.1007/978-3-642-36742-7_25)

Other Publications

  • T. Rupprecht, J. H. Boockmann, D. H. White, and G. Lüttgen. DSI: Automated Detection of Dynamic Data Structures in C Programs and Binary Code. In 19th Coll. on Programming Languages and Foundations of Programming (Kolloquium Programmiersprachen, KPS 2017), 2017. Proceedings available online at http://www.kps2017.uni-jena.de/kps2017_proceedings.html.
  • D. White, T. Rupprecht and G. Lüttgen. dsOli2: Discovery and Comprehension of Interconnected Lists in C Programs. At Kolloquium Programmiersprachen und Grundlagen der Programmierung (KPS 2015), 2015. (PDF)

Further Talks

  • G. Lüttgen. Automated detection of dynamic data structures in pointer programs. BSR Summer School, 3TU: Big Software on the Run, Berg en Dal, The Netherlands, July 2018.
  • T. Rupprecht. DSI: An Evidence-Based Approach to Identify Dynamic Data Structures in C Programs. VU Amsterdam, Netherlands, 2016.
  • T. Rupprecht. dsOli2: Discovery and Comprehension of Interconnected Lists in C Programs. At Kolloquium Programmiersprachen und Grundlagen der Programmierung (KPS 2015), Pörtschach, Austria, 2015.
  • T. Rupprecht. dsOli2: Discovery and Comprehension of Interconnected Lists in C Programs. KU Keuven, Belgium, 2015.
  • D. White. Learning a Program's usage of Dynamic Data Structures from Sample Executions. At Approaches and Applications of Inductive Programming, Dagstuhl, Germany, 2013.
  • D. White. Comprehension of Data Structures using Evolving Structures in Memory. At Advanced Heap Analysis and Verification Workshop, Bamberg, Germany, 2013.

Posters

  • T. Rupprecht, X. Chen, D. White, J. T. Mühlberg, H. Bos and G. Lüttgen. POSTER: Identifying Dynamic Data Structures in Malware. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, (CCS 2016), pp. 1772-1774. ACM, 2016. (DOI: 10.1145/2976749.2989041)
  • D. White. dsOli: Data Structure Operation Location and Identification. At Intl. Conf. on Program Comprehension (ICPC 2014), 2014. (PDF)

Download

DSI is open source and available at GitHub (https://github.com/uniba-swt)