Learning Data Structure Behaviour from Executions of Pointer Programs
This project concerns the development of techniques to automatically and reliably learn the data structure behaviour in a pointer program by analysing traces of its execution. For example, when considering a program that employs a linked list to implement a queue, we would like to learn the shape of the data structure (singly linked list), its behaviour (only operations consistent with a queue are permitted) and any relevant additional information (such as whether the list uses a header node).
Pointer programs, such as those written in C, are notoriously difficult to comprehend, and this is especially true for legacy or low-level code, e.g., that found in OSs or device drivers. In such situations it is not uncommon to see programmers employ complex usages of pointers, types and memory allocation to achieve the desired behavior or efficiency. These constructs are often used to implement the dynamic data structures of a program, and thus data structures can form a major obstacle in program analysis.
To partially alleviate this obstacle, we present the tools dsOli (Data Structure Operation Location and Identification) and DSI (Data Structure Investigator) designed to automatically identify data structures.
We envisage the output of DSI and dsOli to be of use across a number of domains, including the following:
- Program comprehension, where we provide the user with detailed feedback about data structure usage via a GUI.
- Heap visualization, where we use the relationships between data structures to produce visually intuitive layouts of the heap.
- Formal verification, where we generate program annotations describing data structure use, such as function contracts, to support the verification process.
- Reverse engineering, where we accept binaries as input for the purpose of analyzing complex software such as malware.
Data Structure Investigator (DSI)
Data Structure Investigator (DSI) is our most recent prototype for the automatic identification of dynamic data structures in C programs. It is based on a dynamic analysis that requires source code and is split into an online phase, where a trace is collected by executing an instrumented program, and an offline phase, where that trace is analyzed to identify the data structures. The identification of singly/doubly (cyclic) linked lists, trees and skip lists are all in scope, as well as relationships between data structures such as nesting, sharing and overlay (where one allocated chunk of memory contains the nodes of multiple distinct data structures).
DSI comprises two key contributions: the first is a rich abstraction of program memory that allows, e.g., interpretation of complex list implementations such as those employed by the Linux kernel and built-in support for the handling of custom memory allocators. Turning to the second contribution, a significant challenge posed by data structures is that their key structural characteristics may temporally disappear during manipulation operations, the clearest example of this being insertion to a doubly-linked list. We resolve this by employing an evidence-based approach where we build evidence for the various interpretations of a data structure over its lifetime, and then ultimately select the most viable interpretation.
We have shown that DSI can reliably detect data structures across a range of examples from programs taken from textbooks, to challenging programs taking from the literature and ultimately to real-world programs such as bash and libusb.
For more details on the approach employed by DSI, please consult the slides below:
Data Structure Operation Location and Identification (dsOli)
For a high-level overview of our first prototype tool Data Structure Operation Location and Identification (dsOli), please consult the poster presented at ICPC 2014. For more details please see the TACAS 2013 or ICPC 2014 papers available from the links below.
- G. Lüttgen. Learning Data Structure Behaviour from Executions of Pointer Programs. Deutsche Forschungsgemeinschaft (DFG, LU 1748/4-1), 2014. (Case for Support)
- T. Rupprecht, X. Chen, D. H. White, J. H. Boockmann, G. Lüttgen, and H. Bos. DSIbin: Identifying dynamic data structures in C/C++ binaries. In 32nd Intl. Conf. on Automated Software Engineering (ASE ’17). IEEE/ACM, 2017. Accepted for publication.
- T. Rupprecht, X. Chen, D. White, J. T. Mühlberg, H. Bos and G. Lüttgen. POSTER: Identifying Dynamic Data Structures in Malware. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, (CCS 2016), pp. 1772-1774. ACM, 2016. (DOI: 10.1145/2976749.2989041)
- D. White, T. Rupprecht and G. Lüttgen. DSI: An Evidence-Based Approach to Identify Dynamic Data Structures in C Programs. In The International Symposium on Software Testing and Analysis, (ISSTA 2016), pp. 259-269. ACM, 2016. (DOI: 10.1145/2931037.2931071) (Supplemental Material: Appendix & Examples)
- J. T. Mühlberg, D. White, M. Dodds, G. Lüttgen and F. Piessens. Learning assertions to verify linked-list programs. In Intl. Conf. on Software Engineering and Formal Methods, (SEFM 2015), LNCS 9276, pp. 37-52. Springer, 2015. (DOI: 10.1007/978-3-319-22969-0_3)
- D. White. dsOli: Data Structure Operation Location and Identification. In Intl. Conf. on Program Comprehension (ICPC 2014), pp. 48-52. ACM, 2014. (DOI: 10.1145/2597008.2597800)
- D. White and G. Lüttgen. Identifying Dynamic Data Structures by Learning Evolving Patterns in Memory. In Intl. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2013), LNCS 7795, pp. 354-369. Springer, 2013. (DOI: 10.1007/978-3-642-36742-7_25)
- T. Rupprecht, J. H. Boockmann, D. H. White, and G. Lüttgen. DSI: Automated Detection of Dynamic Data Structures in C Programs and Binary Code. In 19th Coll. on Programming Languages and Foundations of Programming (Kolloquium Programmiersprachen, KPS ’17), 2017. Proceedings available online at http://www.kps2017.uni-jena.de/kps2017_proceedings.html.
- D. White, T. Rupprecht and G. Lüttgen. dsOli2: Discovery and Comprehension of Interconnected Lists in C Programs. At Kolloquium Programmiersprachen und Grundlagen der Programmierung (KPS 2015), 2015. (PDF)
- T. Rupprecht. DSI: An Evidence-Based Approach to Identify Dynamic Data Structures in C Programs. VU Amsterdam, Netherlands, 2016.
- DSI/dsOli Day. Please see below for details of the talks. Bamberg, Germany, 2016.
- T. Rupprecht. dsOli2: Discovery and Comprehension of Interconnected Lists in C Programs. At Kolloquium Programmiersprachen und Grundlagen der Programmierung (KPS 2015), Pörtschach, Austria, 2015.
- T. Rupprecht. dsOli2: Discovery and Comprehension of Interconnected Lists in C Programs. KU Keuven, Belgium, 2015.
- D. White. dsOli: Data Structure Operation Location and Identification. At Intl. Conf. on Program Comprehension (ICPC 2014), Hyderabad, India, 2014.
- D. White. Learning a Program's usage of Dynamic Data Structures from Sample Executions. At Approaches and Applications of Inductive Programming, Dagstuhl, Germany, 2013.
- D. White. Comprehension of Data Structures using Evolving Structures in Memory. At Intl. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2013), Rome, Italy, 2013.
- D. White. Comprehension of Data Structures using Evolving Structures in Memory. At Advanced Heap Analysis and Verification Workshop, Bamberg, Germany, 2013.
- D. White. dsOli: Data Structure Operation Location and Identification. At Intl. Conf. on Program Comprehension (ICPC 2014), 2014. (PDF)
DSI is open source and available at GitHub (https://github.com/uniba-swt)
In January 2016 we held a small workshop to discuss our findings thus far and to decide on future work; abstracts and slides of the talks given at the workshop can be found below.
DSI Current Version
David White - Slides
Comprehension of C programs containing pointer-based dynamic data structures can be a challenging task. To tackle this challenge we present Data Structure Investigator (DSI), a new dynamic analysis for automated data structure identification that targets C source code. Our technique first applies a novel abstraction on the evolving memory structures observed at runtime to discover data structure building blocks. By analyzing the interconnections between building blocks we are then able to identify, e.g., binary trees, doubly-linked lists, skip lists, and relationships between these such as nesting. Since the true shape of a data structure may be temporarily obscured by manipulation operations, we ensure robustness by first discovering and then reinforcing evidence for data structure observations. We show the utility of our DSI prototype implementation by applying it to both synthetic and real world examples. DSI outputs summarizations of the identified data structures, which will benefit software developers when maintaining (legacy) code.
Force-Directed Layout with DSI-Guided Constraints for Visualizing Points-to Graphs
Kathrin Welzel - Slides
DSI provides useful information about dynamically allocated data structures, however, currently this output is provided as XML. Analyzing the XML by hand is an unintuitive and cumbersome process, and thus, to facilitate the use of DSI for developers, we wish to generate clear and useful visualizations of DSI's output.
We currently only consider the visualization of the memory transformations recognized by DSI, i.e., sequences of points-to graphs. Users have strong expectations of how certain data structures should be represented, e.g., trees are typically displayed top-down from the root and lists are displayed in a line. In many cases good layout algorithms already exist, however, the combinations of data structures that arise due to nesting complicate the layout. Additionally, it is desirable to have a temporally consistent layout, such that the transformations from one program time step to the next can be easily seen.
To handle these complications we employ a force directed layout to derive the general layout, and specify additional constraints for specific data structures, such as "position all vertices of a list in one vertical (or horizontal) line". Then, by seeding force directed layout for the next program time step with the current vertex positions, we hope to achieve the temporal consistency we seek. We believe both techniques in combination will lead to aesthetically pleasing and intuitive graph representations over multiple program time steps.
Functional Unit/Operation Detection via Repetitive Patterns
Linus Dietz - Slides
Analyzing the memory layout and usage of programs during runtime is a sophisticated task that requires automated approaches to extract high-level information such as data structure operations, e.g., the insertion into a list, from the myriads of low-level instructions. Programs typically display highly repetitive behavior, where a single functional unit will be invoked many times; This is particularly true for data structures where manipulation is performed via a limited set of interface functions. The goal of our work is to exploit this repetitive behavior to locate such data structure interface functions.
In contrast to the original version of dsOli, we now observe memorymanipulations from the point of view of each entry pointer, and therefore our input data takes a multidimensional form. We have designed a domain specific parallelized genetic algorithm to locate repetitive functional units featuring a highly optimized fitness evaluation to make the approach feasible. The robustness of the approach to provide good segmentations was tested and proven using synthetic programs. We have now moved on to analyze real world programs such as the UNIX file listing program 'ls'.
Experiences working with Verifast, Predator and Forester
Jan Boockmann - Slides
Shape analysis techniques statically determine whether the runtime heap of a program can satisfy certain properties, and are applicable in areas such as program verification. Forester and Predator are both state-of-the-art shape analyzer tools, which will be discussed with respect to practical experience and presented in a short live demo. VeriFast on the other hand is a verifier for single-threaded and multi-threaded C and Java programs annotated with preconditions and postconditions written in separation logic. Its core components and the typical workflow will be shown using a live demo. It will be discussed in the end how DSI could help to enhance the use of VeriFast.
Verifying Linked-List Programs and Beyond
Tobias Muehlberg - Slides
VeriFast is a sound and modular verification tool for C and Java. It takes as input a number of source files annotated with method contracts written in separation logic, inductive data type and fixpoint definitions, lemma functions and proof steps. The verifier checks that (1) the program does not perform illegal operations such as dividing by zero or illegal memory accesses and (2) that the assumptions described in method contracts hold in each execution. In recent work we have successfully employed the dynamic analysis of dsOli to locate and identify data structure operations in C programs that make use of linked lists, and then use this information to automatically annotate the source code of that program so as to verify the data structure operations for memory safety.
In this talk I will give an overview of ongoing work on case studies and on improving automation in VeriFast. I will focus on challenges that are related to the verification of software components that are meant to execute under the stringent isolation guarantees provided by the embedded protected module architectures. Specifically, I will cover interesting verification objectives for these software components, which reach beyond memory safety and include, e.g., security properties, timing properties and the specification and verification of components' input/output behaviour. The goal of this presentation is to further explore synergy between the dsOli team and ongoing work in formal software verification.
VeriFast and a large number of example programs are available online at: http://www.cs.kuleuven.be/~bartj/verifast
Current Progress on Accepting Object Code as Input to DSI
Thomas Rupprecht - Slides
This talk reports on preliminary work for the transition of DSI from requiring source code as input to also accepting object code. The main motivation for this step are the broader usage scenarios of reverse engineering or monitoring of applications with inaccessible source code. The main challenge when migrating to object code is the loss of information available, especially which regards to the high-level types. Thus, this talk will focus on existing tools and algorithms for type recovery and contrast these results with the information needed by DSI. With this foundation laid, the next steps of the transition to object code can be decided, which includes determining the type recovery strategy and potential changes to DSI's core algorithm.