Introduction to the DWARF Debugging Format
Michael J. Eager, Eager Consulting
April, 2012
It would be wonderful if we could write programs that were guaranteed to work correctly and never needed to be debugged. Until that halcyon day, the normal programming cycle is going to involve writing a program, compiling it, executing it, and then the (somewhat) dreaded scourge of debugging it. And then repeat until the program works as expected.

It is possible to debug programs by inserting code that prints values of selected interesting variables. Indeed, in some situations, such as debugging kernel drivers, this may be the preferred method. There are low-level debuggers that allow you to step through the executable program, instruction by instruction, displaying registers and memory contents in binary. But it is much easier to use a source-level debugger which allows you to step through a program's source, set breakpoints, print variable values, and perhaps a few other functions such as allowing you to call a function in your program while in the debugger. The problem is how to coordinate two completely different programs, the compiler and the debugger, so that the program can be debugged.

Translating from Source to Executable

The process of compiling a program from human-readable form into the binary form that a processor executes is quite complex, but it essentially involves successively recasting the source into simpler and simpler forms, discarding information at each step until, eventually, the result is the sequence of simple operations, registers, memory addresses, and binary values which the processor actually understands. After all, the processor really doesn't care whether you used object oriented programming, templates, or smart pointers; it only understands a very simple set of operations on a limited number of registers and memory locations containing binary values.

As a compiler reads and parses the source of a program, it collects a variety of information about the program, such as the line numbers where a variable or function is declared or used. Semantic analysis extends this information to fill in details such as the types of variables and arguments of functions. Optimizations may move parts of the program around, combine similar pieces, expand inline functions, or remove parts which are unneeded. Finally, code generation takes this internal representation of the program and generates the actual machine instructions. Often, there is another pass over the machine code to perform what are called "peephole" optimizations that may further rearrange or modify the code, for example, to eliminate duplicate instructions.

All-in-all, the compiler's task is to take the well-crafted and understandable source code and convert it into efficient but essentially unintelligible machine language. The better the compiler achieves the goal of creating tight and fast code, the more likely it is that the result will be difficult to understand.

During this translation process, the compiler collects information about the program which will be useful later when the program is debugged. There are two challenges to doing this well. The first is that in the later parts of this process, it may be difficult for the compiler to relate the changes it is making to the program to the original source code that the programmer wrote. For example, the peephole optimizer may remove an instruction because it was able to switch around the order of a test in code that was generated by an inline function in the instantiation of a C++ template. By the time it gets its metaphorical hands on the program, the optimizer may have a difficult time connecting its manipulations of low-level code to the original source which generated it.

The second challenge is how to describe the executable program and its relationship to the original source with enough detail to allow a debugger to provide the programmer useful information. At the same time, the description has to be concise enough so that it does not take up an extreme amount of space or require significant processor time to interpret. This is where the DWARF Debugging Format comes in: it is a compact representation of the relationship between the executable program and the source in a way that is reasonably efficient for a debugger to process.

Michael Eager is Principal Consultant at Eager Consulting (www.eagercon.com), specializing in development tools for embedded systems. He was a member of PLSIG's DWARF standardization committee and has been Chair of the DWARF Standards Committee since 1999. Michael can be contacted at [email protected].
© Eager Consulting, 2006, 2007, 2012

The Debugging Process
When a programmer runs a program under a debugger, there are some common operations which he or she may want to do. The most common of these are setting a breakpoint to stop the debugger at a particular point in the source, either by specifying the line number or a function name. When this breakpoint is hit, the programmer usually would like to display the values of local or global variables, or the arguments to the function. Displaying the call stack lets the programmer know how the program arrived at the breakpoint in cases where there are multiple execution paths. After reviewing this information, the programmer can ask the debugger to continue execution of the program under test.

There are a number of additional operations that are useful in debugging. For example, it may be helpful to be able to step through a program line by line, either entering or stepping over called functions. Setting a breakpoint at every instance of a template or inline function can be important for debugging C++ programs. It can be helpful to stop just before the end of a function so that the return value can be displayed or changed. Sometimes the programmer may want to bypass execution of a function, returning a known value instead of what the function would have (possibly incorrectly) computed.

There are also data-related operations that are useful. For example, displaying the type of a variable can avoid having to look up the type in the source files. Displaying the value of a variable in different formats, or displaying a memory or register in a specified format is helpful.

There are some operations which might be called advanced debugging functions: for example, being able to debug multithreaded programs or programs stored in read-only memory. One might want a debugger (or some other program analysis tool) to keep track of whether certain sections of code had been executed or not. Some debuggers allow the programmer to call functions in the program being tested. In the not-so-distant past, debugging programs that had been optimized would have been considered an advanced feature.

The task of a debugger is to provide the programmer with a view of the executing program in as natural and understandable fashion as possible, while permitting a wide range of control over its execution. This means that the debugger has to essentially reverse much of the compiler’s carefully crafted transformations, converting the program’s data and state back into the terms that the programmer originally used in the program’s source.

The challenge of a debugging data format, like DWARF, is to make this possible and even easy.
Debugging Formats

There are several debugging formats: stabs, COFF, PE-COFF, OMF, IEEE-695, and two variants¹ of DWARF, to name some common ones. I’m not going to describe these in any detail. The intent here is only to mention them to place the DWARF Debugging Format in context.

The name stabs comes from symbol table strings, since the debugging data were originally saved as strings in Unix’s a.out object file’s symbol table. Stabs encodes the information about a program in text strings. Initially quite simple, stabs has evolved over time into a quite complex, occasionally cryptic and less-than-consistent debugging format. Stabs is neither standardized nor well documented². Sun Microsystems has made a number of extensions to stabs. GCC has made other extensions, while attempting to reverse engineer the Sun extensions. Nonetheless, stabs is still widely used.

COFF stands for Common Object File Format and originated with Unix System V Release 3. Rudimentary debugging information was defined with the COFF format, but since COFF includes support for named sections, a variety of different debugging formats such as stabs have been used with COFF. The most significant problem with COFF is that despite the Common in its name, it isn’t the same in each architecture which uses the format. There are many variations in COFF, including XCOFF (used on IBM RS/6000), ECOFF (used on MIPS and Alpha), and Windows PE-COFF. Documentation of these variants is available to varying degrees but neither the object module format nor the debugging information is standardized.

PE-COFF is the object module format used by Microsoft Windows beginning with Windows 95. It is based on the COFF format and contains both COFF debugging data and Microsoft’s own proprietary CodeView or CV4 debugging data format. Documentation on the debugging format is both sketchy and difficult to obtain.

OMF stands for Object Module Format and is the object file format used in CP/M, DOS and OS/2 systems, as well as a small number of embedded systems. OMF defines public name and line number information for debuggers and can also contain Microsoft CV, IBM PM, or AIX format debugging data. OMF only provides the most rudimentary support for debuggers.

IEEE-695 is a standard object file and debugging format developed jointly by Microtec Research and HP in the late 1980’s for embedded environments. It became an IEEE standard in 1990. It is a very flexible specification, intended to be usable with almost any machine architecture. The debugging format is block structured, which corresponds to the organization of the source better than other formats. Although it is an IEEE standard, in many ways IEEE-695 is more like the proprietary formats. Although the original standard is readily available from IEEE, Microtec Research made extensions to support C++ and optimized code which are poorly documented. The IEEE standard was never revised to incorporate the Microtec Research or other changes. Despite being an IEEE standard, its use is limited to a few small processors.

A Brief History of DWARF

DWARF 1 ─ Unix SVR4 sdb and PLSIG

DWARF³ was developed by Brian Russell, Ph.D., at Bell Labs in 1988 for use with the C compiler and sdb debugger in Unix System V Release 4 (SVR4). The Programming Languages Special Interest Group (PLSIG), part of Unix International (UI), documented the DWARF generated by SVR4 as DWARF Version 1 in 1992. Although the original DWARF had several clear shortcomings, most notably that it was not very compact, the PLSIG decided to standardize the SVR4 format with only minimal modification. It was widely adopted within the embedded sector where it continues to be used today, especially for small processors.

DWARF 2 ─ PLSIG

The PLSIG continued to develop and document extensions to DWARF to address several issues, the most important of which was to reduce the size of debugging data that were generated. There were also additions to support new languages such as the up-and-coming C++ language. DWARF Version 2 was released as a draft standard in 1993.

In an example of the domino theory in action, shortly after PLSIG released the draft standard, fatal flaws were discovered in Motorola's 88000 microprocessor. Motorola pulled the plug on the processor, which in turn resulted in the demise of Open88, a consortium of companies that were developing computers using the 88000. Open88 in turn was a supporter of Unix International, sponsor of PLSIG, which resulted in UI being disbanded. When UI folded, all that remained of the PLSIG was a mailing list and a variety of ftp sites that had various versions of the DWARF 2 draft standard. A final standard was never released.

Since Unix International had disappeared and PLSIG disbanded, several organizations independently decided to extend DWARF 1 and 2. Some of these extensions were specific to a single architecture, but others might be applicable to any architecture. Unfortunately, the different organizations didn’t work together on these extensions. Documentation on the extensions is generally spotty or difficult to obtain. Or as a GCC developer might suggest, tongue firmly in cheek, the extensions were well documented: all you have to do is read the compiler source code. DWARF was well on its way to following COFF and becoming a collection of divergent implementations rather than being an industry standard.

¹ DWARF Version 1 is significantly different from Versions 2 and later.
² In 1992, the author wrote an extensive document describing the stabs generated by Sun Microsystems' compilers. Unfortunately, it was never widely distributed.
³ The name DWARF is something of a pun, since it was developed along with the ELF object file format. The name is an acronym for “Debugging With Arbitrary Record Formats”.
DWARF 3 ─ Free Standards Group

Despite several online discussions about DWARF on the PLSIG email list (which survived under X/Open [later Open Group] sponsorship after UI’s demise), there was little impetus to revise (or even finalize) the document until the end of 1999. At that time, there was interest in extending DWARF to have better support for the HP/Intel IA-64 architecture as well as better documentation of the ABI used by C++ programs. These two efforts separated, and the author took over as Chair for the revived DWARF Committee.

Following more than 18 months of development work and creation of a draft of the DWARF 3 specification, the standardization effort hit what might be called a soft patch. The committee (and this author, in particular) wanted to ensure that the DWARF standard was readily available and to avoid the possible divergence caused by multiple sources for the standard. The DWARF Committee became the DWARF Workgroup of the Free Standards Group in 2003.

Active development and clarification of the DWARF 3 Standard resumed early in 2005 with the goal to resolve any open issues in the standard. A public review draft was released to solicit public comments in October and the final version of the DWARF 3 Standard was released in December, 2005.

DWARF 4 ─ DWARF Debugging Format Committee

After the Free Standards Group merged with Open Source Development Labs (OSDL) in 2007 to form the Linux Foundation, the DWARF Committee returned to independent status and created its own web site at dwarfstd.org. Work began on Version 4 of the DWARF in 2007. This version clarified DWARF expressions, added support for VLIW architectures, improved language support, generalized support for packed data, added a new technique for compressing the debug data by eliminating duplicate type descriptions, and added support for profile-based compiler optimizations, as well as extensive editing of the documentation. The DWARF Version 4 Standard was released in June, 2010, following a public review.

Work on DWARF Version 5 started in February, 2012. This version is expected to be completed in 2014.

DWARF Overview⁴
Most modern programming languages are block structured: each entity (a class definition or a function, for example) is contained within another entity. Each file in a C program may contain multiple data definitions, multiple variable definitions, and multiple functions. Within each C function there may be several data definitions followed by executable statements. A statement may be a compound statement that in turn can contain data definitions and executable statements. This creates lexical scopes, where names are known only within the scope in which they are defined. To find the definition of a particular symbol in a program, you first look in the current scope, then in successive enclosing scopes until you find the symbol. There may be multiple definitions of the same name in different scopes. Compilers very naturally represent a program internally as a tree.

While DWARF is most commonly associated with the ELF object file format, it is independent of the object file format. It can and has been used with other object file formats. All that is necessary is that the different data sections that make up the DWARF data be identifiable in the object file or executable. DWARF does not duplicate information that is contained in the object file, such as identifying the processor architecture or whether the file is written in big-endian or little-endian format.
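The scope lookup described above (search the current scope first, then each enclosing scope in turn) can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular compiler or debugger stores its symbol tables; the `Scope` class, its method names, and the sample symbols are all hypothetical.

```python
class Scope:
    """One lexical scope; `parent` is the enclosing scope (None for the outermost)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.symbols = {}

    def define(self, symbol, description):
        # Record a definition that is visible in this scope and any nested scope.
        self.symbols[symbol] = description

    def lookup(self, symbol):
        # Walk outward from the current scope until a definition is found.
        scope = self
        while scope is not None:
            if symbol in scope.symbols:
                return scope.name, scope.symbols[symbol]
            scope = scope.parent
        return None  # not defined in any enclosing scope


# Hypothetical program: the same name is defined in two different scopes.
file_scope = Scope("file")
file_scope.define("count", "int")

func_scope = Scope("function main", parent=file_scope)
func_scope.define("count", "long")

block_scope = Scope("block in main", parent=func_scope)
block_scope.define("i", "int")

print(block_scope.lookup("i"))
print(block_scope.lookup("count"))
print(file_scope.lookup("count"))
print(block_scope.lookup("missing"))
```

Looking up `count` from the innermost block finds the definition in the enclosing function rather than the one at file level, which is the shadowing behaviour that results from multiple definitions of the same name in different scopes.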
DWARF follows this model in that it is also block structured. Each descriptive entity in DWARF (except for the topmost entry which describes the source file) is contained within a parent entry and may contain children entities. If a node contains multiple entities, they are all siblings, related to each other. The DWARF description of a program is a tree structure, similar to the compiler’s internal tree, where each node can have children or siblings. The nodes may represent types, variables, or functions. This is a compact format where only the information that is needed to describe an aspect of a program is provided. The format is extensible in a uniform fashion, so that a debugger can recognize and ignore an extension, even if it might not understand its meaning. (This is much better than the situation with most other debugging formats where the debugger gets fatally confused attempting to read unrecognized data.) DWARF is also designed to be extensible to describe virtually any procedural programming language on any machine architecture, rather than being bound to only describing one language or one version of a language on a limited range of architectures.

Debugging Information Entry (DIE)

Tags and Attributes

The basic descriptive entity in DWARF is the Debugging Information Entry (DIE). A DIE has a tag, which specifies what the DIE describes, and a list of attributes which fill in details and further describe the entity. A DIE (except for the topmost) is contained in or owned by a parent DIE and may have sibling DIEs or children DIEs. Attributes may contain a variety of values: constants (such as a function name), variables (such as the start address for a function), or references to another DIE (such as for the type of a function’s return value).

Figure 1 shows C's classic hello.c program with a simplified graphical representation of its DWARF description. The topmost DIE represents the compilation unit. It has two “children”, the first is the DIE describing main and the second describing the base type int which is the type of the value returned by main. The subprogram DIE is a child of the compilation unit DIE, while the base type DIE is referenced by the Type attribute in the subprogram DIE. We also talk about a DIE “owning” or “containing” the children DIEs.

hello.c:
1: int main()
2: {
3:     printf("Hello World!\n");
4:     return 0;
5: }

DIE – Compilation Unit
    Dir = /home/dwarf/examples
    Name = hello.c
    LowPC = 0x0
    HighPC = 0x2b
    Producer = GCC

    DIE – Subprogram
        Name = main
        File = hello.c
        Line = 2
        Type = int
        LowPC = 0x0
        HighPC = 0x2b
        External = yes

    DIE – Base Type
        Name = int
        ByteSize = 4
        Encoding = signed integer

Figure 1. Graphical representation of DWARF data

Types of DIEs

DIEs can be split into two general types: those that describe data, including data types, and those that describe functions and other executable code.

Describing Data and Types

Most programming languages have sophisticated descriptions of data. There are a number of built-in data types, pointers, various data structures, and usually ways of creating new data types. Since DWARF is intended to be used with a variety of languages, it abstracts out the basics and provides a representation that can be used for all supported languages. The primary types, built directly on the hardware, are the base types. Other data types are constructed as collections or compositions of these base types.

Base Types

Every programming language defines several basic scalar data types. For example, both C and Java define int and double. While Java provides a complete definition for these types, C only specifies some general characteristics, allowing the compiler to pick the actual specifications that best fit the target processor. Some languages, like Pascal, allow new base types to be defined, for example, an integer type which can hold integer values between 0 and 100. Pascal doesn't specify how this should be implemented. One compiler might implement this as a single byte, another might use a 16-bit integer, a third might implement all integer types as 32-bit values no matter how they are defined.

With DWARF Version 1 and other debugging formats, the compiler and debugger are supposed to share a common understanding about whether an int is 16, 32, or even 64 bits. This becomes awkward when the same hardware can support different size integers or when different compilers make different implementation decisions for the same target processor. These assumptions, often undocumented, make it difficult to have compatibility between different compilers or debuggers, or even between different versions of the same tools.

DWARF base types provide the lowest level mapping between the simple data types and how they are implemented on the target machine's hardware. This makes the definition of int explicit for both Java and C and allows different definitions to be used, possibly even within the same program. Figure 2a shows the DIE which describes int on a typical 32-bit processor. The attributes specify the name (int), an encoding (signed binary integer), and the size in bytes (4). Figure 2b shows a similar definition of int on a 16-bit processor. (In Figure 2, we use the tag and attribute names defined in the DWARF standard, rather than the more informal names used in Figure 1. The names of tags are all prefixed with DW_TAG and the names of attributes are prefixed with DW_AT.)

DW_TAG_base_type
    DW_AT_name = int
    DW_AT_byte_size = 4
    DW_AT_encoding = signed

Figure 2a. int base type on 32-bit processor.

DW_TAG_base_type
    DW_AT_name = int
    DW_AT_byte_size = 2
    DW_AT_encoding = signed

Figure 2b. int base type on 16-bit processor.

The base types allow the compiler to describe almost any mapping between a programming language scalar type and how it is actually implemented on the processor. Figure 3 describes a 16-bit integer value that is stored in the upper 16 bits of a four byte word. In this base type, there is a bit size attribute that specifies that the value is 16 bits wide and an offset from the high-order bit of zero⁵.

DW_TAG_base_type
    DW_AT_name = word
    DW_AT_byte_size = 4
    DW_AT_bit_size = 16
    DW_AT_bit_offset = 0
    DW_AT_encoding = signed

Figure 3. 16-bit word type stored in the top 16 bits of a 32-bit word.

The DWARF base types allow a number of different encodings to be described, including address, character, fixed point, floating point, and packed decimal, in addition to binary integers. There is still a little ambiguity remaining: for example, the actual encoding for a floating point number is not specified; this is determined by the encoding that the hardware actually supports. In a processor which supports both 32-bit and 64-bit floating point values following the IEEE-754 standard, the encodings represented by “float” are different depending on the size of the value.

⁴ In the remainder of this paper, we will be discussing DWARF Version 2 and later versions. Unless otherwise noted, all descriptions apply to DWARF Versions 2 through 4.
⁵ This is a real-life example taken from an implementation of Pascal that passed 16-bit integers in the top half of a word on the stack.

Type Composition

A named variable is described by a DIE which has a variety of attributes, one of which is a reference to a type definition. Figure 4 describes an integer variable named x. (For the moment we will ignore the other information that is usually contained in a DIE describing a variable.) The base type for int describes it as a signed binary integer occupying four bytes.
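The arrangement just described for int x, a variable DIE whose type attribute refers to a base type DIE, can be modeled roughly as follows. This is a minimal sketch in Python, not the on-disk encoding: the `DIE` class and the helper names are hypothetical, the source file name is invented, and the offsets simply reuse the small labels from the figures rather than real byte offsets within a compilation unit.

```python
from dataclasses import dataclass, field


@dataclass
class DIE:
    """A simplified Debugging Information Entry: a tag, attributes, and children."""
    offset: int                      # stand-in for the DIE's offset in its unit
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def index_by_offset(die, table=None):
    # Walk the tree once so that references (stored as offsets) can be resolved.
    if table is None:
        table = {}
    table[die.offset] = die
    for child in die.children:
        index_by_offset(child, table)
    return table


def describe_variable(var, table):
    # Follow the variable's DW_AT_type reference to the DIE describing its type.
    base = table[var.attrs["DW_AT_type"]]
    return "{} {}: {}, {} bytes".format(
        base.attrs["DW_AT_name"],
        var.attrs["DW_AT_name"],
        base.attrs["DW_AT_encoding"],
        base.attrs["DW_AT_byte_size"],
    )


int_type = DIE(1, "DW_TAG_base_type",
               {"DW_AT_name": "int", "DW_AT_byte_size": 4,
                "DW_AT_encoding": "signed"})
var_x = DIE(2, "DW_TAG_variable", {"DW_AT_name": "x", "DW_AT_type": 1})

# The compilation unit owns both DIEs; "example.c" is a made-up file name.
unit = DIE(0, "DW_TAG_compile_unit", {"DW_AT_name": "example.c"},
           [int_type, var_x])

table = index_by_offset(unit)
print(describe_variable(var_x, table))
```

Note that ownership (the `children` list) and type reference (the `DW_AT_type` attribute) are separate relationships: both DIEs are siblings owned by the compilation unit, and the variable only points at its type.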
The DW_TAG_variable DIE for x gives its name and a type attribute, which refers to the base type DIE. For clarity, the DIEs are labeled sequentially in this and following examples; in the actual DWARF data, a reference to a DIE is the offset from the start of the compilation unit where the DIE can be found. References can be to previously defined DIEs, as in Figure 4, or to DIEs which are defined later. Once we have created a base type DIE for int, any variable in the same compilation can reference the same DIE⁶.

<1>: DW_TAG_base_type
     DW_AT_name = int
     DW_AT_byte_size = 4
     DW_AT_encoding = signed
<2>: DW_TAG_variable
     DW_AT_name = x
     DW_AT_type = <1>

Figure 4. DWARF description of “int x”.

DWARF uses the base types to construct other data type definitions by composition. A new type is created as a modification of another type. For example, Figure 5 shows a pointer to an int on our typical 32-bit machine. This DIE defines a pointer type, specifies that its size is four bytes, and in turn references the int base type. Other DIEs describe the const or volatile attributes, C++ reference type, or C restrict types. These type DIEs can be chained together to describe more complex data types, such as “const char **argv” which is described in Figure 6.

<1>: DW_TAG_variable
     DW_AT_name = px
     DW_AT_type = <2>
<2>: DW_TAG_pointer_type
     DW_AT_byte_size = 4
     DW_AT_type = <3>
<3>: DW_TAG_base_type
     DW_AT_name = int
     DW_AT_byte_size = 4
     DW_AT_encoding = signed

Figure 5. DWARF description of “int *px”.

<1>: DW_TAG_variable
     DW_AT_name = argv
     DW_AT_type = <2>
<2>: DW_TAG_pointer_type
     DW_AT_byte_size = 4
     DW_AT_type = <3>
<3>: DW_TAG_pointer_type
     DW_AT_byte_size = 4
     DW_AT_type = <4>
<4>: DW_TAG_const_type
     DW_AT_type = <5>
<5>: DW_TAG_base_type
     DW_AT_name = char
     DW_AT_byte_size = 1
     DW_AT_encoding = unsigned

Figure 6. DWARF description of “const char **argv”.

⁶ Some compilers define a common set of type definitions at the start of every compilation unit. Others only generate the definitions for the types which are actually referenced in the program. Either is valid.

Array

Array types are described by a DIE which defines whether the data is stored in column major order (as in Fortran) or in row major order (as in C or C++). The index for the array is represented by a subrange type that gives the lower and upper bounds of each dimension. This allows DWARF to describe both C style arrays, which always have zero as the lowest index, as well as arrays in Pascal or Ada, which can have any value for the low and high bounds.

Structures, Classes, Unions, and Interfaces

Most languages allow the programmer to group data together into structures (called struct in C and C++, class in C++, and record in Pascal). Each of the components of the structure generally has a unique name and may have a different type, and each occupies its own space. C and C++ have the union and Pascal has the variant record that are similar to a structure except that the components occupy the same memory locations. The Java interface has a subset of the properties of a C++ class, since it may only have abstract methods and constant data members.

Although each language has its own terminology (C++ calls the components of a class members while Pascal calls them fields) the underlying organization can be described in DWARF. True to its heritage, DWARF uses the C/C++/Java terminology and has DIEs which describe struct, union, class, and interface. We'll describe the class DIE here, but the others have essentially the same organization.

The DIE for a class is the parent of the DIEs which describe each of the class's members. Each class has a name and possibly other attributes. If the size of an instance is known at compile time, then it will have a byte size attribute. Each of these descriptions looks very much like the description of a simple variable, although there may be some additional attributes. For example, C++ allows the programmer to specify whether a member is public, private, or protected. These are described with the accessibility attribute.

C and C++ allow bit fields as class members that are not simple variables. These are described with a bit offset from the start of the class instance to the leftmost bit of the bit field and a bit size that says how many bits the member occupies.

Variables

Variables are generally pretty simple. They have a name which represents a chunk of memory (or register) that can contain some kind of a value. The kind of values that the variable can contain, as well as restrictions on how it can be modified (e.g., whether it is const) are described by the type of the variable.

What distinguishes a variable is where its value is stored and its scope. The scope of a variable defines where the variable is known within the program and is, to some degree, determined by where the variable is declared. In C, variables declared within a function or block have function or block scope. Those declared outside a function have either global or file scope. This allows different variables with the same name to be defined in different files without conflicting. It also allows different functions or compilations to reference the same variable. DWARF documents where the variable is declared in the source file with a (file, line, column) triplet.

DWARF splits variables into three categories: constants, formal parameters, and variables. A constant is used with languages that have true named constants as part of the language, such as Ada parameters. (C
5
Michael J. Eager
does not have constants as part of the lan
guage. Declaring a variable const just says that you cannot modify the variable with
out using an explicit cast.) A formal param
eter represents values passed to a function. We'll come back to that a bit later. adding a fixed offset to a frame pointer. In other cases, the variable may be stored in a register. Other variables may require some
what more complicated computations to lo
cate the data. A variable that is a member of a C++ class may require more complex computations to determine the location of Some languages, like C or C++ (but not the base class within a derived class. Pascal), allow a variable to be declared without defining it. This implies that there should be a real definition of the variable Location Expressions
somewhere else, hopefully somewhere that WARF provides a very general scheme the compiler or debugger can find. A DIE to describe how to locate the data rep
describing a variable declaration provides a resented by a variable. A DWARF location description of the variable without actually expression contains a sequence of opera
telling the debugger where it is. tions which tell a debugger how to locate D
Most variables have a location attribute the data. Figure 7 shows DIEs for three that describes where the variable is stored. variables named a, b, and c. Variable a has In the simplest of cases, a variable is stored a fixed location in memory, variable b is in register 0, and variable c is at offset –12 within the current fig7.c:
function's stack frame. Although 1: int a;
a was declared first, the DIE to 2: void foo()
describe it is generated later, af
3: {
ter all functions. The actual lo
4:
register int b;
5:
int c;
cation of a will be filled in by 6: }
the linker.
<1>:
<2>:
<3>:
<4>:
<5>:
DW_TAG_subprogram
DW_AT_name = foo
DW_TAG_variable
DW_AT_name = b
DW_AT_type = <4>
DW_AT_location = (DW_OP_reg0)
DW_TAG_variable
DW_AT_name = c
DW_AT_type = <4>
DW_AT_location =
(DW_OP_fbreg: -12)
DW_TAG_base_type
DW_AT_name = int
DW_AT_byte_size = 4
DW_AT_encoding = signed
DW_TAG_variable
DW_AT_name = a
DW_AT_type = <4>
DW_AT_external = 1
DW_AT_location = (DW_OP_addr: 0)
The DWARF location expres
sion can contain a sequence of operators and values that are evaluated by a simple stack ma
chine. This can be an arbitrarily complex computation, with a wide range of arithmetic opera
tions, tests and branches within the expression, calls to evaluate other location expressions, and accesses to the processor's mem
ory or registers. There are even operations used to describe data which is split up and stored in different locations, such as a structure where some data are stored in memory and some are stored in registers. Figure 7. DWARF description of variables
a, b, and c.
Although this great flexibili
ty is seldom used in practice, the location expression should allow the location of a variable's 7
in memory and has a fixed address . But data to be described no matter how com
many variables, such as those declared plex the language definition or how clever within a C function, are dynamically allo
the compiler's optimizations.
cated and locating them requires some (usually simple) computation. For example, a local variable may be allocated on the Describing
stack, and locating it may be as simple as Executable Code
7
Well, maybe not a fixed address, but one that is a fixed offset from where the executable is load
ed. The loader relocates references to addresses within an executable so that at runtime the loca
tion attribute contains the actual memory ad
dress. In an object file, the location attribute is the offset, along with an appropriate relocation table entry. Introduction to the DWARF Debugging Format
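Looking back at the location expressions of Figure 7, the stack-machine evaluation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression is given as a list of symbolic (operator, operand) tuples rather than the binary encoding, only a handful of operators are handled, and the names Location, evaluate, frame_base, and read_memory are hypothetical helpers invented for this sketch, not part of DWARF or of any debugger.

```python
from dataclasses import dataclass


@dataclass
class Location:
    kind: str   # "memory" or "register"
    value: int  # memory address, or register number


def evaluate(expr, frame_base, read_memory=None):
    """Evaluate a symbolic location expression on a small stack machine.

    expr        -- list of tuples such as ("DW_OP_fbreg", -12)
    frame_base  -- assumed to be supplied by the caller (in a real debugger
                   it would come from the function's frame base attribute)
    read_memory -- optional callback used only by DW_OP_deref
    """
    stack = []
    for op, *args in expr:
        suffix = op[len("DW_OP_reg"):] if op.startswith("DW_OP_reg") else ""
        if suffix.isdigit():
            # The variable lives in a register, not in memory.
            return Location("register", int(suffix))
        if op == "DW_OP_addr":
            stack.append(args[0])               # push a fixed address
        elif op == "DW_OP_fbreg":
            stack.append(frame_base + args[0])  # push frame base + offset
        elif op == "DW_OP_plus_uconst":
            stack.append(stack.pop() + args[0])
        elif op == "DW_OP_deref":
            if read_memory is None:
                raise ValueError("DW_OP_deref needs a read_memory callback")
            stack.append(read_memory(stack.pop()))
        else:
            raise ValueError(f"operator not handled in this sketch: {op}")
    # The value left on top of the stack is the variable's address.
    return Location("memory", stack[-1])


# The three variables of Figure 7, with an arbitrary frame base.
figure7 = {
    "a": [("DW_OP_addr", 0)],
    "b": [("DW_OP_reg0",)],
    "c": [("DW_OP_fbreg", -12)],
}
for name, expr in figure7.items():
    print(name, evaluate(expr, frame_base=0x1000))
```

A real evaluator would decode the operators from their byte encoding and support the full operator set, including the branches and calls mentioned above; the loop structure stays the same.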
Functions and Subprograms
DWARF treats functions that return values and subroutines that do not as variations of the same thing. Drifting slightly away from its roots in C terminology, DWARF describes both with a subprogram DIE. This DIE has a name, a source location triplet, and an attribute which indicates whether the subprogram is external, that is, visible outside the current compilation.
A subprogram DIE has attributes that give the low and high memory addresses that the subprogram occupies, if it is contiguous, or a list of memory ranges if the function does not occupy a contiguous set of memory addresses. The low PC address is assumed to be the entry point for the routine unless another one is explicitly specified.
The value that a function returns is given by the type attribute. Subroutines that do not return values (like C void functions) do not have this attribute. DWARF doesn't describe the calling conventions for a function; that is defined in the Application Binary Interface (ABI) for the particular architecture. There may be attributes that help a debugger to locate the subprogram's data or to find the current subprogram's caller. The return address attribute is a location expression that specifies where the address of the caller is stored. The frame base attribute is a location expression that computes the address of the stack frame for the function. These are useful since some of the most common optimizations that a compiler might do are to eliminate instructions that explicitly save the return address or frame pointer.
The subprogram DIE owns DIEs that describe the subprogram. The parameters that may be passed to a function are represented by variable DIEs which have the variable parameter attribute. If the parameter is optional or has a default value, these are represented by attributes. The DIEs for the parameters are in the same order as the argument list for the function, but there may be additional DIEs interspersed, for example, to define types used by the parameters.
A function may define variables that may be local or global. The DIEs for these variables follow the parameter DIEs. Many languages allow nesting of lexical blocks. These are represented by lexical block DIEs which, in turn, may own variable DIEs or nested lexical block DIEs.
Here is a somewhat longer example. Figure 8a shows the source for strndup.c, a function in gcc that duplicates a string. Figure 8b lists the DWARF generated for this file. As in previous examples, the source line information (the file and line attributes) is not shown.
In Figure 8b, DIE <2> shows the definition of size_t which is a typedef of unsigned int. This allows a debugger to display the type of formal argument n as a size_t, while displaying its value as an unsigned integer. DIE <5> describes the function strndup. This has a pointer to its sibling, DIE <10>; all of the following DIEs are children of the Subprogram DIE. The function returns a pointer to char, described in DIE <10>. DIE <5> also describes the subroutine as external and prototyped and gives the low and high PC values for the routine. The formal parameters and local variables of the routine are described in DIEs <6> to <9>.

    strndup.c:
        #include "ansidecl.h"
        #include <stddef.h>

        extern size_t strlen (const char*);
        extern PTR malloc (size_t);
        extern PTR memcpy (PTR, const PTR, size_t);

        char *
        strndup (const char *s, size_t n)
        {
          char *result;
          size_t len = strlen (s);

          if (n < len)
            len = n;

          result = (char *) malloc (len + 1);
          if (!result)
            return 0;

          result[len] = '\0';
          return (char *) memcpy (result, s, len);
        }

Figure 8a. Source for strndup.c.

    <1>:  DW_TAG_base_type
            DW_AT_name = int
            DW_AT_byte_size = 4
            DW_AT_encoding = signed
    <2>:  DW_TAG_typedef
            DW_AT_name = size_t
            DW_AT_type = <3>
    <3>:  DW_TAG_base_type
            DW_AT_name = unsigned int
            DW_AT_byte_size = 4
            DW_AT_encoding = unsigned
    <4>:  DW_TAG_base_type
            DW_AT_name = long int
            DW_AT_byte_size = 4
            DW_AT_encoding = signed
    <5>:  DW_TAG_subprogram
            DW_AT_sibling = <10>
            DW_AT_external = 1
            DW_AT_name = strndup
            DW_AT_prototyped = 1
            DW_AT_type = <10>
            DW_AT_low_pc = 0
            DW_AT_high_pc = 0x7b
    <6>:  DW_TAG_formal_parameter
            DW_AT_name = s
            DW_AT_type = <12>
            DW_AT_location = (DW_OP_fbreg: 0)
    <7>:  DW_TAG_formal_parameter
            DW_AT_name = n
            DW_AT_type = <2>
            DW_AT_location = (DW_OP_fbreg: 4)
    <8>:  DW_TAG_variable
            DW_AT_name = result
            DW_AT_type = <10>
            DW_AT_location = (DW_OP_fbreg: -28)
    <9>:  DW_TAG_variable
            DW_AT_name = len
            DW_AT_type = <2>
            DW_AT_location = (DW_OP_fbreg: -24)
    <10>: DW_TAG_pointer_type
            DW_AT_byte_size = 4
            DW_AT_type = <11>
    <11>: DW_TAG_base_type
            DW_AT_name = char
            DW_AT_byte_size = 1
            DW_AT_encoding = signed char
    <12>: DW_TAG_pointer_type
            DW_AT_byte_size = 4
            DW_AT_type = <13>
    <13>: DW_TAG_const_type
            DW_AT_type = <11>

Figure 8b. DWARF description for strndup.c.

Compilation Unit
Most interesting programs consist of more than a single file. Each source file that makes up a program is compiled independently and then linked together with system libraries to make up the program. DWARF calls each separately compiled source file a compilation unit.
The DWARF data for each compilation unit starts with a Compilation Unit DIE. This DIE contains general information about the compilation, including the directory and name of the source file, the programming language used, a string which identifies the producer of the DWARF data, and offsets into the DWARF data sections to help locate the line number and macro information.
If the compilation unit is contiguous (i.e., it is loaded into memory in one piece) then there are values for the low and high memory addresses for the unit. This makes it easier for a debugger to identify which compilation unit created the code at a particular memory address. If the compilation unit is not contiguous, then a list of the memory addresses that the code occupies is provided by the compiler and linker.
The Compilation Unit DIE is the parent of all of the DIEs that describe the compilation unit. Generally, the first DIEs will describe data types, followed by global data, then the functions that make up the source file. The DIEs for variables and functions are in the same order in which they appear in the source file.

Data encoding
Conceptually, the DWARF data that describes a program is a tree. Each DIE may have a sibling and maybe several children DIEs. Each of the DIEs has a type (called its TAG) and a number of attributes. Each attribute is represented by an attribute type and a value. Unfortunately, this is not a very dense encoding. Without compression, the DWARF data is unwieldy.
DWARF offers several ways to reduce the size of the data which needs to be saved with the object file. The first is to "flatten" the tree by saving it in prefix order. Each type of DIE is defined to either have children or not. If the DIE cannot have children, the next DIE is its sibling. If the DIE can have children, then the next DIE is its first child. The remaining children are represented as the siblings of this first child. This way, links to the sibling or child DIEs can be eliminated. If the compiler writer thinks that it might be useful to be able to jump from one DIE to its sibling without stepping through each of its children DIEs (for example, to jump to the next function in a compilation) then a sibling attribute can be added to the DIE.
A second scheme to compress the data is to use abbreviations. Although DWARF allows great flexibility in which DIEs and attributes it may generate, most compilers only generate a limited set of DIEs, all of which have the same set of attributes. Instead of storing the value of the TAG and the attribute-value pairs, only an index into a table of abbreviations is stored, followed by the attribute values. Each abbreviation gives the TAG value, a flag indicating whether the DIE has children, and a list of attributes with the type of value it expects. Figure 9 shows the abbreviation for the formal parameter DIE used in Figure 8b. DIE <6> in Figure 8b is actually encoded as shown [8]. This is a significant reduction in the amount of data that needs to be saved at some expense in added complexity.
[8] The encoded entry also includes the file and line values which are not shown in Fig. 8b.
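The prefix-order flattening described above can be undone with a short recursive walk. The following Python sketch rebuilds the nesting from a flat list, assuming each entry is a (tag, has_children) pair taken from its abbreviation. One detail goes beyond the text here and is an assumption of the sketch: a chain of sibling DIEs is taken to end with a null entry, represented below as None. The function name build_tree and the sample list are invented for illustration.

```python
def build_tree(entries):
    """Rebuild nested DIEs from a prefix-order (flattened) list.

    entries -- list of (tag, has_children) tuples, with None marking the
               null entry that ends a chain of siblings (an assumption of
               this sketch).
    Returns a list of (tag, children) tuples.
    """
    pos = 0

    def parse_siblings():
        nonlocal pos
        siblings = []
        while pos < len(entries):
            entry = entries[pos]
            pos += 1
            if entry is None:
                break  # end of this sibling chain
            tag, has_children = entry
            # If the DIE can have children, the next DIE is its first child;
            # otherwise the next DIE is simply its sibling.
            children = parse_siblings() if has_children else []
            siblings.append((tag, children))
        return siblings

    return parse_siblings()


# A cut-down version of the structure in Figure 8b.
flat = [
    ("DW_TAG_compile_unit", True),
    ("DW_TAG_base_type", False),
    ("DW_TAG_subprogram", True),
    ("DW_TAG_formal_parameter", False),
    ("DW_TAG_variable", False),
    None,                          # ends the children of the subprogram
    ("DW_TAG_pointer_type", False),
    None,                          # ends the children of the compile unit
]
tree = build_tree(flat)
```

Because the has-children flag comes from the abbreviation, no child or sibling links need to be stored. The cost is that skipping a subtree means walking it, which is what the optional sibling attribute avoids.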
    Abbrev 5:  DW_TAG_formal_parameter  [no children]
        DW_AT_name         DW_FORM_string
        DW_AT_decl_file    DW_FORM_data1
        DW_AT_decl_line    DW_FORM_data1
        DW_AT_type         DW_FORM_ref4
        DW_AT_location     DW_FORM_block1

    Encoded form:  05 7300 01 29 0000010c 9100 00
        05          abbreviation 5
        7300        "s"
        01          file 1
        29          line 41
        0000010c    type DIE offset
        9100        location (fbreg + 0)
        00          terminating NUL

Figure 9. Abbreviation entry and encoded form.

Less commonly used are features of DWARF Version 3 and 4 which allow references from one compilation unit to the DWARF data stored in another compilation unit or in a shared library. Many compilers generate the same abbreviation table and base types for every compilation, independent of whether the compilation actually uses all of the abbreviations or types. These can be saved in a shared library and referenced by each compilation unit, rather than being duplicated in each.

Other DWARF Data
Line Number Table
The DWARF line table contains the mapping between memory addresses that contain the executable code of a program and the source lines that correspond to these addresses. In the simplest form, this can be looked at as a matrix with one column containing the memory addresses and another column containing the source triplet (file, line, and column) for that address. If you want to set a breakpoint at a particular line, the table gives you the memory address to store the breakpoint instruction. Conversely, if your program has a fault (say, using a bad pointer) at some location in memory, you can look for the source line that is closest to the memory address.
DWARF has extended this with added columns to convey additional information about a program. As a compiler optimizes the program, it may move instructions around or remove them. The code for a given source statement may not be stored as a sequence of machine instructions, but may be scattered and interleaved with the instructions for other nearby source statements. It may be useful to identify the end of the code which represents the prolog of a function or the beginning of the epilog, so that the debugger can stop after all of the arguments to a function have been loaded or before the function returns. Some processors can execute more than one instruction set, so there is another column that indicates which instruction set is stored at the specified machine location.
As you might imagine, if this table were stored with one row for each machine instruction, it would be huge. DWARF compresses this data by encoding it as a sequence of instructions called a line number program [9]. These instructions are interpreted by a simple finite state machine to recreate the complete line number table.
The finite state machine is initialized with a set of default values. Each row in the line number table is generated by executing one or more of the opcodes of the line number program. The opcodes are generally quite simple: for example, add a value to either the machine address or to the line number, set the column number, or set a flag which indicates that the memory address represents the start of a source statement, the end of the function prolog, or the start of the function epilog. A set of special opcodes combine the most common operations (incrementing the memory address and either incrementing or decrementing the source line number) into a single opcode. Finally, if a row of the line number table has the same source triplet as the previous row, then no instructions are generated for this row in the line number program.
Figure 10 lists the line number table for strndup.c. Notice that only the machine addresses that represent the beginning instruction of a statement are stored. The compiler did not identify the basic blocks in this code, the end of the prolog or the start of the epilog to the function. This table is encoded in just 31 bytes in the line number program.
[9] Calling this a line number program is something of a misnomer. The program describes much more than just line numbers, such as instruction set, beginning of basic blocks, end of function prolog, etc.

    Address  File  Line  Col  Stmt  Block  End  Prolog  Epilog  ISA
    0x0      0     42    0    yes   no     no   no      no      0
    0x9      0     44    0    yes   no     no   no      no      0
    0x1a     0     44    0    yes   no     no   no      no      0
    0x24     0     46    0    yes   no     no   no      no      0
    0x2c     0     47    0    yes   no     no   no      no      0
    0x32     0     49    0    yes   no     no   no      no      0
    0x41     0     50    0    yes   no     no   no      no      0
    0x47     0     51    0    yes   no     no   no      no      0
    0x50     0     53    0    yes   no     no   no      no      0
    0x59     0     54    0    yes   no     no   no      no      0
    0x6a     0     54    0    yes   no     no   no      no      0
    0x73     0     55    0    yes   no     no   no      no      0
    0x7b     0     56    0    yes   no     yes  no      no      0

    File 0: strndup.c
    File 1: stddef.h

Figure 10. Line Number Table for strndup.c.

Macro Information
Most debuggers have a very difficult time displaying and debugging code which has macros. The user sees the original source file, with the macros, while the code corresponds to whatever the macros generated.
DWARF includes the description of the macros defined in the program. This is quite rudimentary information, but can be used by a debugger to display the values for a macro or possibly translate the macro into the corresponding source language.

Call Frame Information
Every processor has a certain way of calling functions and passing arguments, usually defined in the ABI. In the simplest case, this is the same for each function and the debugger knows exactly how to find the argument values and the return address for the function.
For some processors, there may be different calling sequences depending on how the function is written, for example, if there are more than a certain number of arguments. There may be different calling sequences depending on operating systems. Compilers will try to optimize the calling sequence to make code both smaller and faster. One common optimization is having a simple function which doesn't call any others (a leaf function) use its caller's stack frame instead of creating its own. Another optimization may be to eliminate a register which points to the current call frame. Some registers may be preserved across the call while others are not. While it may be possible for the debugger to puzzle out all the possible permutations in calling sequence or optimizations, it is both tedious and error-prone. A small change in the optimizations and the debugger may no longer be able to walk the stack to the calling function.
The DWARF Call Frame Information (CFI) provides the debugger with enough information about how a function is called so that it can locate each of the arguments to the function, locate the current call frame, and locate the call frame for the calling function. This information is used by the debugger to "unwind the stack," locating the previous function, the location where the function was called, and the values passed.
Like the line number table, the CFI is encoded as a sequence of instructions that are interpreted to generate a table. There is one row in this table for each address that contains code. The first column contains the machine address while the subsequent columns contain the values of the machine registers when the instruction at that address is executed. Like the line number table, if this table were actually created it would be huge. Luckily, very little changes between two machine instructions, so the CFI encoding is quite compact.

Variable length data
Integer values are used throughout DWARF to represent everything from offsets into data sections to sizes of arrays or structures. In most cases, it isn't possible to place a bound on the size of these values. In a classic data structure each of these values would be represented using the default integer size. Since most values can be represented in only a few bits, this means that the data consists mostly of zeros [10].
DWARF defines a variable length integer, called Little Endian Base 128 (LEB128 or more commonly ULEB for unsigned values and SLEB for signed values), which compresses these integer values. Since the low-order bits contain the data and high-order bits consist of all zeros or ones, LEB values chop off the low-order seven bits of the value. If the remaining bits are all zero or one (sign-extension bits), this is the encoded value. Otherwise, set the high-order bit to one, output this byte, and go on to the next seven low-order bits.
[10] An example of this may be seen in the relocation directory in an object file, where file offset and relocation values are represented by integers. Most values have leading zeros.

Shrinking DWARF data
The encoding schemes used by DWARF significantly reduce the size of the debugging information compared to an unencoded format like DWARF Version 1. Unfortunately, with many programs the amount of debugging data generated by the compiler can become quite large, frequently much larger than the executable code and data.
DWARF offers ways to further reduce the size of the debugging data. Most strings in the DWARF debugging data are actually references into a separate .debug_str section. Duplicate strings can be eliminated when generating this section. Potentially, a linker can merge the .debug_str sections from several compilations into a single, smaller string section.
Many programs contain declarations which are duplicated in each compilation unit. For example, debugging data describing many (perhaps thousands) declarations of C++ template functions may be repeated in each compilation. These repeated descriptions can be saved in separate compilation units in uniquely named sections. The linker can use COMDAT (common data) techniques to eliminate the duplicate sections.
Many programs reference a large number of include files which contain many type definitions, resulting in DWARF data which contains thousands of DIEs for these types. A compiler can reduce the size of this data by only generating DWARF for the types which are actually used in the compilation. With DWARF Version 4, type definitions can be saved into a separate .debug_types section. The compilation unit contains a DIE which references this separate type unit and a unique 64-bit signature for these types. A linker can recognize compilations which define the same type units and eliminate the duplicates.

ELF sections
While DWARF is defined in a way that allows it to be used with any object file format, it's most often used with ELF. Each of the different kinds of DWARF data is stored in its own section. The names of these sections all start with ".debug_". For improved efficiency, most references to DWARF data use an offset from the start of the data for the current compilation. This avoids the need to relocate the debugging data, which speeds up program loading and debugging.
The ELF sections and their contents are

    .debug_abbrev     Abbreviations used in the .debug_info section
    .debug_aranges    A mapping between memory address and compilation
    .debug_frame      Call Frame Information
    .debug_info       The core DWARF data containing DIEs
    .debug_line       Line Number Program
    .debug_loc        Location descriptions
    .debug_macinfo    Macro descriptions
    .debug_pubnames   A lookup table for global objects and functions
    .debug_pubtypes   A lookup table for global types
    .debug_ranges     Address ranges referenced by DIEs
    .debug_str        String table used by .debug_info
    .debug_types      Type descriptions

Summary
So there you have it: DWARF in a nutshell. Well, not quite a nutshell. The basic concepts for the DWARF debug information are straightforward. A program is described as a tree with nodes representing the various functions, data and types in the source in a compact language- and machine-independent fashion. The line table provides the mapping between the executable instructions and the source that generated them. The CFI describes how to unwind the stack.
There is quite a bit of subtlety in DWARF as well, given that it needs to express the many different nuances for a wide range of programming languages and different machine architectures. Future directions for DWARF are to improve the description of optimized code so that debuggers can better navigate the code which advanced compiler optimizations generate.
The complete DWARF Version 4 Standard is available for download without charge at the DWARF website (dwarfstd.org). There is also a mailing list for questions and discussion about DWARF. Instructions on registering for the mailing list are also on the website.
Acknowledgements
I want to thank Chris Quenelle of Sun Microsystems and Ron Brender, formerly of HP, for their comments and advice about a previous version of this paper. Thanks also to Susan Heimlich for her many editorial comments.
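As a worked companion to the variable length data discussion, the LEB128 scheme can be sketched in Python as follows. The function names are invented for this illustration, and the code is a minimal sketch of the byte-at-a-time procedure described in the text rather than a reference implementation: take the low-order seven bits, stop if what remains is only zeros (or only sign-extension bits for signed values), otherwise set the high-order bit and continue.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    if value < 0:
        raise ValueError("ULEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # chop off the low-order seven bits
        value >>= 7
        if value == 0:
            out.append(byte)         # nothing left: last byte, high bit clear
            return bytes(out)
        out.append(byte | 0x80)      # more to come: set the high-order bit


def encode_sleb128(value):
    """Encode a signed integer as signed LEB128 bytes."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                  # Python's >> keeps the sign
        sign_bit_set = bool(byte & 0x40)
        # Done when the remaining bits are pure sign extension of this byte.
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)


def decode_uleb128(data):
    """Decode unsigned LEB128; returns (value, number_of_bytes_used)."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result, index + 1
    raise ValueError("truncated ULEB128 value")


print(encode_uleb128(41).hex(), encode_uleb128(128).hex(), encode_sleb128(-12).hex())
```

Small values such as the line number 41 from Figure 9 fit in a single byte, which is where the size savings come from; only the occasional large value pays for extra bytes.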
Generating DWARF with GCC
It's very simple to generate DWARF with gcc. Simply specify the -g option to generate debugging information. The ELF sections can be displayed using objdump with the -h option.

    $ gcc -g -c strndup.c
    $ objdump -h strndup.o

    strndup.o:     file format elf32-i386

    Sections:
    Idx Name            Size      VMA       LMA       File off  Algn
      0 .text           0000007b  00000000  00000000  00000034  2**2
                        CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
      1 .data           00000000  00000000  00000000  000000b0  2**2
                        CONTENTS, ALLOC, LOAD, DATA
      2 .bss            00000000  00000000  00000000  000000b0  2**2
                        ALLOC
      3 .debug_abbrev   00000073  00000000  00000000  000000b0  2**0
                        CONTENTS, READONLY, DEBUGGING
      4 .debug_info     00000118  00000000  00000000  00000123  2**0
                        CONTENTS, RELOC, READONLY, DEBUGGING
      5 .debug_line     00000080  00000000  00000000  0000023b  2**0
                        CONTENTS, RELOC, READONLY, DEBUGGING
      6 .debug_frame    00000034  00000000  00000000  000002bc  2**2
                        CONTENTS, RELOC, READONLY, DEBUGGING
      7 .debug_loc      0000002c  00000000  00000000  000002f0  2**0
                        CONTENTS, READONLY, DEBUGGING
      8 .debug_pubnames 0000001e  00000000  00000000  0000031c  2**0
                        CONTENTS, RELOC, READONLY, DEBUGGING
      9 .debug_aranges  00000020  00000000  00000000  0000033a  2**0
                        CONTENTS, RELOC, READONLY, DEBUGGING
     10 .comment        0000002a  00000000  00000000  0000035a  2**0
                        CONTENTS, READONLY
     11 .note.GNU-stack 00000000  00000000  00000000  00000384  2**0
                        CONTENTS, READONLY
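To see how much of an object file is debugging data, the section listing can be summarized with a few lines of Python. This sketch assumes the text has the same row layout as the objdump -h listing shown here (index, name, size in hexadecimal, then four more fields); the function name debug_section_sizes and the shortened sample text are made up for the example.

```python
def debug_section_sizes(objdump_text):
    """Return {section_name: size_in_bytes} for the .debug_* sections
    found in the text output of `objdump -h`."""
    sizes = {}
    for line in objdump_text.splitlines():
        fields = line.split()
        # Section rows have seven fields: Idx Name Size VMA LMA File-off Algn.
        # The flag lines (CONTENTS, ...) and the header do not start with a number.
        if len(fields) == 7 and fields[0].isdigit() and fields[1].startswith(".debug_"):
            sizes[fields[1]] = int(fields[2], 16)
    return sizes


# A shortened copy of the listing for strndup.o.
sample = """\
Idx Name            Size      VMA       LMA       File off  Algn
  0 .text           0000007b  00000000  00000000  00000034  2**2
                    CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 .debug_abbrev   00000073  00000000  00000000  000000b0  2**0
                    CONTENTS, READONLY, DEBUGGING
  4 .debug_info     00000118  00000000  00000000  00000123  2**0
                    CONTENTS, RELOC, READONLY, DEBUGGING
  5 .debug_line     00000080  00000000  00000000  0000023b  2**0
                    CONTENTS, RELOC, READONLY, DEBUGGING
"""
sizes = debug_section_sizes(sample)
print(sizes, "total:", sum(sizes.values()))
```

Even in this small object file the three debug sections in the sample are several times larger than the 0x7b bytes of .text, in line with the observation that debugging data is frequently much larger than the code.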
Printing DWARF using Readelf
Readelf can display and decode the DWARF data in an object or executable file. The options are
    -w                 display all DWARF sections
    -w[liaprmfFso]     display specific sections
        l    line table
        i    debug info
        a    abbreviation table
        p    public names
        r    ranges
        m    macro table
        f    debug frame (encoded)
        F    debug frame (decoded)
        s    string table
        o    location lists
The DWARF listing for all but the smallest programs is quite voluminous, so it would be a good idea to direct readelf’s output to a file and then browse the file with less or an editor such as vi.