Udis86 Manual

Vivek Thampi

2009

Revision History
Revision 1.0, 02 May 2009

Table of Contents

1. Introduction
2. Getting Started
2.1. Building and Installing udis86
2.2. Interfacing with libudis86: A Quick Example
3. libudis86 Programming Interface
3.1. ud_t: udis86 object
3.2. Examining Instructions
3.2.1. Instruction Pointer
3.2.2. Instruction Prefixes
3.2.3. Instruction Mnemonic
3.2.4. Instruction Operands
3.3. Function Reference

1. Introduction

Udis86 is a disassembler engine that decodes a stream of binary machine code bytes as opcodes defined in the x86 and x86-64 class of Instruction Set Architectures. The core component of this project is libudis86, which provides a clean and simple interface for disassembling binary code and for inspecting the disassembly at various levels of detail. The library is designed to aid software projects that entail analysis and manipulation of all flavors of x86 binary code.

2. Getting Started

2.1. Building and Installing udis86

udis86 is developed for unix-like environments, and like most software, the basic steps towards building and installing it are as follows.
$ ./configure
$ make
$ make install
Depending on your choice of install location, you may need to have root privileges to do a make install. The install scripts copy the necessary header and library files to appropriate locations on the system.

2.2. Interfacing with libudis86: A Quick Example

The following program interfaces with libudis86 and uses the API to generate assembly language output for 64-bit code read from standard input.

Example 1. libudis86 Usage Example

#include <stdio.h>
#include <udis86.h>

int main()
{
    ud_t ud_obj;

    ud_init(&ud_obj);
    ud_set_input_file(&ud_obj, stdin);
    ud_set_mode(&ud_obj, 64);
    ud_set_syntax(&ud_obj, UD_SYN_INTEL);

    while (ud_disassemble(&ud_obj)) {
        printf("\t%s\n", ud_insn_asm(&ud_obj));
    }

    return 0;
} 

To compile the program (using gcc):
 $ gcc example.c -o example -ludis86
This example should give you an idea of how this library can be used. The following sections describe, in detail, the complete API of libudis86.

3. libudis86 Programming Interface

3.1. ud_t: udis86 object

libudis86 is reentrant, and to maintain that property it does not use static data. All data related to the disassembly are stored in a single object, called the udis86 object ud_t (struct ud). So, to use libudis86 you must create an instance of this object,
 ud_t ud_obj; 
and initialize it,
 ud_init(&ud_obj); 
You can create multiple such objects and use them with the library, each one maintaining its own disassembly state.

3.2. Examining Instructions

libudis86 exposes decoded instructions in an intermediate form meant to be useful for programs that want to examine them. This intermediate form is available as values of certain fields of the ud_t udis86 object used to disassemble the instruction, as described below.

3.2.1. Instruction Pointer

The program counter (eip/rip) value at which the instruction was decoded is available in ud_obj.pc.

3.2.2. Instruction Prefixes

Prefix bytes that affect the disassembly of the instruction are available in the following fields, each of which corresponds to a particular type or class of prefix.

  • ud_obj.pfx_rex - 64-bit mode REX prefix
  • ud_obj.pfx_seg - Segment register prefix
  • ud_obj.pfx_opr - Operand-size prefix (66h)
  • ud_obj.pfx_adr - Address-size prefix (67h)
  • ud_obj.pfx_lock - Lock prefix
  • ud_obj.pfx_rep - Rep prefix
  • ud_obj.pfx_repe - Repe prefix
  • ud_obj.pfx_repne - Repne prefix

These fields default to UD_NONE if the respective prefixes were not found.

3.2.3. Instruction Mnemonic

The instruction mnemonic in the form of an enumerated constant (enum ud_mnemonic_code) is available in ud_obj.mnemonic. By convention, all mnemonic constants are composed by prefixing standard instruction mnemonics with UD_I, for example UD_Imov, UD_Ixor, and UD_Ijmp.

3.2.4. Instruction Operands

The intermediate form for instruction operands is available as an array of objects of type struct ud_operand. Given a udis86 object ud_obj, the nth operand is available in ud_obj.operand[n].

struct ud_operand has the following fields,

  • type
  • size
  • base
  • index
  • scale
  • offset
  • lval

The type and size fields determine the type and size of the operand, respectively. The possible types of operands are,

  • UD_NONE

    No operand.

  • UD_OP_MEM

    Memory operand. The intermediate form normalizes all memory address equations to the scale-index-base form. The address equation is available in base, index, and scale. The offset field gives the width in bits of the memory offset; if it is non-zero (one of 8, 16, 32, or 64), lval contains the memory offset. Note that the base and index fields contain the base and index registers of the address equation, in the form of an enumerated constant of type enum ud_type. scale contains the integer value by which the index register must be scaled.

  • UD_OP_PTR

    A Segment:Offset pointer operand. size can have two values: 32 (for 16:16 seg:off) and 48 (for 16:32 seg:off). The value is available in lval (lval.ptr.seg and lval.ptr.off).

  • UD_OP_IMM

    Immediate operand. Value available in lval.

  • UD_OP_JIMM

    Immediate operand to branch instruction (relative offsets). Value available in lval.

  • UD_OP_CONST

    Implicit constant operand. Value available in lval.

  • UD_OP_REG

    Operand is a register. The specific register is contained in base in the form of an enumerated constant, enum ud_type.

lval is a union that aggregates integer fields of different sizes; which field holds the value depends on the type of operand.

  • lval.sbyte - Signed Byte
  • lval.ubyte - Unsigned Byte
  • lval.sword - Signed Word
  • lval.uword - Unsigned Word
  • lval.sdword - Signed Double Word
  • lval.udword - Unsigned Double Word
  • lval.sqword - Signed Quad Word
  • lval.uqword - Unsigned Quad Word
  • lval.ptr.seg - Pointer Segment in Segment:Offset
  • lval.ptr.off - Pointer Offset in Segment:Offset

The following enumerated constants (enum ud_type) are possible values for base and index. Note that a value of UD_NONE simply means that the field is not valid for the current instruction.

    UD_NONE,

    /* 8 bit GPRs */
    UD_R_AL,    UD_R_CL,    UD_R_DL,    UD_R_BL,
    UD_R_AH,    UD_R_CH,    UD_R_DH,    UD_R_BH,
    UD_R_SPL,   UD_R_BPL,   UD_R_SIL,   UD_R_DIL,
    UD_R_R8B,   UD_R_R9B,   UD_R_R10B,  UD_R_R11B,
    UD_R_R12B,  UD_R_R13B,  UD_R_R14B,  UD_R_R15B,

    /* 16 bit GPRs */
    UD_R_AX,    UD_R_CX,    UD_R_DX,    UD_R_BX,
    UD_R_SP,    UD_R_BP,    UD_R_SI,    UD_R_DI,
    UD_R_R8W,   UD_R_R9W,   UD_R_R10W,  UD_R_R11W,
    UD_R_R12W,  UD_R_R13W,  UD_R_R14W,  UD_R_R15W,
            
    /* 32 bit GPRs */
    UD_R_EAX,   UD_R_ECX,   UD_R_EDX,   UD_R_EBX,
    UD_R_ESP,   UD_R_EBP,   UD_R_ESI,   UD_R_EDI,
    UD_R_R8D,   UD_R_R9D,   UD_R_R10D,  UD_R_R11D,
    UD_R_R12D,  UD_R_R13D,  UD_R_R14D,  UD_R_R15D,
            
    /* 64 bit GPRs */
    UD_R_RAX,   UD_R_RCX,   UD_R_RDX,   UD_R_RBX,
    UD_R_RSP,   UD_R_RBP,   UD_R_RSI,   UD_R_RDI,
    UD_R_R8,    UD_R_R9,    UD_R_R10,   UD_R_R11,
    UD_R_R12,   UD_R_R13,   UD_R_R14,   UD_R_R15,

    /* segment registers */
    UD_R_ES,    UD_R_CS,    UD_R_SS,    UD_R_DS,
    UD_R_FS,    UD_R_GS,    

    /* control registers */
    UD_R_CR0,   UD_R_CR1,   UD_R_CR2,   UD_R_CR3,
    UD_R_CR4,   UD_R_CR5,   UD_R_CR6,   UD_R_CR7,
    UD_R_CR8,   UD_R_CR9,   UD_R_CR10,  UD_R_CR11,
    UD_R_CR12,  UD_R_CR13,  UD_R_CR14,  UD_R_CR15,
            
    /* debug registers */
    UD_R_DR0,   UD_R_DR1,   UD_R_DR2,   UD_R_DR3,
    UD_R_DR4,   UD_R_DR5,   UD_R_DR6,   UD_R_DR7,
    UD_R_DR8,   UD_R_DR9,   UD_R_DR10,  UD_R_DR11,
    UD_R_DR12,  UD_R_DR13,  UD_R_DR14,  UD_R_DR15,

    /* mmx registers */
    UD_R_MM0,   UD_R_MM1,   UD_R_MM2,   UD_R_MM3,
    UD_R_MM4,   UD_R_MM5,   UD_R_MM6,   UD_R_MM7,

    /* x87 registers */
    UD_R_ST0,   UD_R_ST1,   UD_R_ST2,   UD_R_ST3,
    UD_R_ST4,   UD_R_ST5,   UD_R_ST6,   UD_R_ST7, 

    /* extended multimedia registers */
    UD_R_XMM0,  UD_R_XMM1,  UD_R_XMM2,  UD_R_XMM3,
    UD_R_XMM4,  UD_R_XMM5,  UD_R_XMM6,  UD_R_XMM7,
    UD_R_XMM8,  UD_R_XMM9,  UD_R_XMM10, UD_R_XMM11,
    UD_R_XMM12, UD_R_XMM13, UD_R_XMM14, UD_R_XMM15,

    /* eip/rip */
    UD_R_RIP 

3.3. Function Reference

  • void ud_init (ud_t* ud_obj)

    ud_t object initializer. This function must be called on a udis86 object before it can be used anywhere else.

  • void ud_set_input_hook(ud_t* ud_obj, int (*hook)(ud_t*))

    This function sets the input source for the library. To retrieve each byte in the stream, libudis86 calls back the function pointed to by hook. The hook function, defined by the user code, must return a single byte of code each time it is called. To signal end-of-input, it must return the constant, UD_EOI.

  • void ud_set_user_opaque_data(ud_t* ud_obj, void* opaque);

    Associates a pointer with the udis86 object to be retrieved and used in user functions, such as the input hook callback function.

  • void* ud_get_user_opaque_data(ud_t* ud_obj);

    This function returns the pointer associated with the udis86 object by the ud_set_user_opaque_data function.

  • void ud_set_input_buffer(ud_t* ud_obj, unsigned char* buffer, size_t size);

    Sets the input source for the library to a buffer of fixed size.

  • void ud_set_input_file(ud_t* ud_obj, FILE* filep);

    This function sets the input source for the library to a file pointed to by the passed FILE pointer. Note that the library does not perform any checks, assuming the file pointer to be properly initialized.

  • void ud_set_mode(ud_t* ud_obj, uint8_t mode_bits);

    Sets the mode of disassembly. Possible values are 16, 32, and 64. By default, the library works in 32-bit mode.

  • void ud_set_pc(ud_t*, uint64_t pc);

    Sets the program counter (EIP/RIP). This changes the offset of the assembly output generated, with direct effect on branch instructions.

  • void ud_set_syntax(ud_t*, void (*translator)(ud_t*));

    libudis86 disassembles one instruction at a time into an intermediate form that lets you inspect the instruction and its various aspects individually. But to generate the assembly language output, this intermediate form must be translated. This function sets the translator. There are two inbuilt translators,

    • UD_SYN_INTEL - for INTEL (NASM-like) syntax.
    • UD_SYN_ATT - for AT&T (GAS-like) syntax.

    If you do not want libudis86 to translate, you can pass NULL to the function; no translation is performed thereafter. This is particularly useful when you only want to identify chunks of code and create the assembly output later, if needed.

    If you want to create your own translator, you must pass a pointer to a function that accepts a pointer to ud_t. This function will be called by libudis86 after each instruction is decoded.

  • void ud_set_vendor(ud_t*, unsigned vendor);

    Selects the vendor whose instruction set is used for decoding. This is only useful for choosing between the VMX and SVM instruction sets, where Intel and AMD have diverged significantly. At a later stage, support for a more granular selection of instruction sets may be added.

    • UD_VENDOR_INTEL - for INTEL instruction set.
    • UD_VENDOR_AMD - for AMD instruction set.

  • unsigned int ud_disassemble(ud_t*);

    Disassembles the next instruction in the input stream. Returns the number of bytes disassembled. A return value of 0 indicates end of input. To restart disassembly after the end of input, you must call one of the input setting functions with the new input source.

  • unsigned int ud_insn_len(ud_t* u);

    Returns the number of bytes disassembled.

  • uint64_t ud_insn_off(ud_t*);

    Returns the starting offset of the disassembled instruction relative to the program counter value specified initially.

  • char* ud_insn_hex(ud_t*);

    Returns a pointer to the character string holding the hexadecimal representation of the disassembled bytes.

  • uint8_t* ud_insn_ptr(ud_t* u);

    Returns a pointer to the buffer holding the instruction bytes. Use ud_insn_len() to determine the length of this buffer.

  • char* ud_insn_asm(ud_t* u);

    If the syntax is specified, returns a pointer to the character string holding the assembly language representation of the disassembled instruction.

  • void ud_input_skip(ud_t*, size_t n);

    Skips n bytes in the input stream.