Assembly matters more on Linux than on Windows: parts of the Linux kernel itself are written in assembly. However, most programmers coming to Linux have only met assembly language under DOS/Windows, where the code is written in Intel syntax. On Unix and Linux systems the AT&T syntax is more common, and the two differ considerably in their grammar. A basic familiarity with AT&T assembly is therefore worth having.
We start with a simple hello-world program written in C on Linux and saved as hello.c.
#include <stdio.h>
#include <stdlib.h>

int main()
{
    printf("hello,world\n");
    exit(0);
}
We then compile it with GCC, using the -S option to generate the intermediate assembly code and see what AT&T assembly really looks like.
gcc -S hello.c

    .file "hello.c"
    .section .rodata
.LC0:
    .string "hello,world"
    .text
.globl main
    .type main, @function
main:
    pushl %ebp
    movl %esp, %ebp
    andl $-16, %esp
    subl $16, %esp
    movl $.LC0, (%esp)
    call puts
    movl $0, (%esp)
    call exit
    .size main, .-main
    .ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
    .section .note.GNU-stack,"",@progbits
The job of an assembler is to convert a source program written in assembly language into binary object code. The standard assembler on Linux is GAS (the GNU Assembler), the back-end assembler that GCC relies on; it is normally shipped as part of the binutils package.
AT&T assembly syntax has the following characteristics:
1. In AT&T assembly format, register names are prefixed with '%'.
For example, to copy the contents of the eax register into ebx:
movl %eax,%ebx
2. Immediate operands are prefixed with '$'.
For example, to copy 1 into eax:
movl $1, %eax
3. The destination operand is written to the right of the source operand, the reverse of Intel syntax.
movl %eax,%ebx
Here eax is the source operand and ebx is the destination operand.
4. In AT&T assembly format, the operand size is determined by the last letter of the instruction mnemonic. The suffixes 'b', 'w' and 'l' denote byte (8-bit), word (16-bit) and long word (32-bit) operands, respectively.
For example:
movl operates on 32 bits and copies the 32-bit contents of the eax register into ebx
movl %eax, %ebx
movw operates on 16 bits and copies the contents of the ax register into bx
movw %ax, %bx
movb operates on 8 bits and copies the contents of the al register into bl
movb %al, %bl
Taking stack pushes as an example:
pushl %ecx     # push the 32-bit contents of ecx
pushw %cx      # push the 16-bit contents of cx
pushl $80      # push 80 as a 32-bit integer
pushl data     # push the 32-bit contents of the variable data
pushl $data    # push the address of the variable data
The last form, pushl $data, is a special case: a $ in front of a variable name takes the variable's address, so the instruction pushes the address of data rather than its contents.
5. In AT&T assembly format, the operand of an indirect jump or call is prefixed with '*'.
6. The opcodes for far jump and far call instructions are ljmp and lcall in AT&T assembly format.
We can see these characteristics from the generated intermediate code.
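The operand-order and size-suffix rules (items 3 and 4) can also be tried out from C, because GCC's extended inline assembly uses AT&T syntax by default. The sketch below is only an illustration under stated assumptions: the function names sub_att, copy_low16 and copy_low8 are made up for this example, it assumes GCC or Clang with the default AT&T dialect on an x86 target, and on any other target it falls back to plain C that computes the same result.

```c
#include <stdint.h>

/* Rule 3: "op source, destination". Returns dst - src. */
uint32_t sub_att(uint32_t dst, uint32_t src)
{
#if defined(__i386__) || defined(__x86_64__)
    /* subl %src, %dst means dst = dst - src (destination on the right) */
    __asm__("subl %1, %0" : "+r"(dst) : "r"(src) : "cc");
    return dst;
#else
    return dst - src;  /* portable fallback for non-x86 targets (assumption) */
#endif
}

/* Rule 4: movw copies only the low 16 bits; the upper 16 bits of dst are kept. */
uint32_t copy_low16(uint32_t dst, uint32_t src)
{
#if defined(__i386__) || defined(__x86_64__)
    /* %w0 and %w1 select the 16-bit register names, e.g. %ax */
    __asm__("movw %w1, %w0" : "+r"(dst) : "r"(src));
    return dst;
#else
    return (dst & 0xFFFF0000u) | (src & 0x0000FFFFu);
#endif
}

/* Rule 4: movb copies only the low 8 bits; the upper 24 bits of dst are kept. */
uint32_t copy_low8(uint32_t dst, uint32_t src)
{
#if defined(__i386__) || defined(__x86_64__)
    /* %b0 and %b1 select the 8-bit register names, e.g. %al;
       the "q" constraint asks for a register that has a byte form */
    __asm__("movb %b1, %b0" : "+q"(dst) : "q"(src));
    return dst;
#else
    return (dst & 0xFFFFFF00u) | (src & 0x000000FFu);
#endif
}
```

Calling copy_low16(0xAAAAAAAA, 0x12345678) replaces only the low half of the first argument, which is the effect of the w suffix; sub_att(10, 3) subtracts the source from the destination, so the destination is the operand on the right.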
Next, let's look at a hello-world program written directly in AT&T assembly.
.section .data                # initialized variables
output:                       # label marking where the string begins
    .ascii "hello,world\n"    # the string to print; .ascii is its data type
.section .bss                 # uninitialized variables; the buffer is zero-filled
    .lcomm num, 20            # .lcomm reserves a local memory area that cannot be accessed outside this file; .comm reserves a common area
.section .text                # assembly language instruction code
.globl _start                 # program entry point
_start:
    movl $4, %eax             # system call number, 4 = write
    movl $output, %ecx        # address of the string to print
    movl $1, %ebx             # file descriptor, 1 = screen (standard output)
    movl $12, %edx            # string length
    int $0x80                 # print the string hello,world
    movl $0, %eax
    movl $num, %edi           # edi points at the num buffer
    movb $65, 0(%edi)         # ASCII code of A
    movb $66, 1(%edi)         # ASCII code of B
    movb $67, 2(%edi)         # ASCII code of C
    movb $68, 3(%edi)         # ASCII code of D
    movb $10, 4(%edi)         # ASCII code of \n
    movl $4, %eax             # system call number, 4 = write
    movl $num, %ecx           # address of the string to print
    movl $1, %ebx             # file descriptor, 1 = screen (standard output)
    movl $5, %edx             # string length
    int $0x80                 # print the string ABCD
    movl $1, %eax             # system call number, 1 = exit
    movl $0, %ebx             # exit code returned to the shell
    int $0x80                 # software interrupt into the kernel; the program exits
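For readers more at home in C, the sketch below is a rough equivalent of what the assembly listing is meant to do: one write for hello,world, then the letters ABCD and a newline stored into a zero-filled buffer, then a second write. The names fill_abcd and print_messages are hypothetical, and the POSIX write function stands in for the raw int $0x80 system call; this is an illustration, not a translation produced by any tool.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Stores "ABCD\n" at the start of buf and returns the number of bytes stored.
   buf must have room for at least 5 bytes. */
size_t fill_abcd(char *buf)
{
    buf[0] = 65;   /* ASCII code of A */
    buf[1] = 66;   /* ASCII code of B */
    buf[2] = 67;   /* ASCII code of C */
    buf[3] = 68;   /* ASCII code of D */
    buf[4] = 10;   /* ASCII code of \n */
    return 5;
}

/* Writes both messages to file descriptor fd.
   Returns 0 on success and -1 if either write is short or fails. */
int print_messages(int fd)
{
    static const char output[] = "hello,world\n";  /* plays the role of the .data string */
    char num[20] = {0};                            /* plays the role of the zero-filled .bss buffer */
    size_t len = strlen(output);                   /* 12 bytes, as in movl $12, %edx */
    size_t n = fill_abcd(num);

    if (write(fd, output, len) != (ssize_t)len)    /* system call 4 (write) in the listing */
        return -1;
    if (write(fd, num, n) != (ssize_t)n)
        return -1;
    return 0;
}
```

A caller would invoke print_messages(1) to write to standard output and then leave with exit code 0, which corresponds to the final exit system call in the listing.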
The structure and content of the code above are as follows:
1. The .section .data segment stores initialized variables; the .section .bss segment stores uninitialized variables.
2. Variables are defined in the following format:
variable_name:
.variable_type variable_value
This is how the output variable in the above code is defined
output:
.ascii "hello,world\n"
The following example defines multiple variables
.section .data
msg:
.ascii "This is a text"
x:
.double 109.45, 2.33, 19.16
y:
.int 89
z:
.int 21, 85, 27
.equ a, 8
Here msg is a character string, x holds double-precision floating-point numbers, and y and z hold integers. a is a special case: it is a static symbol (a constant) rather than a variable, defined with .equ name, value.
3. Memory areas in .section .bss are declared according to the following rules:
.lcomm declares a local memory area, i.e. one that cannot be accessed from outside the local assembly file, while .comm declares a common memory area.
For example, the definition above
.lcomm num,20
makes num a local memory area of 20 bytes.
4. .section .text holds the assembly language instruction code, and the label declared with .globl _start is the program's entry point.
5. # starts a comment. The rest of the code above is commented line by line, so programmers with some assembly background should find it easy to follow.
Variables can be of the following types:
.ascii    text string
.asciz    text string terminated with a NULL byte
.byte     byte value (8 bits)
.double   double-precision floating-point number
.float    single-precision floating-point number
.int      32-bit integer
.long     32-bit integer (same as .int)
.octa     16-byte integer
.quad     8-byte integer
.short    16-bit integer
.single   single-precision floating-point number (same as .float)
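To make the sizes concrete, the following C sketch tabulates how many bytes one value of each numeric directive occupies with GAS on x86. The helper names directive_size and data_bytes are invented for this example, and the string directives are left out of the table because their size depends on the text.

```c
#include <stddef.h>
#include <string.h>

/* Bytes occupied by ONE value of a numeric data directive (GAS on x86).
   Returns 0 if the directive is not in the table. */
size_t directive_size(const char *directive)
{
    static const struct { const char *name; size_t bytes; } table[] = {
        { ".byte", 1 },  { ".short", 2 },  { ".int", 4 },    { ".long", 4 },
        { ".quad", 8 },  { ".octa", 16 },
        { ".float", 4 }, { ".single", 4 }, { ".double", 8 },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(directive, table[i].name) == 0)
            return table[i].bytes;
    }
    return 0;
}

/* Bytes occupied by `count` comma-separated values following one directive. */
size_t data_bytes(const char *directive, size_t count)
{
    return directive_size(directive) * count;
}
```

Applied to the earlier .data example, msg takes 14 bytes (one per character, with no terminator), x takes data_bytes(".double", 3) = 24 bytes, y takes 4 bytes and z takes 12 bytes, for 54 bytes in total, assuming no alignment directives are added.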
In addition, AT&T assembly code often uses operations such as byte-order reversal (bswap), compare-and-exchange (cmpxchg), exchange (xchg), and pushing and popping all registers (pusha/popa). The following examples cover these operations.
Each line of code has detailed comments.
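As a small, hedged illustration of three of these operations, the sketch below wraps bswap, xchg and cmpxchg in C functions using GCC inline assembly, which is written in AT&T syntax by default. The wrapper names bswap32, swap32 and cmpxchg32 are hypothetical, GCC or Clang on x86 is assumed, and other targets use a plain C fallback with the same result. Pushing and popping all registers is left out because pusha and popa are not available in 64-bit mode.

```c
#include <stdint.h>

/* Byte-order reversal: 0x12345678 becomes 0x78563412. */
uint32_t bswap32(uint32_t x)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__("bswap %0" : "+r"(x));   /* reverses the four bytes of the register */
    return x;
#else
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
}

/* Exchange: swaps the two 32-bit values. */
void swap32(uint32_t *a, uint32_t *b)
{
    uint32_t x = *a, y = *b;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("xchgl %0, %1" : "+r"(x), "+r"(y));  /* swaps the two registers */
#else
    uint32_t t = x; x = y; y = t;
#endif
    *a = x;
    *b = y;
}

/* Compare-and-exchange: if *dest equals expected, newval is stored in *dest.
   Returns the value *dest held before the call.
   Not atomic across CPUs, because no lock prefix is used. */
uint32_t cmpxchg32(uint32_t *dest, uint32_t expected, uint32_t newval)
{
#if defined(__i386__) || defined(__x86_64__)
    /* cmpxchgl compares %eax with the destination; "+a" pins expected to %eax.
       On a mismatch the CPU loads the destination's value into %eax. */
    __asm__("cmpxchgl %2, %1"
            : "+a"(expected), "+m"(*dest)
            : "r"(newval)
            : "cc");
    return expected;
#else
    uint32_t old = *dest;
    if (old == expected)
        *dest = newval;
    return old;
#endif
}
```

For instance, with a variable holding 5, cmpxchg32(&v, 5, 9) stores 9 and returns 5, while a second call with the same arguments leaves v unchanged and returns 9.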
Data elements defined in the .bss segment are uninitialized variables: the memory is zero-filled when the program starts, and values are stored in it at run time.
The memory areas declared there are either common memory areas (.comm) or local common memory areas (.lcomm).
A local common memory area cannot be accessed from outside the local assembly code.
The .text segment stores the program's instruction code.