5.1. Decompress Kernel
The segment base addresses in segment descriptors (which correspond to segment selector __KERNEL_CS and __KERNEL_DS) are equal to 0; therefore, the logical address offset (in segment:offset format) will be equal to its linear address if either of these segment selectors is used. For zImage, CS:EIP is at logical address 10:1000 (linear address 0x1000) now; for bzImage, 10:100000 (linear address 0x100000).
As paging is not enabled yet, the linear address is identical to the physical address. Check IA-32 Manual (Vol.1. Ch.3.3. Memory Organization, and Vol.3. Ch.3. Protected-Mode Memory Management) and
Linux Device Drivers: Memory Management in Linux for addressing details.
On entry from setup.S, BX=0 and ESI=INITSEG<<4.
    .text
    ///////////////////////////////////////////////////////////////////////////////
    startup_32()
    {
        cld;
        cli;
        DS = ES = FS = GS = __KERNEL_DS;
        SS:ESP = *stack_start;  // end of user_stack[], defined in misc.c
        // all segment registers are reloaded after protected mode is enabled

        // check that A20 really IS enabled
        EAX = 0;
        do {
    1:      DS:[0] = ++EAX;
        } while (DS:[0x100000]==EAX);

        EFLAGS = 0;
        clear BSS;                              // from _edata to _end

        struct moveparams mp;                   // subl $16,%esp
        if (!decompress_kernel(&mp, ESI)) {     // return value in AX
            restore ESI from stack;
            EBX = 0;
            goto __KERNEL_CS:100000;
                // see linux/arch/i386/kernel/head.S:startup_32
        }

        /*
         * We come here, if we were loaded high.
         * We need to move the move-in-place routine down to 0x1000
         * and then start it with the buffer addresses in registers,
         * which we got from the stack.
         */
    3:  move move_routine_start..move_routine_end to 0x1000;
        // move_routine_start & move_routine_end are defined below

        // prepare move_routine_start() parameters
        EBX = real mode pointer;        // ESI value passed from setup.S
        ESI = mp.low_buffer_start;
        ECX = mp.lcount;
        EDX = mp.high_buffer_start;
        EAX = mp.hcount;
        EDI = 0x100000;
        cli;                    // make sure we don't get interrupted
        goto __KERNEL_CS:1000;  // move_routine_start();
    }

    /* Routine (template) for moving the decompressed kernel in place,
     * if we were high loaded. This _must_ PIC-code !
     */
    ///////////////////////////////////////////////////////////////////////////////
    move_routine_start()
    {
        move mp.low_buffer_start to 0x100000, mp.lcount bytes,
          in two steps: (lcount >> 2) words + (lcount & 3) bytes;
        move/append mp.high_buffer_start, ((mp.hcount + 3) >> 2) words
        // 1 word == 4 bytes, as I mean 32-bit code/data.

        ESI = EBX;      // real mode pointer, as that from setup.S
        EBX = 0;
        goto __KERNEL_CS:100000;
            // see linux/arch/i386/kernel/head.S:startup_32()
    move_routine_end:
    }
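The two-step copy in move_routine_start() may be easier to follow as plain C. The following is only a user-space sketch, not the kernel routine: copy_words_then_bytes() and rounded_word_count() are hypothetical names, and ordinary arrays stand in for the low buffer and the 0x100000 target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes as (n >> 2) 32-bit words followed by (n & 3) single bytes,
 * the same split the pseudocode uses for the low buffer. */
static void copy_words_then_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t words = n >> 2;      /* whole 4-byte words */
    size_t tail = n & 3;        /* 0..3 leftover bytes */

    for (size_t i = 0; i < words; i++) {
        uint32_t w;
        memcpy(&w, src + 4 * i, sizeof w);      /* load one word */
        memcpy(dst + 4 * i, &w, sizeof w);      /* store one word */
    }
    for (size_t i = 0; i < tail; i++)
        dst[4 * words + i] = src[4 * words + i];
}

/* The high buffer is moved in whole words only, so its byte count is
 * rounded up: (hcount + 3) >> 2 words. */
static size_t rounded_word_count(size_t hcount)
{
    return (hcount + 3) >> 2;
}
```

For example, a 10-byte low buffer is copied as 2 words plus 2 bytes, while a 5-byte high buffer would be moved as 2 full words.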
For the meaning of "je 1b" and "jnz 3f", refer to
Using as: Local Symbol Names.
Didn't find _edata and _end definitions? No problem, they are defined in the "internal linker script". Without -T (--script=) option specified, ld uses this builtin script to link compressed/bvmlinux. Use "ld --verbose" to display this script, or check Appendix B. Internal Linker Script.
Refer to
Using LD, the GNU linker: Command Line Options for -T (--script=), -L (--library-path=) and --verbose option description. "man ld" and "info ld" may help too.
piggy.o has been unzipped and control is passed to __KERNEL_CS:100000, i.e. linux/arch/i386/kernel/head.S:startup_32(). See Section 6.
    #define LOW_BUFFER_START 0x2000
    #define LOW_BUFFER_MAX 0x90000
    #define HEAP_SIZE 0x3000
    ///////////////////////////////////////////////////////////////////////////////
    asmlinkage int decompress_kernel(struct moveparams *mv, void *rmode)
    |-- setup real_mode(=rmode), vidmem, vidport, lines and cols;
    |-- if (is_zImage) setup_normal_output_buffer() {
    |       output_data = 0x100000;
    |       free_mem_end_ptr = real_mode;
    |   } else (is_bzImage) setup_output_buffer_if_we_run_high(mv) {
    |       output_data = LOW_BUFFER_START;
    |       low_buffer_end = MIN(real_mode, LOW_BUFFER_MAX) & ~0xfff;
    |       low_buffer_size = low_buffer_end - LOW_BUFFER_START;
    |       free_mem_end_ptr = &end + HEAP_SIZE;
    |       // get mv->low_buffer_start and mv->high_buffer_start
    |       mv->low_buffer_start = LOW_BUFFER_START;
    |       /* To make this program work, we must have
    |        *   high_buffer_start > &end+HEAP_SIZE;
    |        * As we will move low_buffer from LOW_BUFFER_START to 0x100000
    |        *   (max low_buffer_size bytes) finally, we should have
    |        *   high_buffer_start > 0x100000+low_buffer_size; */
    |       mv->high_buffer_start = high_buffer_start
    |           = MAX(&end+HEAP_SIZE, 0x100000+low_buffer_size);
    |       mv->hcount = 0 if (0x100000+low_buffer_size > &end+HEAP_SIZE);
    |                  = -1 if (0x100000+low_buffer_size <= &end+HEAP_SIZE);
    |       /* mv->hcount==0 : we need not move high_buffer later,
    |        * as it is already at 0x100000+low_buffer_size.
    |        * Used by close_output_buffer_if_we_run_high() below. */
    |   }
    |-- makecrc();      // create crc_32_tab[]
    |   puts("Uncompressing Linux... ");
    |-- gunzip();
    |   puts("Ok, booting the kernel.\n");
    |-- if (is_bzImage) close_output_buffer_if_we_run_high(mv) {
    |       // get mv->lcount and mv->hcount
    |       if (bytes_out > low_buffer_size) {
    |           mv->lcount = low_buffer_size;
    |           if (mv->hcount)
    |               mv->hcount = bytes_out - low_buffer_size;
    |       } else {
    |           mv->lcount = bytes_out;
    |           mv->hcount = 0;
    |       }
    |   }
    `-- return is_bzImage;      // return value in AX
end is defined in the "internal linker script" too.
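To see how the bzImage buffer bookkeeping plays out with numbers, here is a small C model of the two helper steps outlined above. It is a sketch under assumptions: addresses are plain 32-bit integers, struct moveparams_sketch, setup_high() and close_high() are hypothetical names, and end_addr stands in for &end.

```c
#include <stdint.h>

#define LOW_BUFFER_START 0x2000u
#define LOW_BUFFER_MAX   0x90000u
#define HEAP_SIZE        0x3000u
#define KERNEL_TARGET    0x100000u  /* hypothetical name for the 1 MB target */

/* Hypothetical flat-integer stand-in for struct moveparams. */
struct moveparams_sketch {
    uint32_t low_buffer_start;
    int32_t  lcount;
    uint32_t high_buffer_start;
    int32_t  hcount;
};

/* Models setup_output_buffer_if_we_run_high(); returns low_buffer_size. */
static uint32_t setup_high(struct moveparams_sketch *mv,
                           uint32_t real_mode, uint32_t end_addr)
{
    uint32_t low_end = (real_mode < LOW_BUFFER_MAX ? real_mode
                                                   : LOW_BUFFER_MAX) & ~0xfffu;
    uint32_t low_size = low_end - LOW_BUFFER_START;
    uint32_t heap_end = end_addr + HEAP_SIZE;           /* &end + HEAP_SIZE */
    uint32_t final_low_end = KERNEL_TARGET + low_size;  /* low part after move */

    mv->low_buffer_start = LOW_BUFFER_START;
    if (final_low_end > heap_end) {
        mv->high_buffer_start = final_low_end;  /* already in its final place */
        mv->hcount = 0;
    } else {
        mv->high_buffer_start = heap_end;       /* must be moved later */
        mv->hcount = -1;
    }
    return low_size;
}

/* Models close_output_buffer_if_we_run_high(). */
static void close_high(struct moveparams_sketch *mv,
                       uint32_t bytes_out, uint32_t low_size)
{
    if (bytes_out > low_size) {
        mv->lcount = (int32_t)low_size;
        if (mv->hcount)
            mv->hcount = (int32_t)(bytes_out - low_size);
    } else {
        mv->lcount = (int32_t)bytes_out;
        mv->hcount = 0;
    }
}
```

With real_mode = 0x90000 the low buffer holds 0x8e000 bytes. If end_addr + HEAP_SIZE lies below 0x18e000, the high buffer starts right there and hcount stays 0; otherwise hcount ends up as the number of bytes that still have to be moved down. The sample values of end_addr are made up for illustration.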
decompress_kernel() has an "asmlinkage" modifier. In linux/include/linux/linkage.h:

    #ifdef __cplusplus
    #define CPP_ASMLINKAGE extern "C"
    #else
    #define CPP_ASMLINKAGE
    #endif

    #if defined __i386__
    #define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
    #elif defined __ia64__
    #define asmlinkage CPP_ASMLINKAGE __attribute__((syscall_linkage))
    #else
    #define asmlinkage CPP_ASMLINKAGE
    #endif

The "asmlinkage" macro forces the compiler to pass all function arguments on the stack, in case some optimization would otherwise change this convention. Check Using the GNU Compiler Collection (GCC): Declaring Attributes of Functions (regparm) and Kernelnewbies FAQ: What is asmlinkage for more details.
5.2. gunzip()
decompress_kernel() calls gunzip() -> inflate(), which are defined in linux/lib/inflate.c, to decompress the resident kernel image into the low buffer (pointed to by output_data) and the high buffer (pointed to by high_buffer_start, for bzImage only).
The gzip file format is specified in RFC 1952.

Table 6. gzip file format

    Component          Meaning             Byte  Comment
    -----------------  ------------------  ----  ----------------------------------------
    ID1                IDentification 1    1     31 (0x1f, \037)
    ID2                IDentification 2    1     139 (0x8b, \213) [a]
    CM                 Compression Method  1     8 - denotes the "deflate" compression method
    FLG                FLaGs               1     0 for most cases
    MTIME              Modification TIME   4     modification time of the original file
    XFL                eXtra FLags         1     2 - compressor used maximum compression, slowest algorithm [b]
    OS                 Operating System    1     3 - Unix
    extra fields       -                   -     variable length, field indicated by FLG [c]
    compressed blocks  -                   -     variable length
    CRC32              -                   4     CRC value of the uncompressed data
    ISIZE              Input SIZE          4     the size of the uncompressed input data modulo 2^32
Notes:
a. ID2 value can be 158 (0x9e, \236) for gzip 0.5;
b. XFL value 4 - compressor used fastest algorithm;
c. FLG bit 0, FTEXT, does not indicate any "extra field".
We can use this file format knowledge to find out the beginning of gzipped linux/vmlinux.

    [root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | grep '1f 8b 08 00'
    00004c50 1f 8b 08 00 01 f6 e1 3f 02 03 ec 5d 7d 74 14 55 |.......?...]}t.U|
    [root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 -s 0x4c40 -n 64
    00004c40 00 80 0b 00 00 fc 21 00 68 00 00 00 1e 01 11 00 |......!.h.......|
    00004c50 1f 8b 08 00 01 f6 e1 3f 02 03 ec 5d 7d 74 14 55 |.......?...]}t.U|
    00004c60 96 7f d5 a9 d0 1d 4d ac 56 93 35 ac 01 3a 9c 6a |......M.V.5..:.j|
    00004c70 4d 46 5c d3 7b f8 48 36 c9 6c 84 f0 25 88 20 9f |MF\.{.H6.l..%. .|
    00004c80
    [root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | tail -n 4
    00114d40 bd 77 66 da ce 6f 3d d6 33 5c 14 a2 9f 7e fa e9 |.wf..o=.3\...~..|
    00114d50 a7 9f 7e fa ff 57 3f 00 00 00 00 00 d8 bc ab ea |..~..W?.........|
    00114d60 44 5d 76 d1 fd 03 33 58 c2 f0 00 51 27 00 |D]v...3X...Q'.|
    00114d6e
We can see that the gzipped file begins at 0x4c50 in the above example. The four bytes before "1f 8b 08 00" are input_len (0x0011011e, in little endian), and 0x4c50+0x0011011e=0x114d6e equals the size of bzImage (/boot/vmlinuz-2.4.20-28.9).
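The same arithmetic can be checked in a few lines of C. This is only a sketch: read_le32() and find_gzip_magic() are hypothetical helper names, and the buffer is meant to hold bytes copied from a hexdump like the one above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored in little-endian order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the offset of the first {ID1, ID2, CM} = {1f, 8b, 08} sequence,
 * or -1 if there is none. */
static long find_gzip_magic(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 3 <= len; i++)
        if (buf[i] == 0x1f && buf[i + 1] == 0x8b && buf[i + 2] == 0x08)
            return (long)i;
    return -1;
}
```

Fed with the 32 bytes dumped from file offset 0x4c40, find_gzip_magic() returns 16 (file offset 0x4c50), and read_le32() on the four bytes just before it yields 0x0011011e, the input_len used in the sum above.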
    static uch *inbuf;              /* input buffer */
    static unsigned insize = 0;     /* valid bytes in inbuf */
    static unsigned inptr = 0;      /* index of next byte to be processed in inbuf */
    ///////////////////////////////////////////////////////////////////////////////
    static int gunzip(void)
    {
        Check input buffer for {ID1, ID2, CM}, must be
          {0x1f, 0x8b, 0x08} (normal case), or
          {0x1f, 0x9e, 0x08} (for gzip 0.5);
        Check FLG (flag byte), must not set bit 1, 5, 6 and 7;
        Ignore {MTIME, XFL, OS};
        Handle optional structures, which correspond to FLG bit 2, 3 and 4;
        inflate();      // handle compressed blocks
        Validate {CRC32, ISIZE};
    }
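The header checks outlined in gunzip() can be sketched in C as follows. This is not the kernel code: gzip_header_len() and skip_zstring() are hypothetical names, the flag names are my own labels for the FLG bits listed above, and the function works on a memory buffer instead of get_byte().

```c
#include <stddef.h>
#include <stdint.h>

enum {
    FLG_CONTINUATION = 0x02,    /* bit 1: must not be set */
    FLG_EXTRA        = 0x04,    /* bit 2: extra field present */
    FLG_NAME         = 0x08,    /* bit 3: original file name present */
    FLG_COMMENT      = 0x10,    /* bit 4: file comment present */
    FLG_ENCRYPTED    = 0x20,    /* bit 5: must not be set */
    FLG_RESERVED     = 0xC0     /* bits 6 and 7: must not be set */
};

/* Skip a zero-terminated string starting at pos; -1 if it is unterminated. */
static long skip_zstring(const uint8_t *p, size_t len, size_t pos)
{
    while (pos < len && p[pos] != 0)
        pos++;
    return pos < len ? (long)(pos + 1) : -1;
}

/* Return the offset of the first compressed block, or -1 on a bad header. */
static long gzip_header_len(const uint8_t *p, size_t len)
{
    long pos = 10;      /* ID1 ID2 CM FLG MTIME(4) XFL OS */

    if (len < 10)
        return -1;
    if (p[0] != 0x1f || (p[1] != 0x8b && p[1] != 0x9e) || p[2] != 0x08)
        return -1;
    uint8_t flg = p[3];
    if (flg & (FLG_CONTINUATION | FLG_ENCRYPTED | FLG_RESERVED))
        return -1;
    /* MTIME, XFL and OS (bytes 4..9) are ignored */
    if (flg & FLG_EXTRA) {
        if ((size_t)pos + 2 > len)
            return -1;
        size_t xlen = (size_t)p[pos] | ((size_t)p[pos + 1] << 8);
        pos += 2 + (long)xlen;
        if ((size_t)pos > len)
            return -1;
    }
    if (flg & FLG_NAME)
        pos = skip_zstring(p, len, (size_t)pos);
    if (pos >= 0 && (flg & FLG_COMMENT))
        pos = skip_zstring(p, len, (size_t)pos);
    return pos;
}
```

For the vmlinuz header shown earlier (FLG = 0), the compressed blocks start right after the 10 fixed bytes.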
When get_byte(), defined in linux/arch/i386/boot/compressed/misc.c, is called for the first time, it calls fill_inbuf() to set up the input buffer with inbuf=input_data and insize=input_len. Symbols input_data and input_len are defined in the piggy.o linker script. See Section 2.5.
5.3. inflate()
    // some important definitions in misc.c
    #define WSIZE 0x8000            /* Window size must be at least 32k,
                                     * and a power of two */
    static uch window[WSIZE];       /* Sliding window buffer */
    static unsigned outcnt = 0;     /* bytes in output buffer */

    // linux/lib/inflate.c
    #define wp outcnt
    #define flush_output(w) (wp=(w),flush_window())
    STATIC unsigned long bb;        /* bit buffer */
    STATIC unsigned bk;             /* bits in bit buffer */
    STATIC unsigned hufts;          /* track memory usage */
    static long free_mem_ptr = (long)&end;
    ///////////////////////////////////////////////////////////////////////////////
    STATIC int inflate()
    {
        int e;          /* last block flag */
        int r;          /* result code */
        unsigned h;     /* maximum struct huft's malloc'ed */
        void *ptr;

        wp = bb = bk = 0;

        // inflate compressed blocks one by one
        do {
            hufts = 0;
            gzip_mark() { ptr = free_mem_ptr; };
            if ((r = inflate_block(&e)) != 0) {
                gzip_release() { free_mem_ptr = ptr; };
                return r;
            }
            gzip_release() { free_mem_ptr = ptr; };
            if (hufts > h)
                h = hufts;
        } while (!e);

        /* Undo too much lookahead. The next read will be byte aligned so we
         * can discard unused bits in the last meaningful byte. */
        while (bk >= 8) {
            bk -= 8;
            inptr--;
        }

        /* write the output window window[0..outcnt-1] to output_data,
         * update output_ptr/output_data, crc and bytes_out accordingly, and
         * reset outcnt to 0. */
        flush_output(wp);

        /* return success */
        return 0;
    }
free_mem_ptr is used in misc.c:malloc() for dynamic memory allocation. Before inflating each compressed block, gzip_mark() saves the value of free_mem_ptr; after inflation, gzip_release() restores this value. This is how the memory allocated in inflate_block() gets "free()"d: everything allocated since the mark is dropped at once.
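The mark/release idea is a plain bump allocator. Here is a minimal sketch in C, assuming a fixed static heap; bump_malloc(), heap_mark() and heap_release() are hypothetical names that only mirror the roles of malloc(), gzip_mark() and gzip_release().

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t heap[0x3000];    /* stand-in for the HEAP_SIZE area */
static size_t free_off;         /* plays the role of free_mem_ptr */

/* Allocate by advancing the free offset; there is no per-block free(). */
static void *bump_malloc(size_t size)
{
    free_off = (free_off + 3) & ~(size_t)3;     /* align to 4 bytes */
    if (size > sizeof heap - free_off)
        return NULL;                            /* out of memory */
    void *p = heap + free_off;
    free_off += size;
    return p;
}

/* Remember the current allocation point. */
static size_t heap_mark(void)
{
    return free_off;
}

/* Drop everything allocated since the matching heap_mark(). */
static void heap_release(size_t mark)
{
    free_off = mark;
}
```

A caller would take a mark, allocate Huffman tables for one block, and release the mark afterwards, so the next block reuses the same memory.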
Gzip uses Lempel-Ziv coding (LZ77) to compress files. The compressed data format is specified in RFC 1951. inflate_block() will inflate compressed blocks, which can be treated as a bit sequence.
The data structure of each compressed block is outlined below:

    BFINAL (1 bit)
        0 - not the last block
        1 - the last block
    BTYPE (2 bits)
        00 - no compression
            remaining bits until the byte boundary;
            LEN (2 bytes);
            NLEN (2 bytes, the one's complement of LEN);
            data (LEN bytes);
        01 - compressed with fixed Huffman codes
            {
                literal (7-9 bits, represent code 0..287, excluding 256);
                    // See RFC 1951, table in Paragraph 3.2.6.
                length (0-5 bits if literal > 256, represent length 3..258);
                    // See RFC 1951, 1st alphabet table in Paragraph 3.2.5.
                data (of literal bytes if literal < 256);
                distance (5 plus 0-13 extra bits if literal == 257..285,
                          represent distance 1..32768);
                    /* See RFC 1951, 2nd alphabet table in Paragraph 3.2.5,
                     * but statement in Paragraph 3.2.6. */
                    /* Move backward "distance" bytes in the output stream,
                     * and copy "length" bytes */
            }*  // can be of multiple instances
            literal (7 bits, all 0, literal == 256, means end of block);
        10 - compressed with dynamic Huffman codes
            HLIT (5 bits, # of Literal/Length codes - 257, 257-286);
            HDIST (5 bits, # of Distance codes - 1, 1-32);
            HCLEN (4 bits, # of Code Length codes - 4, 4 - 19);
            Code Length sequence ((HCLEN+4)*3 bits)
            /* The following two alphabet tables will be decoded using
             * the Huffman decoding table which is generated from
             * the preceding Code Length sequence. */
            Literal/Length alphabet (HLIT+257 codes)
            Distance alphabet (HDIST+1 codes)
            // Decoding tables will be built from these alphabet tables.
            /* The following is similar to that of fixed Huffman codes portion,
             * except that they use different decoding tables. */
            {
                literal/length
                    (variable length, depending on Literal/Length alphabet);
                data (of literal bytes if literal < 256);
                distance (variable length if literal == 257..285,
                          depending on Distance alphabet);
            }*  // can be of multiple instances
            literal (literal value 256, which means end of block);
        11 - reserved (error)
Note that data elements are packed into bytes starting from the Least-Significant Bit (LSB) to the Most-Significant Bit (MSB), while Huffman codes are packed starting with the MSB. Also note that literal/length values 286-287 and distance codes 30-31 will never actually occur.
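The LSB-first packing is easiest to see with a tiny bit reader. The sketch below is loosely modelled on the bb/bk bit buffer of inflate.c, but struct bitreader, get_bits() and read_block_header() are hypothetical names, and reads are limited to 16 bits at a time.

```c
#include <stddef.h>
#include <stdint.h>

struct bitreader {
    const uint8_t *data;    /* input bytes */
    size_t len;             /* number of input bytes */
    size_t pos;             /* index of the next byte to load */
    uint32_t bb;            /* bit buffer, next bit to deliver is bit 0 */
    unsigned bk;            /* number of valid bits in bb */
};

/* Fetch n bits (n <= 16), least-significant bit first; -1 on end of input. */
static int get_bits(struct bitreader *br, unsigned n, uint32_t *out)
{
    while (br->bk < n) {
        if (br->pos >= br->len)
            return -1;
        br->bb |= (uint32_t)br->data[br->pos++] << br->bk;  /* append above */
        br->bk += 8;
    }
    *out = br->bb & ((1u << n) - 1);
    br->bb >>= n;
    br->bk -= n;
    return 0;
}

/* Read the 3-bit block header: BFINAL (1 bit), then BTYPE (2 bits). */
static int read_block_header(struct bitreader *br,
                             uint32_t *bfinal, uint32_t *btype)
{
    if (get_bits(br, 1, bfinal) < 0)
        return -1;
    return get_bits(br, 2, btype);
}
```

In the vmlinuz dump shown in Section 5.2, the first byte after the 10-byte gzip header is 0xec (binary 11101100): bit 0 gives BFINAL=0, and the next two bits give BTYPE=2, i.e. a block compressed with dynamic Huffman codes that is not the last one.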
With the above data structure in mind and RFC 1951 at hand, it is not too hard to understand inflate_block(). Refer to the related paragraphs in RFC 1951 for Huffman coding and alphabet table generation.
For more details, refer to linux/lib/inflate.c, gzip source code (many in-line comments) and related reference materials.
5.4. Reference
Using as
Using LD, the GNU linker
IA-32 Intel Architecture Software Developer's Manual
The gzip home page
gzip (freshmeat.net)
RFC 1951: DEFLATE Compressed Data Format Specification version 1.3
RFC 1952: GZIP file format specification version 4.3
6. linux/arch/i386/kernel/head.S
Resident kernel image linux/vmlinux is in place finally! It requires two inputs:
    ESI, to indicate where the 16-bit real mode code is located, aka INITSEG<<4;
    BX, to indicate which CPU is running, 0 means BSP, other values for AP.
ESI points to the parameter area from the 16-bit real mode code, which will be copied to empty_zero_page later. ESI is only valid for BSP.
BSP (BootStrap Processor) and APs (Application Processors) are Intel terminologies. Check IA-32 Manual (Vol.3. Ch.7.5. Multiple-Processor (MP) Initialization) and
MultiProcessor Specification for MP initialization issues.
From a software point of view, in a multiprocessor system, the BSP and APs share the physical memory but use their own register sets. The BSP runs the kernel code first, sets up the OS execution environment and triggers the APs to run over it too. An AP sleeps until the BSP kicks it.
6.1. Enable Paging
    .text
    ///////////////////////////////////////////////////////////////////////////////
    startup_32()
    {
        /* set segments to known values */
        cld;
        DS = ES = FS = GS = __KERNEL_DS;
    #ifdef CONFIG_SMP
    #define cr4_bits mmu_cr4_features-__PAGE_OFFSET
        /* long mmu_cr4_features defined in linux/arch/i386/kernel/setup.c
         * __PAGE_OFFSET = 0xC0000000, i.e. 3G */
        // AP with CR4 support (> Intel 486) will copy CR4 from BSP
        if (BX && cr4_bits) {
            // turn on paging options (PSE, PAE, ...)
            CR4 |= cr4_bits;
        } else
    #endif
        {
            /* only BSP initializes page tables (pg0..empty_zero_page-1)
             *   pg0 at .org 0x2000
             *   empty_zero_page at .org 0x4000
             *   total (0x4000-0x2000)/4 = 0x0800 entries */
            pg0 = {
                0x00000007,     // 7 = PRESENT + RW + USER
                0x00001007,     // 0x1000 = 4096 = 4K
                0x00002007,
                ...
        pg1:    0x00400007,
                ...
                0x007FF007      // total 8M
        empty_zero_page:
            };
        }
Why do we have to add "-__PAGE_OFFSET" when referring to a kernel symbol, for example, like pg0?
In linux/arch/i386/vmlinux.lds, we have:

    . = 0xC0000000 + 0x100000;
    _text = .;                  /* Text and read-only data */
    .text : {
        *(.text)
    ...

As pg0 is at offset 0x2000 of section .text in linux/arch/i386/kernel/head.o, which is the first file to be linked for linux/vmlinux, it will be at offset 0x2000 in output section .text. Thus it will be located at address 0xC0000000+0x100000+0x2000 after linking.

    [root@localhost boot]# nm --defined /boot/vmlinux-2.4.20-28.9 | grep 'startup_32\|mmu_cr4_features\|pg0\|\<empty_zero_page\>' | sort
    c0100000 t startup_32
    c0102000 T pg0
    c0104000 T empty_zero_page
    c0376404 B mmu_cr4_features
In protected mode without paging enabled, a linear address is mapped directly to the same physical address. "movl $pg0-__PAGE_OFFSET,%edi" will set EDI=0x102000, which is equal to the physical address of pg0 (as linux/vmlinux is relocated to 0x100000). Without this "-__PAGE_OFFSET" scheme, it would access physical address 0xC0102000, which is wrong and probably beyond RAM space.
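The subtraction is simple enough to check by hand, but a short C sketch makes the symbol arithmetic explicit. The names PAGE_OFFSET_SKETCH, virt_to_phys_sketch() and phys_to_virt_sketch() are hypothetical; the link addresses used with them come from the nm output above.

```c
#include <stdint.h>

#define PAGE_OFFSET_SKETCH 0xC0000000u  /* 3G, as in __PAGE_OFFSET */

/* Link-time (virtual) address of a kernel symbol -> physical address. */
static uint32_t virt_to_phys_sketch(uint32_t vaddr)
{
    return vaddr - PAGE_OFFSET_SKETCH;
}

/* Physical address -> the address the linker assigned to the symbol. */
static uint32_t phys_to_virt_sketch(uint32_t paddr)
{
    return paddr + PAGE_OFFSET_SKETCH;
}
```

For instance, pg0 at link address 0xc0102000 lands at physical address 0x102000, and mmu_cr4_features at 0xc0376404 lands at 0x376404.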
mmu_cr4_features is in .bss section and is located at physical address 0x376404 in the above example.
After page tables are initialized, paging can be enabled.

        // set page directory base pointer, physical address
        CR3 = swapper_pg_dir - __PAGE_OFFSET;
        // paging enabled!
        CR0 |= 0x80000000;      // set PG bit
        goto 1f;                // flush prefetch-queue
    1:
        EAX = &1f;              // address following the next instruction
        goto *(EAX);            // relocate EIP
    1:
        SS:ESP = *stack_start;
Page directory swapper_pg_dir (see definition in Section 6.5), together with page tables pg0 and pg1, defines that both linear addresses 0..8M-1 and 3G..3G+8M-1 are mapped to physical addresses 0..8M-1. We can access kernel symbols without "-__PAGE_OFFSET" from now on, because kernel space (which resides at linear addresses >=3G) will be correctly mapped to its physical addresses after paging is enabled.
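To convince yourself that this double mapping works, the boot page tables can be rebuilt and walked in user space. The following is a sketch only: build_boot_tables() and translate() are hypothetical names, the arrays stand in for pg0, pg1 and swapper_pg_dir, and the walk handles just these two tables with 4K pages.

```c
#include <stdint.h>

#define ENTRIES 1024
#define PTE_FLAGS 0x007u        /* PRESENT + RW + USER */

static uint32_t pg[2][ENTRIES]; /* pg[0] models pg0, pg[1] models pg1 */
static uint32_t pgd[ENTRIES];   /* models swapper_pg_dir */
static uint32_t pg0_phys_base;  /* physical address assumed for pg0 */

/* Fill pg0/pg1 with an 8M identity map and hook them into the directory
 * at entries 0,1 (linear 0..8M-1) and 768,769 (linear 3G..3G+8M-1). */
static void build_boot_tables(uint32_t pg0_phys)
{
    pg0_phys_base = pg0_phys;
    for (uint32_t i = 0; i < 2 * ENTRIES; i++)
        pg[i / ENTRIES][i % ENTRIES] = (i << 12) | PTE_FLAGS;
    for (uint32_t i = 0; i < ENTRIES; i++)
        pgd[i] = 0;
    pgd[0] = pgd[768] = pg0_phys | PTE_FLAGS;
    pgd[1] = pgd[769] = (pg0_phys + 0x1000u) | PTE_FLAGS;
}

/* Two-level walk: directory index = bits 31..22, table index = bits 21..12. */
static int translate(uint32_t linear, uint32_t *phys)
{
    uint32_t pde = pgd[linear >> 22];
    if (!(pde & 1u))
        return -1;                              /* not present */
    uint32_t table = ((pde & ~0xfffu) - pg0_phys_base) >> 12;
    if (table >= 2)
        return -1;                              /* not one of our tables */
    uint32_t pte = pg[table][(linear >> 12) & 0x3ffu];
    if (!(pte & 1u))
        return -1;
    *phys = (pte & ~0xfffu) | (linear & 0xfffu);
    return 0;
}
```

After build_boot_tables(0x102000), both translate(0x00102000, ...) and translate(0xC0102000, ...) resolve to physical 0x102000, while anything at or beyond 8M (or 3G+8M) is reported as unmapped.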
"lss stack_start,%esp" (SS:ESP = *stack_start) is the first example to reference a symbol without "-PAGE_OFFSET", which sets up a new stack. For BSP, the stack is at the end of init_task_union. For AP, stack_start.esp has been redefined by linux/arch/i386/kernel/smpboot.c:do_boot_cpu() to be "(void *) (1024 + PAGE_SIZE + (char *)idle)" in Section 8.2.
For paging mechanism and data structures, refer to IA-32 Manual Vol.3. (Ch.3.7. Page Translation Using 32-Bit Physical Addressing, Ch.9.8.3. Initializing Paging, Ch.9.9.1. Switching to Protected Mode, and Ch.18.26.3. Enabling and Disabling Paging).
6.2. Get Kernel Parameters
    #define OLD_CL_MAGIC_ADDR 0x90020
    #define OLD_CL_MAGIC 0xA33F
    #define OLD_CL_BASE_ADDR 0x90000
    #define OLD_CL_OFFSET 0x90022
    #define NEW_CL_POINTER 0x228    /* Relative to real mode data */

    #ifdef CONFIG_SMP
        if (BX) {
            EFLAGS = 0;             // AP clears EFLAGS
        } else
    #endif
        {
            // Initial CPU cleans BSS
            clear BSS;              // i.e. __bss_start .. _end
            setup_idt() {
                /* idt_table[256] defined in arch/i386/kernel/traps.c
                 * located in section .data.idt */
                EDX = ignore_int;                // handler address
                EAX = (__KERNEL_CS << 16) | DX;  // selector : offset 15..0
                DX = 0x8E00;    // interrupt gate, dpl = 0, present
                idt_table[0..255] = {EAX, EDX};
            }
            EFLAGS = 0;
            /*
             * Copy bootup parameters out of the way. First 2kB of
             * _empty_zero_page is for boot parameters, second 2kB
             * is for the command line.
             */
            move *ESI (real-mode header) to empty_zero_page, 2KB;
            clear empty_zero_page+2K, 2KB;
            ESI = empty_zero_page[NEW_CL_POINTER];
            if (!ESI) {             // 32-bit command line pointer
                if (OLD_CL_MAGIC==(uint16)[OLD_CL_MAGIC_ADDR]) {
                    ESI = OLD_CL_BASE_ADDR + (uint16)[OLD_CL_OFFSET];
                    move *ESI to empty_zero_page+2K, 2KB;
                }
            } else {                // valid in 2.02+
                move *ESI to empty_zero_page+2K, 2KB;
            }
        }
    }
For BSP, kernel parameters are copied from memory pointed by ESI to empty_zero_page. Kernel command line will be copied to empty_zero_page+2K if applicable.
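The {EAX, EDX} pair that setup_idt() stores into every idt_table[] slot is an interrupt gate: the handler offset is split into two 16-bit halves around the selector and the type bits. A C sketch of the packing follows, with hypothetical names (struct idt_gate, make_interrupt_gate(), gate_offset()) and __KERNEL_CS assumed to be 0x10 as in the GDT.

```c
#include <stdint.h>

#define KERNEL_CS_SKETCH 0x10u  /* assumed value of __KERNEL_CS */

struct idt_gate {
    uint32_t lo;    /* selector (31..16) : handler offset 15..0 */
    uint32_t hi;    /* handler offset 31..16 : type/attribute word */
};

/* Build a present, DPL 0, 32-bit interrupt gate (attribute word 0x8E00). */
static struct idt_gate make_interrupt_gate(uint32_t handler)
{
    struct idt_gate g;
    g.lo = (KERNEL_CS_SKETCH << 16) | (handler & 0xffffu);
    g.hi = (handler & 0xffff0000u) | 0x8E00u;
    return g;
}

/* Recover the handler address from a gate. */
static uint32_t gate_offset(struct idt_gate g)
{
    return (g.hi & 0xffff0000u) | (g.lo & 0xffffu);
}
```

With a made-up handler address of 0xc0100234, the gate becomes {0x00100234, 0xc0108e00}.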
6.3. Check CPU Type
Refer to IA-32 Manual Vol.1. (Ch.13. Processor Identification and Feature Determination) on how to identify processor type and processor features.
    struct cpuinfo_x86;                 // see include/asm-i386/processor.h
    struct cpuinfo_x86 boot_cpu_data;   // see arch/i386/kernel/setup.c
    #define CPU_PARAMS SYMBOL_NAME(boot_cpu_data)
    #define X86 CPU_PARAMS+0
    #define X86_VENDOR CPU_PARAMS+1
    #define X86_MODEL CPU_PARAMS+2
    #define X86_MASK CPU_PARAMS+3
    #define X86_HARD_MATH CPU_PARAMS+6
    #define X86_CPUID CPU_PARAMS+8
    #define X86_CAPABILITY CPU_PARAMS+12
    #define X86_VENDOR_ID CPU_PARAMS+28

    checkCPUtype:
    {
        X86_CPUID = -1;     // no CPUID
        X86 = 3;            // at least 386
        save original EFLAGS to ECX;
        flip AC bit (0x40000) in EFLAGS;
        if (AC bit not changed) goto is386;

        X86 = 4;            // at least 486
        flip ID bit (0x200000) in EFLAGS;
        restore original EFLAGS;    // for AC & ID flags
        if (ID bit can not be changed) goto is486;

        // get CPU info
        CPUID(EAX=0);
        X86_CPUID = EAX;
        X86_VENDOR_ID = {EBX, EDX, ECX};
        if (!EAX) goto is486;

        CPUID(EAX=1);
        CL = AL;
        X86 = AH & 0x0f;                // family
        X86_MODEL = (AL & 0xf0) >> 4;   // model
        X86_MASK = CL & 0x0f;           // stepping id
        X86_CAPABILITY = EDX;           // feature
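The field extraction after CPUID(EAX=1) can be tried out on any signature value. Below is a small C sketch of the same masks and shifts; struct cpu_sig and decode_cpuid_eax() are hypothetical names, and the sample signatures are only illustrative inputs.

```c
#include <stdint.h>

struct cpu_sig {
    uint8_t family;     /* stored in X86 */
    uint8_t model;      /* stored in X86_MODEL */
    uint8_t stepping;   /* stored in X86_MASK */
};

/* Split the low 16 bits of the CPUID(1) EAX value the way head.S does. */
static struct cpu_sig decode_cpuid_eax(uint32_t eax)
{
    uint8_t al = (uint8_t)(eax & 0xffu);
    uint8_t ah = (uint8_t)((eax >> 8) & 0xffu);
    struct cpu_sig s;

    s.family = ah & 0x0f;
    s.model = (uint8_t)((al & 0xf0) >> 4);
    s.stepping = al & 0x0f;
    return s;
}
```

For example, a signature of 0x00000686 decodes to family 6, model 8, stepping 6.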
Refer to IA-32 Manual Vol.3. (Ch.9.2. x87 FPU Initialization, and Ch.18.14. x87 FPU) on how to set up the x87 FPU.
    is486:
        // save PG, PE, ET and set AM, WP, NE, MP
        EAX = (CR0 & 0x80000011) | 0x50022;
        goto 2f;        // skip "is386:" processing
    is386:
        restore original EFLAGS from ECX;
        // save PG, PE, ET and set MP
        EAX = (CR0 & 0x80000011) | 0x02;

        /* ET: Extension Type (bit 4 of CR0).
         * In the Intel 386 and Intel 486 processors, this flag indicates
         * support of Intel 387 DX math coprocessor instructions when set.
         * In the Pentium 4, Intel Xeon, and P6 family processors,
         * this flag is hardcoded to 1.
         * -- IA-32 Manual Vol.3. Ch.2.5. Control Registers (p.2-14) */
    2:  CR0 = EAX;
        check_x87() {
            /* We depend on ET to be correct.
             * This checks for 287/387. */
            X86_HARD_MATH = 0;
            clts;           // CR0.TS = 0;
            fninit;         // Init FPU;
            fstsw AX;       // AX = FPU status word;
            if (AL) {
                CR0 ^= 0x04;    // no coprocessor, set EM
            } else {
                ALIGN
    1:          X86_HARD_MATH = 1;
                /* IA-32 Manual Vol.3. Ch.18.14.7.14. FSETPM Instruction
                 * inform 287 that processor is in protected mode
                 * 287 only, ignored by 387 */
                fsetpm;
            }
        }
    }
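The magic numbers 0x80000011, 0x50022 and 0x02 are just OR-ed CR0 flag bits. The C sketch below spells them out; the CR0_* macro names and the two helper functions are my own labels, with bit positions taken from the IA-32 manual.

```c
#include <stdint.h>

#define CR0_PE 0x00000001u  /* Protection Enable */
#define CR0_MP 0x00000002u  /* Monitor coProcessor */
#define CR0_EM 0x00000004u  /* EMulation (no coprocessor) */
#define CR0_ET 0x00000010u  /* Extension Type */
#define CR0_NE 0x00000020u  /* Numeric Error */
#define CR0_WP 0x00010000u  /* Write Protect */
#define CR0_AM 0x00040000u  /* Alignment Mask */
#define CR0_PG 0x80000000u  /* PaGing */

/* is486 path: keep PG, PE, ET and set AM, WP, NE, MP. */
static uint32_t cr0_for_486(uint32_t cr0)
{
    return (cr0 & (CR0_PG | CR0_ET | CR0_PE)) |
           (CR0_AM | CR0_WP | CR0_NE | CR0_MP);
}

/* is386 path: keep PG, PE, ET and set MP only. */
static uint32_t cr0_for_386(uint32_t cr0)
{
    return (cr0 & (CR0_PG | CR0_ET | CR0_PE)) | CR0_MP;
}
```

Starting from CR0 = 0x80000011 (paging and protected mode on, ET set), the 486 path yields 0x80050033 and the 386 path yields 0x80000013.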
Macro ALIGN, defined in linux/include/linux/linkage.h, specifies 16-byte alignment and a fill value of 0x90 (the opcode for NOP). See also
Using as: Assembler Directives for the meaning of directive .align.
6.4. Go Start Kernel
    ready:  .byte 0;        // global variable
    {
        ready++;            // how many CPUs are ready
        lgdt gdt_descr;     // use new descriptor table in safe place
        lidt idt_descr;
        goto __KERNEL_CS:$1f;   // reload segment registers after "lgdt"
    1:  DS = ES = FS = GS = __KERNEL_DS;
    #ifdef CONFIG_SMP
        SS = __KERNEL_DS;       // reload segment only
    #else
        SS:ESP = *stack_start;  /* end of init_task_union, defined
                                 * in linux/arch/i386/kernel/init_task.c */
    #endif
        EAX = 0;
        lldt AX;
        cld;
    #ifdef CONFIG_SMP
        if (1!=ready) {         // not first CPU
            initialize_secondary();
                // see linux/arch/i386/kernel/smpboot.c
        } else
    #endif
        {
            start_kernel();     // see linux/init/main.c
        }
    L6: goto L6;
    }
The first CPU (BSP) will call linux/init/main.c:start_kernel() and the others (APs) will call linux/arch/i386/kernel/smpboot.c:initialize_secondary(). See start_kernel() in Section 7 and initialize_secondary() in Section 8.4.
init_task_union happens to be the task struct for the first process, the "idle" process (pid=0), whose stack grows from the tail of init_task_union. The following is the code related to init_task_union:

    ENTRY(stack_start)
        .long init_task_union+8192;
        .long __KERNEL_DS;

    #ifndef INIT_TASK_SIZE
    # define INIT_TASK_SIZE 2048*sizeof(long)
    #endif

    union task_union {
        struct task_struct task;
        unsigned long stack[INIT_TASK_SIZE/sizeof(long)];
    };

    /* INIT_TASK is used to set up the first task table, touch at
     * your own risk! Base=0, limit=0x1fffff (=2MB) */
    union task_union init_task_union
        __attribute__((__section__(".data.init_task"))) =
            { INIT_TASK(init_task_union.task) };
init_task_union is for BSP "idle" process. Don't confuse it with "init" process, which will be mentioned in Section 7.2.
6.5. Miscellaneous
| ///////////////////////////////////////////////////////////////////////////////// default interrupt "handler"ignore_int() { printk("Unknown interrupt\n"); iret; }/* * The interrupt descriptor table has room for 256 idt's, * the global descriptor table is dependent on the number * of tasks we can have.. */#define IDT_ENTRIES 256#define GDT_ENTRIES (__TSS(NR_CPUS)).globl SYMBOL_NAME(idt).globl SYMBOL_NAME(gdt) ALIGN .word 0idt_descr: .word IDT_ENTRIES*8-1 # idt contains 256 entriesSYMBOL_NAME(idt): .long SYMBOL_NAME(idt_table) .word 0gdt_descr: .word GDT_ENTRIES*8-1SYMBOL_NAME(gdt): .long SYMBOL_NAME(gdt_table)/* * This is initialized to create an identity-mapping at 0-8M (for bootup * purposes) and another mapping of the 0-8M area at virtual address * PAGE_OFFSET. */.org 0x1000ENTRY(swapper_pg_dir) // "ENTRY" defined in linux/include/linux/linkage.h .long 0x00102007 .long 0x00103007 .fill BOOT_USER_PGD_PTRS-2,4,0 /* default: 766 entries */ .long 0x00102007 .long 0x00103007 /* default: 254 entries */ .fill BOOT_KERNEL_PGD_PTRS-2,4,0/* * The page tables are initialized to only 8MB here - the final page * tables are set up later depending on memory size. */.org 0x2000ENTRY(pg0).org 0x3000ENTRY(pg1)/* * empty_zero_page must immediately follow the page tables ! (The * initialization loop counts until empty_zero_page) */.org 0x4000ENTRY(empty_zero_page)/* * Real beginning of normal "text" segment */.org 0x5000ENTRY(stext)ENTRY(_stext)////////////////////////////////////////////////////////////////////////////////* * This starts the data section. Note that the above is all * in the text section because it has alignment requirements * that we cannot fulfill any other way. */.dataALIGN/* * This contains typically 140 quadwords, depending on NR_CPUS. * * NOTE! Make sure the gdt descriptor in head.S matches this if you * change anything. 
*/ENTRY(gdt_table) .quad 0x0000000000000000 /* NULL descriptor */ .quad 0x0000000000000000 /* not used */ .quad 0x00cf9a000000ffff /* 0x10 kernel 4GB code at 0x00000000 */ .quad 0x00cf92000000ffff /* 0x18 kernel 4GB data at 0x00000000 */ .quad 0x00cffa000000ffff /* 0x23 user 4GB code at 0x00000000 */ .quad 0x00cff2000000ffff /* 0x2b user 4GB data at 0x00000000 */ .quad 0x0000000000000000 /* not used */ .quad 0x0000000000000000 /* not used */ /* * The APM segments have byte granularity and their bases * and limits are set at run time. */ .quad 0x0040920000000000 /* 0x40 APM set up for bad BIOS's */ .quad 0x00409a0000000000 /* 0x48 APM CS code */ .quad 0x00009a0000000000 /* 0x50 APM CS 16 code (16 bit) */ .quad 0x0040920000000000 /* 0x58 APM DS data */ .fill NR_CPUS*4,8,0 /* space for TSS's and LDT's */ |
Macro ALIGN, placed before idt_descr and gdt_table, is there for performance reasons.
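The .quad values in gdt_table are easier to read once they are split into the standard IA-32 segment descriptor fields. Here is a decoding sketch in C; the desc_*() helper names are hypothetical, and the bit positions follow the descriptor layout in IA-32 Manual Vol.3.

```c
#include <stdint.h>

/* Base address: descriptor bits 16..39 and 56..63. */
static uint32_t desc_base(uint64_t d)
{
    return (uint32_t)((d >> 16) & 0xffffffu) |
           ((uint32_t)((d >> 56) & 0xffu) << 24);
}

/* Raw 20-bit limit: descriptor bits 0..15 and 48..51. */
static uint32_t desc_limit(uint64_t d)
{
    return (uint32_t)(d & 0xffffu) | ((uint32_t)((d >> 48) & 0xfu) << 16);
}

/* G flag (bit 55): 1 means the limit counts 4K pages instead of bytes. */
static int desc_granular(uint64_t d)
{
    return (int)((d >> 55) & 1u);
}

/* Descriptor privilege level (bits 45..46). */
static int desc_dpl(uint64_t d)
{
    return (int)((d >> 45) & 3u);
}

/* Bit 43 of a code/data descriptor: 1 = code segment, 0 = data segment. */
static int desc_is_code(uint64_t d)
{
    return (int)((d >> 43) & 1u);
}

/* Segment size in bytes, taking the granularity flag into account. */
static uint64_t desc_size_bytes(uint64_t d)
{
    uint64_t n = (uint64_t)desc_limit(d) + 1;
    return desc_granular(d) ? n << 12 : n;
}
```

Applied to 0x00cf9a000000ffff (selector 0x10), this gives base 0, a 4GB size, DPL 0 and a code segment; 0x00cff2000000ffff (selector 0x2b) decodes to a 4GB data segment with DPL 3, which matches the comments in the table.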
6.6. Reference
7. linux/init/main.c
I felt guilty writing this chapter as there are too many documents about it, if not more than enough. start_kernel() supporting functions are changed from version to version, as they depend on OS component internals, which are being improved all the time. I may not have the time for frequent document updates, so I decided to keep this chapter as simple as possible.
7.1. start_kernel()
| ///////////////////////////////////////////////////////////////////////////////asmlinkage void __init start_kernel(void){ char * command_line; extern char saved_command_line[];/* * Interrupts are still disabled. Do necessary setups, then enable them */ lock_kernel(); printk(linux_banner); /* Memory Management in Linux, esp. for setup_arch() * Linux-2.4.4 MM Initialization */ setup_arch(&command_line); printk("Kernel command line: %s\n", saved_command_line); /* linux/Documentation/kernel-parameters.txt * The Linux BootPrompt-HowTo */ parse_options(command_line); trap_init() {#ifdef CONFIG_EISA if (isa_readl(0x0FFFD9) == 'E'+('I'<<8)+('S'<<16)+('A'<<24)) EISA_bus = 1;#endif#ifdef CONFIG_X86_LOCAL_APIC init_apic_mappings();#endif set_xxxx_gate(x, &func); // setup gates cpu_init(); } init_IRQ(); sched_init(); softirq_init() { for (int i=0; i<32: i++) tasklet_init(bh_task_vec+i, bh_action, i); open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL); open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL); } time_init(); /* * HACK ALERT! This is early. We're enabling the console before * we've done PCI setups etc, and console_init() must be aware of * this. But we do want output early, in case something goes wrong. 
*/ console_init();#ifdef CONFIG_MODULES init_modules();#endif if (prof_shift) { unsigned int size; /* only text is profiled */ prof_len = (unsigned long) &_etext - (unsigned long) &_stext; prof_len >>= prof_shift; size = prof_len * sizeof(unsigned int) + PAGE_SIZE-1; prof_buffer = (unsigned int *) alloc_bootmem(size); } kmem_cache_init(); sti(); // BogoMips mini-Howto calibrate_delay(); // linux/Documentation/initrd.txt#ifdef CONFIG_BLK_DEV_INITRD if (initrd_start && !initrd_below_start_ok && initrd_start < min_low_pfn << PAGE_SHIFT) { printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - " "disabling it.\n",initrd_start,min_low_pfn << PAGE_SHIFT); initrd_start = 0; }#endif mem_init(); kmem_cache_sizes_init(); pgtable_cache_init(); /* * For architectures that have highmem, num_mappedpages represents * the amount of memory the kernel can use. For other architectures * it's the same as the total pages. We need both numbers because * some subsystems need to initialize based on how much memory the * kernel can use. */ if (num_mappedpages == 0) num_mappedpages = num_physpages; fork_init(num_mempages); proc_caches_init(); vfs_caches_init(num_physpages); buffer_init(num_physpages); page_cache_init(num_physpages);#if defined(CONFIG_ARCH_S390) ccwcache_init();#endif signals_init();#ifdef CONFIG_PROC_FS proc_root_init();#endif#if defined(CONFIG_SYSVIPC) ipc_init();#endif check_bugs(); printk("POSIX conformance testing by UNIFIX\n"); /* * We count on the initial thread going ok * Like idlers init is an unlocked kernel thread, which will * make syscalls (and thus be locked). */ smp_init() {#ifndef CONFIG_SMP# ifdef CONFIG_X86_LOCAL_APIC APIC_init_uniprocessor();# else do { } while (0);# endif#else /* Check Section 8.2. */#endif } rest_init() { // init process, pid = 1 kernel_thread(init, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL); unlock_kernel(); current->need_resched = 1; // idle process, pid = 0 cpu_idle(); // never return }} |
start_kernel() calls rest_init() to spawn the "init" process and then becomes the "idle" process itself.
7.2. init()
"Init" process:| ///////////////////////////////////////////////////////////////////////////////static int init(void * unused){ lock_kernel(); do_basic_setup(); prepare_namespace(); /* * Ok, we have completed the initial bootup, and * we're essentially up and running. Get rid of the * initmem segments and start the user-mode stuff.. */ free_initmem(); unlock_kernel(); if (open("/dev/console", O_RDWR, 0) < 0) // stdin printk("Warning: unable to open an initial console.\n"); (void) dup(0); // stdout (void) dup(0); // stderr /* * We try each of these until one succeeds. * * The Bourne shell can be used instead of init if we are * trying to recover a really broken machine. */ if (execute_command) execve(execute_command,argv_init,envp_init); execve("/sbin/init",argv_init,envp_init); execve("/etc/init",argv_init,envp_init); execve("/bin/init",argv_init,envp_init); execve("/bin/sh",argv_init,envp_init); panic("No init found. Try passing init= option to kernel.");} |
Refer to "man init" or SysVinit for further information on user-mode "init" process.
7.3. cpu_idle()
"Idle" process:| /* * The idle thread. There's no useful work to be * done, so just try to conserve power and have a * low exit latency (ie sit in a loop waiting for * somebody to say that they'd like to reschedule) */void cpu_idle (void){ /* endless idle loop with no priority at all */ init_idle(); current->nice = 20; current->counter = -100; while (1) { void (*idle)(void) = pm_idle; if (!idle) idle = default_idle; while (!current->need_resched) idle(); schedule(); check_pgt_cache(); }}///////////////////////////////////////////////////////////////////////////////void __init init_idle(void){ struct schedule_data * sched_data; sched_data = &aligned_data[smp_processor_id()].schedule_data; if (current != &init_task && task_on_runqueue(current)) { printk("UGH! (%d:%d) was on the runqueue, removing.\n", smp_processor_id(), current->pid); del_from_runqueue(current); } sched_data->curr = current; sched_data->last_schedule = get_cycles(); clear_bit(current->processor, &wait_init_idle);}///////////////////////////////////////////////////////////////////////////////void default_idle(void){ if (current_cpu_data.hlt_works_ok && !hlt_counter) { __cli(); if (!current->need_resched) safe_halt(); else __sti(); }}/* defined in linux/include/asm-i386/system.h */#define __cli() __asm__ __volatile__("cli": : :"memory")#define __sti() __asm__ __volatile__("sti": : :"memory")/* used in the idle loop; sti takes one instruction cycle to complete */#define safe_halt() __asm__ __volatile__("sti; hlt": : :"memory") |
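When an interrupt arrives, the CPU leaves the halt state, runs the interrupt handler, and then resumes execution with the instruction following "hlt". safe_halt() issues "sti; hlt" back to back because "sti" takes effect only after the next instruction, so no interrupt can be delivered between enabling interrupts and halting.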
7.4. Reference
8. SMP Boot
There are a few SMP-related macros, such as CONFIG_SMP, CONFIG_X86_LOCAL_APIC, CONFIG_X86_IO_APIC, CONFIG_MULTIQUAD and CONFIG_VISWS. I will ignore code that requires CONFIG_MULTIQUAD or CONFIG_VISWS, which most people do not care about unless they use an IBM high-end multiprocessor server or an SGI Visual Workstation.
The BSP (bootstrap processor) executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> wakeup_secondary_via_INIT() to trigger the APs (application processors). Check the MultiProcessor Specification and IA-32 Manual Vol.3 (Ch.7. Multiple-Processor Management, and Ch.8. Advanced Programmable Interrupt Controller) for technical details.
8.1. Before smp_init()
Before calling smp_init(), start_kernel() has already prepared the SMP environment:

start_kernel()
|-- setup_arch()
|   |-- parse_cmdline_early();  // SMP looks for "noht" and "acpismp=force"
|   |   `-- /* "noht" disables HyperThreading (2 logical cpus per Xeon) */
|   |       if (!memcmp(from, "noht", 4)) {
|   |           disable_x86_ht = 1;
|   |           set_bit(X86_FEATURE_HT, disabled_x86_caps);
|   |       }
|   |       /* "acpismp=force" forces parsing and use of the ACPI SMP table */
|   |       else if (!memcmp(from, "acpismp=force", 13))
|   |           enable_acpi_smp_table = 1;
|   |-- setup_memory();  // reserve memory for MP configuration table
|   |   |-- reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
|   |   `-- find_smp_config();
|   |       `-- find_intel_smp();
|   |           `-- smp_scan_config();
|   |               |-- set flag smp_found_config
|   |               |-- set MP floating pointer mpf_found
|   |               `-- reserve_bootmem(mpf_found, PAGE_SIZE);
|   |-- if (disable_x86_ht) {  // if HyperThreading feature disabled
|   |       clear_bit(X86_FEATURE_HT, &boot_cpu_data.x86_capability[0]);
|   |       set_bit(X86_FEATURE_HT, disabled_x86_caps);
|   |       enable_acpi_smp_table = 0;
|   |   }
|   |-- if (test_bit(X86_FEATURE_HT, &boot_cpu_data.x86_capability[0]))
|   |       enable_acpi_smp_table = 1;
|   |-- smp_alloc_memory();
|   |   `-- /* reserve AP processor's real-mode code space in low memory */
|   |       trampoline_base = (void *) alloc_bootmem_low_pages(PAGE_SIZE);
|   `-- get_smp_config();  /* get boot-time MP configuration */
|       |-- config_acpi_tables();
|       |   |-- memset(&acpi_boot_ops, 0, sizeof(acpi_boot_ops));
|       |   |-- acpi_boot_ops[ACPI_APIC] = acpi_parse_madt;
|       |   `-- /* Set have_acpi_tables to indicate using
|       |        * MADT in the ACPI tables; Use MPS tables if failed. */
|       |       if (enable_acpi_smp_table && !acpi_tables_init())
|       |           have_acpi_tables = 1;
|       |-- set pic_mode
|       |   /* =1, if the IMCR is present and PIC Mode is implemented;
|       |    * =0, otherwise Virtual Wire Mode is implemented. */
|       |-- save local APIC address in mp_lapic_addr
|       `-- scan for MP configuration table entries, like
|           MP_PROCESSOR, MP_BUS, MP_IOAPIC, MP_INTSRC and MP_LINTSRC.
|-- trap_init();
|   `-- init_apic_mappings();  // setup PTE for APIC
|       |-- /* If no local APIC can be found then set up a fake all
|       |    * zeroes page to simulate the local APIC and another
|       |    * one for the IO-APIC. */
|       |   if (!smp_found_config && detect_init_APIC()) {
|       |       apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE);
|       |       apic_phys = __pa(apic_phys);
|       |   } else
|       |       apic_phys = mp_lapic_addr;
|       |-- /* map local APIC address,
|       |    * mp_lapic_addr (0xfee00000) in most cases,
|       |    * to linear address FIXADDR_TOP (0xffffe000) */
|       |   set_fixmap_nocache(FIX_APIC_BASE, apic_phys);
|       |-- /* Fetch the APIC ID of the BSP in case we have a
|       |    * default configuration (or the MP table is broken). */
|       |   if (boot_cpu_physical_apicid == -1U)
|       |       boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
|       `-- // map IOAPIC address to uncacheable linear address
|           set_fixmap_nocache(idx, ioapic_phys);
|           // Now we can use linear address to access APIC space.
|-- init_IRQ();
|   |-- init_ISA_irqs();
|   |   |-- /* An initial setup of the virtual wire mode. */
|   |   |   init_bsp_APIC();
|   |   `-- init_8259A(auto_eoi=0);
|   `-- setup SMP/APIC interrupt handlers, esp. IPI.
`-- mem_init();
    `-- /* delay zapping low mapping entries for SMP: zap_low_mappings() */
An IPI (InterProcessor Interrupt), a CPU-to-CPU interrupt sent through the local APIC, is the mechanism the BSP uses to trigger the APs.
Be aware that "one local APIC per CPU is required" in an MP-compliant system. Processors do not share the local APIC address space (physical address 0xFEE00000 - 0xFEEFFFFF): each CPU sees its own local APIC there. They do share the I/O APIC address space (0xFEC00000 - 0xFECFFFFF). Both address spaces are uncacheable.
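The two address windows above can be told apart with a simple range check. The following is only an illustrative sketch: the enum and function names are hypothetical and do not come from the kernel source; only the address ranges are taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Physical address windows quoted in the text above. */
#define LOCAL_APIC_START 0xFEE00000u
#define LOCAL_APIC_END   0xFEEFFFFFu
#define IO_APIC_START    0xFEC00000u
#define IO_APIC_END      0xFECFFFFFu

/* Hypothetical names, for illustration only. */
enum apic_region { REGION_OTHER, REGION_LOCAL_APIC, REGION_IO_APIC };

/* Decide which APIC window, if any, a physical address falls into. */
enum apic_region classify_apic_address(uint32_t phys)
{
    if (phys >= LOCAL_APIC_START && phys <= LOCAL_APIC_END)
        return REGION_LOCAL_APIC;   /* private: each CPU sees its own unit */
    if (phys >= IO_APIC_START && phys <= IO_APIC_END)
        return REGION_IO_APIC;      /* shared by all CPUs */
    return REGION_OTHER;
}

/* Offset of an address inside its 1 MB APIC window. */
uint32_t apic_window_offset(uint32_t phys)
{
    assert(classify_apic_address(phys) != REGION_OTHER);
    return phys & 0x000FFFFFu;
}
```

For example, classify_apic_address(0xFEE00020u) reports the local APIC window, and apic_window_offset() returns 0x20 for the same address.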
8.2. smp_init()
The BSP calls start_kernel() -> smp_init() -> smp_boot_cpus() to set up data structures for each CPU and activate the remaining APs.

///////////////////////////////////////////////////////////////////////////////
static void __init smp_init(void)
{
    /* Get other processors into their bootup holding patterns. */
    smp_boot_cpus();
    wait_init_idle = cpu_online_map;
    clear_bit(current->processor, &wait_init_idle); /* Don't wait on me! */

    smp_threads_ready=1;
    smp_commence() {
        /* Lets the callins below out of their loop. */
        Dprintk("Setting commenced=1, go go go\n");
        wmb();
        atomic_set(&smp_commenced,1);
    }

    /* Wait for the other cpus to set up their idle processes */
    printk("Waiting on wait_init_idle (map = 0x%lx)\n", wait_init_idle);
    while (wait_init_idle) {
        cpu_relax();    // i.e. "rep;nop"
        barrier();
    }
    printk("All processors have done init_idle\n");
}

///////////////////////////////////////////////////////////////////////////////
void __init smp_boot_cpus(void)
{
    // ... something not very interesting :-)

    /* Initialize the logical to physical CPU number mapping
     * and the per-CPU profiling router/multiplier */
    prof_counter[0..NR_CPUS-1] = 0;
    prof_old_multiplier[0..NR_CPUS-1] = 0;
    prof_multiplier[0..NR_CPUS-1] = 0;

    init_cpu_to_apicid() {
        physical_apicid_2_cpu[0..MAX_APICID-1] = -1;
        logical_apicid_2_cpu[0..MAX_APICID-1] = -1;
        cpu_2_physical_apicid[0..NR_CPUS-1] = 0;
        cpu_2_logical_apicid[0..NR_CPUS-1] = 0;
    }

    /* Setup boot CPU information */
    smp_store_cpu_info(0); /* Final full version of the data */
    printk("CPU%d: ", 0);
    print_cpu_info(&cpu_data[0]);

    /* We have the boot CPU online for sure. */
    set_bit(0, &cpu_online_map);
    boot_cpu_logical_apicid = logical_smp_processor_id() {
        GET_APIC_LOGICAL_ID(*(unsigned long *)(APIC_BASE+APIC_LDR));
    }
    map_cpu_to_boot_apicid(0, boot_cpu_apicid) {
        physical_apicid_2_cpu[boot_cpu_apicid] = 0;
        cpu_2_physical_apicid[0] = boot_cpu_apicid;
    }

    global_irq_holder = 0;
    current->processor = 0;
    init_idle();    // will clear corresponding bit in wait_init_idle
    smp_tune_scheduling();

    // ... some conditions checked

    connect_bsp_APIC();    // enable APIC mode if used to be PIC mode
    setup_local_APIC();

    if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_physical_apicid)
        BUG();

    /* Scan the CPU present map and fire up the other CPUs
     * via do_boot_cpu() */
    Dprintk("CPU present map: %lx\n", phys_cpu_present_map);
    for (bit = 0; bit < NR_CPUS; bit++) {
        apicid = cpu_present_to_apicid(bit);

        /* Don't even attempt to start the boot CPU! */
        if (apicid == boot_cpu_apicid)
            continue;

        if (!(phys_cpu_present_map & (1 << bit)))
            continue;
        if ((max_cpus >= 0) && (max_cpus <= cpucount+1))
            continue;

        do_boot_cpu(apicid);

        /* Make sure we unmap all failed CPUs */
        if ((boot_apicid_to_cpu(apicid) == -1) &&
            (phys_cpu_present_map & (1 << bit)))
            printk("CPU #%d not responding - cannot use it.\n", apicid);
    }

    // ... SMP BogoMIPS
    // ... B stepping processor warning
    // ... HyperThreading handling

    /* Set up all local APIC timers in the system */
    setup_APIC_clocks();

    /* Synchronize the TSC with the AP */
    if (cpu_has_tsc && cpucount)
        synchronize_tsc_bp();

smp_done:
    zap_low_mappings();
}

///////////////////////////////////////////////////////////////////////////////
static void __init do_boot_cpu (int apicid)
{
    cpu = ++cpucount;

    // 1. prepare "idle process" task struct for next AP

    /* We can't use kernel_thread since we must avoid to
     * reschedule the child. */
    if (fork_by_hand() < 0)
        panic("failed fork for CPU %d", cpu);
    /* We remove it from the pidhash and the runqueue
     * once we got the process: */
    idle = init_task.prev_task;
    if (!idle)
        panic("No idle process for CPU %d", cpu);

    /* we schedule the first task manually */
    idle->processor = cpu;
    idle->cpus_runnable = 1 << cpu;    // only on this AP!

    map_cpu_to_boot_apicid(cpu, apicid) {
        physical_apicid_2_cpu[apicid] = cpu;
        cpu_2_physical_apicid[cpu] = apicid;
    }

    idle->thread.eip = (unsigned long) start_secondary;

    del_from_runqueue(idle);
    unhash_process(idle);
    init_tasks[cpu] = idle;

    // 2. prepare stack and code (CS:IP) for next AP

    /* start_eip had better be page-aligned! */
    start_eip = setup_trampoline() {
        memcpy(trampoline_base, trampoline_data,
            trampoline_end - trampoline_data);
        /* trampoline_base was reserved in
         * start_kernel() -> setup_arch() -> smp_alloc_memory(),
         * and will be shared by all APs (one by one) */
        return virt_to_phys(trampoline_base);
    }

    /* So we see what's up */
    printk("Booting processor %d/%d eip %lx\n", cpu, apicid, start_eip);
    stack_start.esp = (void *) (1024 + PAGE_SIZE + (char *)idle);
    /* this value is used by next AP when it executes
     * "lss stack_start,%esp" in
     * linux/arch/i386/kernel/head.S:startup_32(). */

    /* This grunge runs the startup process for
     * the targeted processor. */
    atomic_set(&init_deasserted, 0);
    Dprintk("Setting warm reset code and vector.\n");

    CMOS_WRITE(0xa, 0xf);
    local_flush_tlb();
    Dprintk("1.\n");
    *((volatile unsigned short *) TRAMPOLINE_HIGH) = start_eip >> 4;
    Dprintk("2.\n");
    *((volatile unsigned short *) TRAMPOLINE_LOW) = start_eip & 0xf;
    Dprintk("3.\n");
    // we have setup 0:467 to start_eip (trampoline_base)

    // 3. kick AP to run (AP gets CS:IP from 0:467)

    // Starting actual IPI sequence...
    boot_error = wakeup_secondary_via_INIT(apicid, start_eip);

    if (!boot_error) {    // looks OK
        /* allow APs to start initializing. */
        set_bit(cpu, &cpu_callout_map);

        /* ... Wait 5s total for a response */

        // bit cpu in cpu_callin_map is set by AP in smp_callin()
        if (test_bit(cpu, &cpu_callin_map)) {
            print_cpu_info(&cpu_data[cpu]);
        } else {
            boot_error= 1;
            // marker 0xA5 set by AP in trampoline_data()
            if (*((volatile unsigned char *)phys_to_virt(8192)) == 0xA5)
                /* trampoline started but... */
                printk("Stuck ??\n");
            else
                /* trampoline code not run */
                printk("Not responding.\n");
        }
    }
    if (boot_error) {
        /* Try to put things back the way they were before ... */
        unmap_cpu_to_boot_apicid(cpu, apicid);
        clear_bit(cpu, &cpu_callout_map); /* set in do_boot_cpu() */
        clear_bit(cpu, &cpu_initialized); /* set in cpu_init() */
        clear_bit(cpu, &cpu_online_map); /* set in smp_callin() */
        cpucount--;
    }

    /* mark "stuck" area as not stuck */
    *((volatile unsigned long *)phys_to_virt(8192)) = 0;
}
Don't confuse start_secondary() with trampoline_data(). The former is the EIP value stored in the task struct of the AP's "idle" process, and the latter is the real-mode code that the AP runs after the BSP kicks it (using wakeup_secondary_via_INIT()).
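In do_boot_cpu(), the BSP stores start_eip as a real-mode segment:offset pair (start_eip >> 4 and start_eip & 0xf) in the warm-reset vector at 0:467. The arithmetic can be sketched as follows. The struct and function names are hypothetical, and the sample address 0x9F000 used below is only an assumed value for trampoline_base, not one taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode far pointer as the AP will read it (hypothetical type). */
struct far_ptr {
    uint16_t segment;
    uint16_t offset;
};

/* Split a physical address the way do_boot_cpu() does. */
struct far_ptr warm_reset_vector(uint32_t start_eip)
{
    struct far_ptr v;

    /* Real mode can only reach the first megabyte. */
    assert(start_eip < 0x100000u);
    v.segment = (uint16_t)(start_eip >> 4);
    v.offset  = (uint16_t)(start_eip & 0xf);
    return v;
}

/* Real-mode address translation: segment * 16 + offset. */
uint32_t far_ptr_to_linear(struct far_ptr v)
{
    return ((uint32_t)v.segment << 4) + v.offset;
}
```

Because start_eip is page-aligned, the offset part is always 0, and far_ptr_to_linear(warm_reset_vector(addr)) gives back addr unchanged.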
8.3. linux/arch/i386/kernel/trampoline.S
This file contains the 16-bit real-mode AP startup code. The BSP reserved the memory at trampoline_base in start_kernel() -> setup_arch() -> smp_alloc_memory(). Before the BSP triggers an AP, it copies the trampoline code, between trampoline_data and trampoline_end, to trampoline_base (in do_boot_cpu() -> setup_trampoline()). The BSP also sets the warm-reset vector at 0:467 to point to trampoline_base, so that the AP starts executing there.
///////////////////////////////////////////////////////////////////////////////
trampoline_data()
{
r_base:
    wbinvd;    // Needed for NUMA-Q should be harmless for other
    DS = CS;
    BX = 1;    // Flag an SMP trampoline
    cli;

    // write marker for master knows we're running
    trampoline_base = 0xA5A5A5A5;

    lidt idt_48;
    lgdt gdt_48;

    AX = 1;
    lmsw AX;    // protected mode!
    goto flush_instr;
flush_instr:
    goto CS:100000;    // see linux/arch/i386/kernel/head.S:startup_32()
}

idt_48:
    .word 0                          # idt limit = 0
    .word 0, 0                       # idt base = 0L

gdt_48:
    .word 0x0800                     # gdt limit = 2048, 256 GDT entries
    .long gdt_table-__PAGE_OFFSET    # gdt base = gdt (first SMP CPU)

.globl SYMBOL_NAME(trampoline_end)
SYMBOL_NAME_LABEL(trampoline_end)
Note that BX=1 when an AP jumps to linux/arch/i386/kernel/head.S:startup_32(), whereas the BSP arrives there with BX=0. See Section 6.
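The gdt_48 operand in the trampoline stores gdt_table-__PAGE_OFFSET rather than the symbol address itself: paging is not yet enabled when the AP executes lgdt, so it needs the physical address. A minimal sketch of that subtraction and of the 6-byte lgdt operand follows. It assumes __PAGE_OFFSET is 0xC0000000 (the usual i386 default); the function names and the sample gdt_table address are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the common 3 GB / 1 GB split, i.e. __PAGE_OFFSET = 0xC0000000. */
#define PAGE_OFFSET_ASSUMED 0xC0000000u

/* Kernel virtual address to physical address (directly mapped region only). */
uint32_t virt_to_phys_sketch(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET_ASSUMED);
    return vaddr - PAGE_OFFSET_ASSUMED;
}

/*
 * Build the 48-bit lgdt operand as a number: the 16-bit limit occupies the
 * low bits and the 32-bit base sits above it, which matches the in-memory
 * order ".word limit; .long base" on a little-endian CPU.
 */
uint64_t gdt_pseudo_descriptor(uint16_t limit, uint32_t base)
{
    return (uint64_t)limit | ((uint64_t)base << 16);
}
```

With a hypothetical gdt_table at virtual address 0xC0200000, virt_to_phys_sketch() yields 0x00200000, and that value is what would be placed in the base field together with the limit 0x0800 from the listing.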
8.4. initialize_secondary()
Unlike BSP, at the end of linux/arch/i386/kernel/head.S:startup_32() in Section 6.4, AP will call initialize_secondary() instead of start_kernel().
/* Everything has been set up for the secondary
 * CPUs - they just need to reload everything
 * from the task structure
 * This function must not return. */
void __init initialize_secondary(void)
{
    /* We don't actually need to load the full TSS,
     * basically just the stack pointer and the eip. */
    asm volatile(
        "movl %0,%%esp\n\t"
        "jmp *%1"
        :
        :"r" (current->thread.esp),"r" (current->thread.eip));
}
Because the BSP set thread.eip to start_secondary() in do_boot_cpu(), the jump above passes control of the AP to start_secondary(). The AP uses a new stack frame, which was set up by the BSP in do_boot_cpu() -> fork_by_hand() -> do_fork().
8.5. start_secondary()
All APs wait for the smp_commenced signal from the BSP, which is raised in Section 8.2 by smp_init() -> smp_commence(). After getting this signal, they run their "idle" processes.

///////////////////////////////////////////////////////////////////////////////
int __init start_secondary(void *unused)
{
    /* Dont put anything before smp_callin(), SMP
     * booting is too fragile that we want to limit the
     * things done here to the most necessary things. */
    cpu_init();
    smp_callin();
    while (!atomic_read(&smp_commenced))
        rep_nop();
    /* low-memory mappings have been cleared, flush them from
     * the local TLBs too. */
    local_flush_tlb();

    return cpu_idle();    // never return, see Section 7.3
}
On each AP, cpu_idle() -> init_idle() clears the corresponding bit in wait_init_idle. Once all bits are cleared, the BSP leaves its wait loop, finishes smp_init() and continues with the next function in start_kernel() (i.e. rest_init()).
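The wait_init_idle handshake between the BSP and the APs is plain bitmap bookkeeping. The sketch below models it with pure functions on a single thread; the function names are hypothetical, and the real kernel uses atomic bit operations and a spin loop (cpu_relax(); barrier();) rather than these helpers.

```c
#include <assert.h>

/* BSP side of smp_init(): wait on every online CPU except the BSP itself. */
unsigned long bsp_begin_wait(unsigned long cpu_online_map, int bsp_cpu)
{
    return cpu_online_map & ~(1UL << bsp_cpu);
}

/* AP side: the effect of clear_bit(current->processor, &wait_init_idle). */
unsigned long ap_init_idle_done(unsigned long wait_map, int cpu)
{
    assert(cpu >= 0 && cpu < (int)(8 * sizeof wait_map));
    return wait_map & ~(1UL << cpu);
}

/* The BSP may leave its wait loop once no bit is left set. */
int bsp_may_continue(unsigned long wait_map)
{
    return wait_map == 0;
}
```

With four CPUs online (map 0xF) and CPU 0 as the BSP, the wait map starts at 0xE and reaches 0 only after CPUs 1, 2 and 3 have each run init_idle().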
8.6. Reference
MultiProcessor Specification
IA-32 Intel Architecture Software Developer's Manual
Linux Kernel 2.4 Internals: Ch.1.7. SMP Bootup on x86
Linux SMP HOWTO
ACPI spec
An Implementation Of Multiprocessor Linux: linux/Documentation/smp.tex
A. Kernel Build Example
Here is a kernel build example (in Redhat 9.0). Statements between "/*" and "*/" are in-line comments, not console output.| [root@localhost root]# ln -s /usr/src/linux-2.4.20 /usr/src/linux[root@localhost root]# cd /usr/src/linux[root@localhost linux]# make xconfig /* Create .config * 1. "Load Configuration from File" -> * /boot/config-2.4.20-28.9, or whatever you like * 2. Modify kernel configuration parameters * 3. "Save and Exit" */[root@localhost linux]# make oldconfig /* Re-check .config, optional */[root@localhost linux]# vi Makefile /* Modify EXTRAVERSION in linux/Makefile, optional */[root@localhost linux]# make dep /* Create .depend and more */[root@localhost linux]# make bzImage /* ... Some output omitted */ld -m elf_i386 -T /usr/src/linux-2.4.20/arch/i386/vmlinux.lds -e stext arch/i386/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o init/do_mounts.o \ --start-group \ arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/fs.o ipc/ipc.o \ drivers/char/char.o drivers/block/block.o drivers/misc/misc.o drivers/net/net.o drivers/media/media.o drivers/char/drm/drm.o drivers/net/fc/fc.o drivers/net/appletalk/appletalk.o drivers/net/tokenring/tr.o drivers/net/wan/wan.o drivers/atm/atm.o drivers/ide/idedriver.o drivers/cdrom/driver.o drivers/pci/driver.o drivers/net/pcmcia/pcmcia_net.o drivers/net/wireless/wireless_net.o drivers/pnp/pnp.o drivers/video/video.o drivers/net/hamradio/hamradio.o drivers/md/mddev.o drivers/isdn/vmlinux-obj.o \ net/network.o \ /usr/src/linux-2.4.20/arch/i386/lib/lib.a /usr/src/linux-2.4.20/lib/lib.a /usr/src/linux-2.4.20/arch/i386/lib/lib.a \ --end-group \ -o vmlinuxnm vmlinux | grep -v '\(compiled\)\|\(\.o$\)\|\( [aUw] \)\|\(\.\.ng$\)\|\(LASH[RL]DI\)' | sort > System.mapmake[1]: Entering directory `/usr/src/linux-2.4.20/arch/i386/boot'gcc -E -D__KERNEL__ -I/usr/src/linux-2.4.20/include -D__BIG_KERNEL__ -traditional -DSVGA_MODE=NORMAL_VGA bootsect.S -o bbootsect.sas -o bbootsect.o 
bbootsect.sbootsect.S: Assembler messages:bootsect.S:239: Warning: indirect lcall without `*'ld -m elf_i386 -Ttext 0x0 -s --oformat binary bbootsect.o -o bbootsectgcc -E -D__KERNEL__ -I/usr/src/linux-2.4.20/include -D__BIG_KERNEL__ -D__ASSEMBLY__ -traditional -DSVGA_MODE=NORMAL_VGA setup.S -o bsetup.sas -o bsetup.o bsetup.ssetup.S: Assembler messages:setup.S:230: Warning: indirect lcall without `*'ld -m elf_i386 -Ttext 0x0 -s --oformat binary -e begtext -o bsetup bsetup.omake[2]: Entering directory `/usr/src/linux-2.4.20/arch/i386/boot/compressed'tmppiggy=_tmp_$$piggy; \rm -f $tmppiggy $tmppiggy.gz $tmppiggy.lnk; \objcopy -O binary -R .note -R .comment -S /usr/src/linux-2.4.20/vmlinux $tmppiggy; \gzip -f -9 < $tmppiggy > $tmppiggy.gz; \echo "SECTIONS { .data : { input_len = .; LONG(input_data_end - input_data) input_data = .; *(.data) input_data_end = .; }}" > $tmppiggy.lnk; \ld -m elf_i386 -r -o piggy.o -b binary $tmppiggy.gz -b elf32-i386 -T $tmppiggy.lnk; \rm -f $tmppiggy $tmppiggy.gz $tmppiggy.lnkgcc -D__ASSEMBLY__ -D__KERNEL__ -I/usr/src/linux-2.4.20/include -traditional -chead.Sgcc -D__KERNEL__ -I/usr/src/linux-2.4.20/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 -DKBUILD_BASENAME=misc -c misc.cld -m elf_i386 -Ttext 0x100000 -e startup_32 -o bvmlinux head.o misc.o piggy.omake[2]: Leaving directory `/usr/src/linux-2.4.20/arch/i386/boot/compressed'gcc -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -o tools/build tools/build.c -I/usr/src/linux-2.4.20/includeobjcopy -O binary -R .note -R .comment -S compressed/bvmlinux compressed/bvmlinux.outtools/build -b bbootsect bsetup compressed/bvmlinux.out CURRENT > bzImageRoot device is (3, 67)Boot sector 512 bytes.Setup is 4780 bytes.System is 852 kBmake[1]: Leaving directory `/usr/src/linux-2.4.20/arch/i386/boot'[root@localhost linux]# make modules modules_install /* ... 
Some output omitted */cd /lib/modules/2.4.20; \mkdir -p pcmcia; \find kernel -path '*/pcmcia/*' -name '*.o' | xargs -i -r ln -sf ../{} pcmciaif [ -r System.map ]; then /sbin/depmod -ae -F System.map 2.4.20; fi[root@localhost linux]# cp arch/i386/boot/bzImage /boot/vmlinuz-2.4.20[root@localhost linux]# cp vmlinux /boot/vmlinux-2.4.20[root@localhost linux]# cp System.map /boot/System.map-2.4.20[root@localhost linux]# cp .config /boot/config-2.4.20[root@localhost linux]# mkinitrd /boot/initrd-2.4.20.img 2.4.20[root@localhost linux]# vi /boot/grub/grub.conf /* Add the following lines to grub.conf:title Linux (2.4.20) kernel /vmlinuz-2.4.20 ro root=LABEL=/ initrd /initrd-2.4.20.img */ |
Refer to Kernelnewbies FAQ: How do I compile a kernel and Kernel Rebuild Procedure for more details.
To build the kernel in Debian, also refer to Debian Installation Manual: Compiling a New Kernel, The Debian GNU/Linux FAQ: Debian and the kernel, and Debian Reference: The Linux kernel under Debian. Check "zless /usr/share/doc/kernel-package/Problems.gz" if you encounter problems.
B. Internal Linker Script
Without -T (--script=) option specified, ld will use this builtin script to link targets:| [root@localhost linux]# ld --verboseGNU ld version 2.13.90.0.18 20030206 Supported emulations: elf_i386 i386linuxusing internal linker script:==================================================/* Script for -z combreloc: combine and sort reloc sections */OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")OUTPUT_ARCH(i386)ENTRY(_start)SEARCH_DIR("/usr/i386-redhat-linux/lib"); SEARCH_DIR("/usr/lib"); SEARCH_DIR("/usr/local/lib"); SEARCH_DIR("/lib");/* Do we need any of these for elf? __DYNAMIC = 0; */SECTIONS{ /* Read-only sections, merged into text segment: */ . = 0x08048000 + SIZEOF_HEADERS; .interp : { *(.interp) } .hash : { *(.hash) } .dynsym : { *(.dynsym) } .dynstr : { *(.dynstr) } .gnu.version : { *(.gnu.version) } .gnu.version_d : { *(.gnu.version_d) } .gnu.version_r : { *(.gnu.version_r) } .rel.dyn : { *(.rel.init) *(.rel.text .rel.text.* .rel.gnu.linkonce.t.*) *(.rel.fini) *(.rel.rodata .rel.rodata.* .rel.gnu.linkonce.r.*) *(.rel.data .rel.data.* .rel.gnu.linkonce.d.*) *(.rel.tdata .rel.tdata.* .rel.gnu.linkonce.td.*) *(.rel.tbss .rel.tbss.* .rel.gnu.linkonce.tb.*) *(.rel.ctors) *(.rel.dtors) *(.rel.got) *(.rel.bss .rel.bss.* .rel.gnu.linkonce.b.*) } .rela.dyn : { *(.rela.init) *(.rela.text .rela.text.* .rela.gnu.linkonce.t.*) *(.rela.fini) *(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*) *(.rela.data .rela.data.* .rela.gnu.linkonce.d.*) *(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*) *(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*) *(.rela.ctors) *(.rela.dtors) *(.rela.got) *(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*) } .rel.plt : { *(.rel.plt) } .rela.plt : { *(.rela.plt) } .init : { KEEP (*(.init)) } =0x90909090 .plt : { *(.plt) } .text : { *(.text .stub .text.* .gnu.linkonce.t.*) /* .gnu.warning sections are handled specially by elf32.em. 
*/ *(.gnu.warning) } =0x90909090 .fini : { KEEP (*(.fini)) } =0x90909090 PROVIDE (__etext = .); PROVIDE (_etext = .); PROVIDE (etext = .); .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) } .rodata1 : { *(.rodata1) } .eh_frame_hdr : { *(.eh_frame_hdr) } .eh_frame : ONLY_IF_RO { KEEP (*(.eh_frame)) } .gcc_except_table : ONLY_IF_RO { *(.gcc_except_table) } /* Adjust the address for the data segment. We want to adjust up to the same address within the page on the next page up. */ . = ALIGN (0x1000) - ((0x1000 - .) & (0x1000 - 1)); . = DATA_SEGMENT_ALIGN (0x1000, 0x1000); /* For backward-compatibility with tools that don't support the *_array_* sections below, our glibc's crt files contain weak definitions of symbols that they reference. We don't want to use them, though, unless they're strictly necessary, because they'd bring us empty sections, unlike PROVIDE below, so we drop the sections from the crt files here. */ /DISCARD/ : { */crti.o(.init_array .fini_array .preinit_array) */crtn.o(.init_array .fini_array .preinit_array) } /* Ensure the __preinit_array_start label is properly aligned. We could instead move the label definition inside the section, but the linker would then create the section even if it turns out to be empty, which isn't pretty. */ . 
= ALIGN(32 / 8); PROVIDE (__preinit_array_start = .); .preinit_array : { *(.preinit_array) } PROVIDE (__preinit_array_end = .); PROVIDE (__init_array_start = .); .init_array : { *(.init_array) } PROVIDE (__init_array_end = .); PROVIDE (__fini_array_start = .); .fini_array : { *(.fini_array) } PROVIDE (__fini_array_end = .); .data : { *(.data .data.* .gnu.linkonce.d.*) SORT(CONSTRUCTORS) } .data1 : { *(.data1) } .tdata : { *(.tdata .tdata.* .gnu.linkonce.td.*) } .tbss : { *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) } .eh_frame : ONLY_IF_RW { KEEP (*(.eh_frame)) } .gcc_except_table : ONLY_IF_RW { *(.gcc_except_table) } .dynamic : { *(.dynamic) } .ctors : { /* gcc uses crtbegin.o to find the start of the constructors, so we make sure it is first. Because this is a wildcard, it doesn't matter if the user does not actually link against crtbegin.o; the linker won't look for a file to match a wildcard. The wildcard also means that it doesn't matter which directory crtbegin.o is in. */ KEEP (*crtbegin.o(.ctors)) /* We don't want to include the .ctor section from from the crtend.o file until after the sorted ctors. The .ctor section from the crtend file contains the end of ctors marker and it must be last */ KEEP (*(EXCLUDE_FILE (*crtend.o ) .ctors)) KEEP (*(SORT(.ctors.*))) KEEP (*(.ctors)) } .dtors : { KEEP (*crtbegin.o(.dtors)) KEEP (*(EXCLUDE_FILE (*crtend.o ) .dtors)) KEEP (*(SORT(.dtors.*))) KEEP (*(.dtors)) } .jcr : { KEEP (*(.jcr)) } .got : { *(.got.plt) *(.got) } _edata = .; PROVIDE (edata = .); __bss_start = .; .bss : { *(.dynbss) *(.bss .bss.* .gnu.linkonce.b.*) *(COMMON) /* Align here to ensure that the .bss section occupies space up to _end. Align after .bss to ensure correct alignment even if the .bss section disappears because there are no input sections. */ . = ALIGN(32 / 8); } . = ALIGN(32 / 8); _end = .; PROVIDE (end = .); . = DATA_SEGMENT_END (.); /* Stabs debugging sections. 
*/ .stab 0 : { *(.stab) } .stabstr 0 : { *(.stabstr) } .stab.excl 0 : { *(.stab.excl) } .stab.exclstr 0 : { *(.stab.exclstr) } .stab.index 0 : { *(.stab.index) } .stab.indexstr 0 : { *(.stab.indexstr) } .comment 0 : { *(.comment) } /* DWARF debug sections. Symbols in the DWARF debugging sections are relative to the beginning of the section so we begin them at 0. */ /* DWARF 1 */ .debug 0 : { *(.debug) } .line 0 : { *(.line) } /* GNU DWARF 1 extensions */ .debug_srcinfo 0 : { *(.debug_srcinfo) } .debug_sfnames 0 : { *(.debug_sfnames) } /* DWARF 1.1 and DWARF 2 */ .debug_aranges 0 : { *(.debug_aranges) } .debug_pubnames 0 : { *(.debug_pubnames) } /* DWARF 2 */ .debug_info 0 : { *(.debug_info .gnu.linkonce.wi.*) } .debug_abbrev 0 : { *(.debug_abbrev) } .debug_line 0 : { *(.debug_line) } .debug_frame 0 : { *(.debug_frame) } .debug_str 0 : { *(.debug_str) } .debug_loc 0 : { *(.debug_loc) } .debug_macinfo 0 : { *(.debug_macinfo) } /* SGI/MIPS DWARF 2 extensions */ .debug_weaknames 0 : { *(.debug_weaknames) } .debug_funcnames 0 : { *(.debug_funcnames) } .debug_typenames 0 : { *(.debug_typenames) } .debug_varnames 0 : { *(.debug_varnames) }}==================================================[root@localhost linux]# |
C. GRUB and LILO
Both GNU GRUB and LILO understand the real-mode kernel header format and will load the bootsect (one sector), setup code (setup_sects sectors) and compressed kernel image (syssize*16 bytes) into memory. They fill out the loader identifier (type_of_loader) and try to pass appropriate parameters and options to the kernel. After they finish their jobs, control is passed to setup code.
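The sizes a boot loader has to load follow directly from the two header fields named above. The following sketch computes them; the struct and function names are hypothetical, and reading setup_sects and syssize out of the header is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Hypothetical summary of the three pieces a boot loader reads. */
struct image_layout {
    size_t bootsect_bytes;   /* always one sector */
    size_t setup_bytes;      /* setup_sects sectors */
    size_t system_bytes;     /* syssize * 16 bytes */
};

struct image_layout kernel_image_layout(uint8_t setup_sects, uint32_t syssize)
{
    struct image_layout l;

    l.bootsect_bytes = SECTOR_SIZE;
    l.setup_bytes    = (size_t)setup_sects * SECTOR_SIZE;
    l.system_bytes   = (size_t)syssize * 16u;
    return l;
}

/* Number of whole sectors needed to hold a given number of bytes. */
size_t sectors_needed(size_t bytes)
{
    assert(bytes > 0);
    return (bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;
}
```

For the build in Appendix A, where "Setup is 4780 bytes", sectors_needed(4780) gives 10 sectors, so the setup code would occupy 5120 bytes on disk.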
C.1. GNU GRUB
The following GNU GRUB program outline is based on grub-0.93.

stage2/stage2.c:cmain()
`-- run_menu()
    `-- run_script();
        |-- builtin = find_command(heap);
        |-- kernel_func();    // builtin->func() for command "kernel"
        |   `-- load_image();    // search BOOTSEC_SIGNATURE in boot.c
        |       /* memory from 0x100000 is populated by and in the order of
        |        * (bvmlinux, bbootsect, bsetup) or (vmlinux, bootsect, setup) */
        |-- initrd_func();    // for command "initrd"
        |   `-- load_initrd();
        `-- boot_func();    // for implicit command "boot"
            `-- linux_boot();    // defined in stage2/asm.S
                or big_linux_boot();    // not in grub/asmstub.c!

// In stage2/asm.S
linux_boot:
    /* copy kernel */
    move system code from 0x100000 to 0x10000 (linux_text_len bytes);
big_linux_boot:
    /* copy the real mode part */
    EBX = linux_data_real_addr;
    move setup code from linux_data_tmp_addr (0x100000+text_len)
        to linux_data_real_addr (0x9100 bytes);
    /* change %ebx to the segment address */
    linux_setup_seg = (EBX >> 4) + 0x20;
    /* XXX new stack pointer in safe area for calling functions */
    ESP = 0x4000;
    stop_floppy();
    /* final setup for linux boot */
    prot_to_real();
    cli;
    SS:ESP = BX:9000;
    DS = ES = FS = GS = BX;
    /* jump to start, i.e. ljmp linux_setup_seg:0
     * Note that linux_setup_seg is just changed to BX. */
    .byte 0xea
    .word 0
linux_setup_seg:
    .word 0
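The expression linux_setup_seg = (EBX >> 4) + 0x20 in the outline converts the real-mode data address into a segment and then skips 0x20 paragraphs (512 bytes, one boot sector), so the far jump lands on the first byte of the setup code. A small sketch of that calculation follows; the function names are hypothetical, and 0x90000 is only an assumed value for linux_data_real_addr.

```c
#include <assert.h>
#include <stdint.h>

/* One 512-byte boot sector expressed in 16-byte paragraphs. */
#define BOOTSECT_PARAGRAPHS 0x20u

/* Segment of the setup code, given where the real-mode part was copied. */
uint16_t setup_segment(uint32_t linux_data_real_addr)
{
    assert((linux_data_real_addr & 0xfu) == 0);     /* paragraph aligned */
    assert(linux_data_real_addr < 0x100000u);       /* reachable in real mode */
    return (uint16_t)((linux_data_real_addr >> 4) + BOOTSECT_PARAGRAPHS);
}

/* Linear address reached by "ljmp linux_setup_seg:0". */
uint32_t setup_entry_linear(uint32_t linux_data_real_addr)
{
    return (uint32_t)setup_segment(linux_data_real_addr) << 4;
}
```

Under the assumed address 0x90000, the segment comes out as 0x9020 and the jump target as linear address 0x90200, i.e. 512 bytes past the start of the boot sector copy.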
Refer to "info grub" for GRUB manual.
One reported GNU GRUB bug should be noted if you are porting grub-0.93 and making changes to bsetup.
C.2. LILO
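Unlike GRUB, LILO does not read its configuration file when the system boots. The real work is done earlier, when the lilo command is invoked from a terminal: it parses the configuration file, builds the map file and updates the boot sector.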
The following LILO program outline is based on lilo-22.5.8.

lilo.c:main()
|-- cfg_open(config_file);
|-- cfg_parse(cf_options);
|-- bsect_open(boot_dev, map_file, install, delay, timeout);
|   |-- open_bsect(boot_dev);
|   `-- map_create(map_file);
|-- cfg_parse(cf_top)
|   `-- cfg_do_set();
|       `-- do_image();    // walk->action for "image=" section
|           |-- cfg_parse(cf_image) -> cfg_do_set();
|           |-- bsect_common(&descr, 1);
|           |   |-- map_begin_section();
|           |   |-- map_add_sector(fallback_buf);
|           |   `-- map_add_sector(options);
|           |-- boot_image(name, &descr) or boot_device(name, range, &descr);
|           |   |-- int fd = geo_open(&descr, name, O_RDONLY);
|           |   |   read(fd, &buff, SECTOR_SIZE);
|           |   |   map_add(&geo, 0, image_sectors);
|           |   |   map_end_section(&descr->start, setup_sects+2+1);
|           |   |   /* two sectors created in bsect_common(),
|           |   |    * another one sector for bootsect */
|           |   |   geo_close(&geo);
|           |   `-- fd = geo_open(&descr, initrd, O_RDONLY);
|           |       map_begin_section();
|           |       map_add(&geo, 0, initrd_sectors);
|           |       map_end_section(&descr->initrd,0);
|           |       geo_close(&geo);
|           `-- bsect_done(name, &descr);
`-- bsect_update(backup_file, force_backup, 0);    // update boot sector
    |-- make_backup();
    |-- map_begin_section();
    |   map_add_sector(table);
    |   map_write(&param2, keytab, 0, 0);
    |   map_close(&param2, here2);
    |-- // ... perform the relocation of the boot sector
    |-- // ... setup bsect_wr to correct place
    |-- write(fd, bsect_wr, SECTOR_SIZE);
    `-- close(fd);
map_add(), map_add_sector() and map_add_zero() may call map_register() to complete their jobs, while map_register() will keep a list for all (CX, DX, AL) triplets (data structure SECTOR_ADDR) used to identify all registered sectors.
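The (CX, DX, AL) triplets use the same register names as the classic BIOS INT 13h "read sectors" call. Assuming the triplets follow that convention (an assumption; LILO's actual SECTOR_ADDR layout is not reproduced here and may differ), a cylinder/head/sector address could be packed as in this sketch. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical triplet type; this is NOT LILO's SECTOR_ADDR definition. */
struct sector_triplet {
    uint16_t cx;   /* CH = cylinder bits 0-7, CL = sector | cylinder bits 8-9 << 6 */
    uint16_t dx;   /* DH = head, DL = BIOS drive number */
    uint8_t  al;   /* number of sectors to transfer */
};

/* Pack a CHS address in the BIOS INT 13h register convention. */
struct sector_triplet encode_chs(unsigned cyl, unsigned head, unsigned sector,
                                 unsigned drive, unsigned count)
{
    struct sector_triplet t;

    /* Limits of the classic interface: 10-bit cylinder, 6-bit sector from 1. */
    assert(cyl < 1024 && head < 256 && sector >= 1 && sector <= 63);
    assert(drive < 256 && count >= 1 && count < 256);

    t.cx = (uint16_t)(((cyl & 0xffu) << 8) | ((cyl >> 8) << 6) | sector);
    t.dx = (uint16_t)((head << 8) | drive);
    t.al = (uint8_t)count;
    return t;
}
```

For instance, the first sector of the first hard disk (cylinder 0, head 0, sector 1, drive 0x80) encodes as CX=0x0001, DX=0x0080, AL=1.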
LILO runs first.S and second.S to boot a system. It calls second.S:doboot() to load the map file, bootsect and setup code. Then it calls lfile() to load the system code, calls launch2() -> launch() -> cl_wait() -> start_setup() -> start_setup2() and finally executes the "jmpi 0,SETUPSEG" instruction to run the setup code.
Refer to "man lilo" and "man lilo.conf" for LILO details.
C.3. Reference
D. FAQ
For things that do not fit into the appropriate chapters, or that simply belong here. /* TODO: */