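It has been a long time since I last updated this blog (why hasn't cnblogs updated its admin backend yet?). A few days ago, while working on a Linux 0.11 lab [1], I ran into a bizarre bug, so here is a brief record of how I debugged it.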

 

Symptom

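The lab asks for a simple semaphore implementation [2] in Linux 0.11. After I modified the kernel code, however, the test program failed on every run. For example: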

    /* pc_test.c */

    #define __LIBRARY__
    #include <stdio.h>
    #include <stdlib.h>
    #include <semaphore.h>
    #include <unistd.h>

    _syscall2(long, sem_open, const char *, name, unsigned int, value);
    _syscall1(int, sem_unlink, const char *, name);

    int main(void)
    {
        sem_t *mutex;
        if ((mutex = (sem_t *) sem_open("mutex", 1)) == (sem_t *)-1)
        {
            perror("opening mutex semaphore");
            return EXIT_FAILURE;
        }

        sem_unlink("mutex");

        return EXIT_SUCCESS;
    }

Running it reports a segmentation fault.

 

Locating the fault

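I instrumented sem.c, the core kernel code that implements the semaphores, with debug output, and eventually narrowed the segmentation fault down to find_sem, the function that looks up an existing semaphore: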

    /*
     * The commented-out part below is the linked-list struct I defined
     * in semaphore.h

    #define MAXSEMNAME 128
    struct sem_t
    {
        char m_name[MAXSEMNAME+1];
        unsigned long m_value;

        struct sem_t * m_prev;
        struct sem_t * m_next;

        struct task_struct * m_wait;
    };

    typedef struct sem_t sem_t;

    #define SEM_FAILED ((sem_t *)-1)
    */

    // Data structure optimization is possible here
    sem_t _semHead = {.m_name = "_semHead", .m_value = 0, .m_prev = NULL,
                      .m_next = NULL, .m_wait = NULL};

    sem_t *find_sem(const char* name)
    {
        sem_t *tmpSemP = &_semHead;
        while (tmpSemP->m_next != NULL)
        {
            if (strcmp((tmpSemP->m_name), name) == 0)
            {
                return tmpSemP;
            }
            tmpSemP = tmpSemP->m_next;
        }
        return tmpSemP;
    }
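As a side check, the traversal itself can be exercised in user space, where a bad pointer is much easier to inspect than inside the kernel. The following is only a minimal sketch under stated assumptions, not the lab's kernel code: it copies the sem_t layout and find_sem from above, leaves task_struct as an opaque forward declaration, and adds a hypothetical helper append_sem (not part of the original sem.c) to build a list by hand.

```c
/* User-space sketch of the list lookup above; not kernel code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXSEMNAME 128

struct task_struct;                 /* opaque here: only a pointer is stored */

typedef struct sem_t {
    char m_name[MAXSEMNAME + 1];
    unsigned long m_value;
    struct sem_t *m_prev;
    struct sem_t *m_next;
    struct task_struct *m_wait;
} sem_t;

/* List head; fields not named in the initializer are zero-initialized. */
sem_t _semHead = { .m_name = "_semHead" };

/* Same traversal as the find_sem shown above. */
sem_t *find_sem(const char *name)
{
    sem_t *tmpSemP = &_semHead;
    while (tmpSemP->m_next != NULL) {
        if (strcmp(tmpSemP->m_name, name) == 0)
            return tmpSemP;
        tmpSemP = tmpSemP->m_next;
    }
    return tmpSemP;
}

/* Hypothetical helper (an assumption for this sketch, not in the original):
 * link a caller-owned node at the tail of the list. */
void append_sem(sem_t *node, const char *name)
{
    sem_t *tail = &_semHead;

    assert(strlen(name) <= MAXSEMNAME);   /* name must fit in m_name */
    while (tail->m_next != NULL)
        tail = tail->m_next;

    strcpy(node->m_name, name);
    node->m_value = 0;
    node->m_prev = tail;
    node->m_next = NULL;
    node->m_wait = NULL;
    tail->m_next = node;
}
```

In this harness every pointer that find_sem dereferences is an ordinary user-space address. One thing it makes visible is that a lookup for a name that is not in the list returns the tail node (or &_semHead when the list is empty) rather than NULL, so a caller has to compare the name again before trusting the result.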

 

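Because this function dereferences pointers with expressions of the form P->member, the most likely culprit is a bad value of P. I therefore added printk calls around the operations on P to check whether its value had gone wrong: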

    sem_t *find_sem(const char* name)
    {
        printk("Now we are in find_sem\n"); // DEBUG
        sem_t *tmpSemP = &_semHead;
        while (tmpSemP->m_next != NULL)
        {
            printk("find_sem: tmpSemp before strcmp: %p\n", tmpSemP);
            if (