Linux內核調試技術——進程D狀態死鎖檢測

作者：圍城 2016-08-23 09:17:08

所謂的D狀態就是Linux的進程存在多種狀態，如TASK_RUNNING的運行態、EXIT_DEAD的停止態和TASK_INTERRUPTIBLE的接收信號的等待狀態等等（可在include/linux/sched.h中查看）。其中有一種狀態等待為TASK_UNINTERRUPTIBLE，稱為D狀態。

Linux的進程存在多種狀態，如TASK_RUNNING的運行態、EXIT_DEAD的停止態和TASK_INTERRUPTIBLE的接收信號的等待狀態等等(可在include/linux/sched.h中查看)。其中有一種狀態等待為TASK_UNINTERRUPTIBLE，稱為D狀態，該種狀態下進程不接收信號，只能通過wake_up喚醒。處于這種狀態的情況有很多，例如mutex鎖就可能會設置進程于該狀態，有時候進程在等待某種IO資源就緒時(wait_event機制)會設置進程進入該狀態。一般情況下，進程處于該狀態的時間不會太久，但若IO設備出現故障或者出現進程死鎖等情況，進程就可能長期處于該狀態而無法再返回到TASK_RUNNING態。因此，內核為了便于發現這類情況設計出了hung task機制專門用于檢測長期處于D狀態的進程并發出告警。本文分析內核hung task機制的源碼并給出一個示例演示。

一、hung task機制分析

內核在很早的版本中就已經引入了hung task機制，本文以較新的Linux 4.1.15版本源碼為例進行分析，代碼量并不多，源代碼文件為kernel/hung_task.c。

首先給出整體流程框圖和設計思想：

圖 D狀態死鎖流程圖

其核心思想為創建一個內核監測進程循環監測處于D狀態的每一個進程(任務)，統計它們在兩次檢測之間的調度次數，如果發現有任務在兩次監測之間沒有發生任何的調度則可判斷該進程一直處于D狀態，很有可能已經死鎖，因此觸發報警日志打印，輸出進程的基本信息，棧回溯以及寄存器保存信息以供內核開發人員定位。

下面詳細分析實現方式：

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
static int __init hung_task_init(void)   
{   
    atomic_notifier_chain_register(&panic_notifier_list, &panic_block);   
    watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");   
   
    return 0;   
}   
subsys_initcall(hung_task_init);

首先，若在內核配置中啟用了該機制，在內核的subsys初始化階段就會調用hung_task_init()函數啟用功能，首先向內核的panic_notifier_list通知鏈注冊回調：

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
static struct notifier_block panic_block = {   
    .notifier_call = hung_task_panic,   
};

在內核觸發panic時就會調用該hung_task_panic()函數，這個函數的作用稍后再看。繼續往下初始化，調用kthread_run()函數創建了一個名為khungtaskd的線程，執行watchdog()函數，立即嘗試調度執行。該線程就是專用于檢測D狀態死鎖進程的后臺內核線程。

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
/*  
 * kthread which checks for tasks stuck in D state  
 */   
static int watchdog(void *dummy)   
{   
    set_user_nice(current, 0);   
   
    for ( ; ; ) {   
        unsigned long timeout = sysctl_hung_task_timeout_secs;   
   
        while (schedule_timeout_interruptible(timeout_jiffies(timeout)))   
            timeout = sysctl_hung_task_timeout_secs;   
   
        if (atomic_xchg(&reset_hung_task, 0))   
            continue;   
   
        check_hung_uninterruptible_tasks(timeout);   
    }   
   
    return 0;   
}

本進程首先設置優先級為0，即一般優先級，不影響其他進程。然后進入主循環(每隔timeout時間執行一次)，首先讓進程睡眠，設置的睡眠時間為

CONFIG_DEFAULT_HUNG_TASK_TIMEOUT，可以通過內核配置選項修改，默認值為120s，睡眠結束被喚醒后判斷原子變量標識reset_hung_task，若被置位則跳過本輪監測，同時會清除該標識。該標識通過reset_hung_task_detector()函數設置(目前內核中尚無其他程序使用該接口)：

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
void reset_hung_task_detector(void)   
{   
    atomic_set(&reset_hung_task, 1);   
}   
EXPORT_SYMBOL_GPL(reset_hung_task_detector);

接下來循環的***即為監測函數check_hung_uninterruptible_tasks()，函數入參為監測超時時間。

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
/*  
 * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for  
 * a really long time (120 seconds). If that happens, print out  
 * a warning.  
 */   
static void check_hung_uninterruptible_tasks(unsigned long timeout)   
{   
    int max_count = sysctl_hung_task_check_count;   
    int batch_count = HUNG_TASK_BATCHING;   
    struct task_struct *g, *t;   
   
    /*  
     * If the system crashed already then all bets are off,  
     * do not report extra hung tasks:  
     */   
    if (test_taint(TAINT_DIE) || did_panic)   
        return;   
   
    rcu_read_lock();   
    for_each_process_thread(g, t) {   
        if (!max_count--)   
            goto unlock;   
        if (!--batch_count) {   
            batch_count = HUNG_TASK_BATCHING;   
            if (!rcu_lock_break(g, t))   
                goto unlock;   
        }   
        /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */   
        if (t->state == TASK_UNINTERRUPTIBLE)   
            check_hung_task(t, timeout);   
    }   
 unlock:   
    rcu_read_unlock();   
}

首先檢測內核是否已經DIE了或者已經panic了，如果是則表明內核已經crash了，無需再進行監測了，直接返回即可。注意這里的did_panic標識在前文中的panic通知鏈回調函數中hung_task_panic()置位：

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
static int   
hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)   
{   
    did_panic = 1;   
   
    return NOTIFY_DONE;   
}

接下去若尚無觸發內核crash，則進入監測流程并逐一檢測內核中的所有進程(任務task)，該過程在RCU加鎖的狀態下進行，因此為了避免在進程較多的情況下加鎖時間過長，這里設置了一個batch_count，一次最多檢測HUNG_TASK_BATCHING個進程。于此同時用戶也可以設定***的檢測個數max_count=sysctl_hung_task_check_count，默認值為***PID個數PID_MAX_LIMIT(通過sysctl命令設置)。

函數調用for_each_process_thread()函數輪詢內核中的所有進程(任務task)，僅對狀態處于TASK_UNINTERRUPTIBLE狀態的進程進行超時判斷，調用check_hung_task()函數，入參為task_struct結構和超時時間(120s)：

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
static void check_hung_task(struct task_struct *t, unsigned long timeout)   
{   
    unsigned long switch_count = t->nvcsw + t->nivcsw;   
   
    /*  
     * Ensure the task is not frozen.  
     * Also, skip vfork and any other user process that freezer should skip.  
     */   
    if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))   
        return;   
   
    /*  
     * When a freshly created task is scheduled once, changes its state to  
     * TASK_UNINTERRUPTIBLE without having ever been switched out once, it  
     * musn't be checked.  
     */   
    if (unlikely(!switch_count))   
        return;   
   
    if (switch_count != t->last_switch_count) {   
        t->last_switch_count = switch_count;   
        return;   
    }   
   
    trace_sched_process_hang(t);   
   
    if (!sysctl_hung_task_warnings)   
        return;   
   
    if (sysctl_hung_task_warnings > 0)   
        sysctl_hung_task_warnings--;

首先通過t->nvcsw和t->nivcsw的計數累加表示進程從創建開始至今的調度次數總和，其中t->nvcsw表示進程主動放棄CPU的次數，t->nivcsw表示被強制搶占的次數。隨后函數判斷幾個標識：(1)如果進程被frozen了那就跳過檢測;(2)調度次數為0的不檢測。

接下來判斷從上一次檢測時保存的進程調度次數和本次是否相同，若不相同則表明這輪timeout(120s)時間內進程發生了調度，則更新該調度值返回，否則則表明該進程已經有timeout(120s)時間沒有得到調度了，一直處于D狀態。接下來的trace_sched_process_hang()暫不清楚作用，然后判斷sysctl_hung_task_warnings標識，它表示需要觸發報警的次數，用戶也可以通過sysctl命令配置，默認值為10，即若當前檢測的進程一直處于D狀態，默認情況下此處每2分鐘發出一次告警，一共發出10次，之后不再發出告警。下面來看告警代碼：

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
/*  
 * Ok, the task did not get scheduled for more than 2 minutes,  
 * complain:  
 */   
pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",   
    t->comm, t->pid, timeout);   
pr_err("      %s %s %.*s\n",   
    print_tainted(), init_utsname()->release,   
    (int)strcspn(init_utsname()->version, " "),   
    init_utsname()->version);   
pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""   
    " disables this message.\n");   
sched_show_task(t);   
debug_show_held_locks(t);   
   
touch_nmi_watchdog();

這里會在控制臺和日志中打印死鎖任務的名稱、PID號、超時時間、內核tainted信息、sysinfo、內核棧barktrace以及寄存器信息等。如果開啟了debug lock則打印鎖占用的情況，并touch nmi_watchdog以防止nmi_watchdog超時(對于我的ARM環境無需考慮nmi_watchdog)。

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
if (sysctl_hung_task_panic) {   
    trigger_all_cpu_backtrace();   
    panic("hung_task: blocked tasks");   
}

***如果設置了sysctl_hung_task_panic標識則直接觸發panic(該值可通過內核配置文件配置也可以通過sysctl設置)。

二、示例演示

演示環境：樹莓派b(Linux 4.1.15)

1、首先確認內核配置選項以確認開啟hung stak機制

[cpp] view plain copy  在CODE上查看代碼片派生到我的代碼片 
#include      
#include      
#include      
#include    
   
DEFINE_MUTEX(dlock);   
   
static int __init dlock_init(void)   
{   
    mutex_lock(&dlock);   
    mutex_lock(&dlock);    
       
    return 0;   
}   
   
static void __exit dlock_exit(void)    
{   
    return;   
}   
   
module_init(dlock_init);     
module_exit(dlock_exit);     
MODULE_LICENSE("GPL");

本示例程序定義了一個mutex鎖，然后在模塊的init函數中重復加鎖，人為造成死鎖現象(mutex_lock()函數會調用__mutex_lock_slowpath()將進程設置為TASK_UNINTERRUPTIBLE狀態)，進程進入D狀態后是無法退出的。可以通過ps命令來查看：

root@apple:~# busybox ps  
PID USER TIME COMMAND  
...... 
 
521 root 0:00 insmod dlock.ko  
......

然后查看該進程的狀態，可見已經進入了D狀態。

root@apple:~# cat /proc/521/status  
Name: insmod  
State: D (disk sleep)  
Tgid: 521  
Ngid: 0  
Pid: 521

至此在等待兩分鐘后調試串口就會輸出以下信息，可見每兩分鐘就會輸出一次：

[ 360.625466] INFO: task insmod:521 blocked for more than 120 seconds.  
[ 360.631878] Tainted: G O 4.1.15 #5  
[ 360.637042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.  
[ 360.644986] [] (__schedule) from [] (schedule+0x40/0xa4)  
[ 360.652129] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)  
[ 360.660570] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)  
[ 360.670142] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)  
[ 360.678432] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])  
[ 360.686480] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)  
[ 360.694976] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)  
[ 360.703170] [] (do_init_module) from [] (load_module+0x1690/0x1d34)  
[ 360.711284] [] (load_module) from [] (SyS_init_module+0xdc/0x130)  
[ 360.719239] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)  
[ 480.725351] INFO: task insmod:521 blocked for more than 120 seconds.  
[ 480.731759] Tainted: G O 4.1.15 #5  
[ 480.736917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.  
[ 480.744842] [] (__schedule) from [] (schedule+0x40/0xa4)  
[ 480.752029] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)  
[ 480.760479] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)  
[ 480.770066] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)  
[ 480.778363] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])  
[ 480.786402] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)  
[ 480.794897] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)  
[ 480.803085] [] (do_init_module) from [] (load_module+0x1690/0x1d34)  
[ 480.811188] [] (load_module) from [] (SyS_init_module+0xdc/0x130)  
[ 480.819113] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)  
[ 600.825353] INFO: task insmod:521 blocked for more than 120 seconds.  
[ 600.831759] Tainted: G O 4.1.15 #5  
[ 600.836916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.  
[ 600.844865] [] (__schedule) from [] (schedule+0x40/0xa4)  
[ 600.852005] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)  
[ 600.860445] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)  
[ 600.870014] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)  
[ 600.878303] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])  
[ 600.886339] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)  
[ 600.894835] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)  
[ 600.903023] [] (do_init_module) from [] (load_module+0x1690/0x1d34)  
[ 600.911133] [] (load_module) from [] (SyS_init_module+0xdc/0x130)  
[ 600.919059] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)

三、總結

D狀態死鎖一般在驅動開發的過程中比較常見，且不太容易定位，內核提供這種hung task機制，開發人員只需要將這些輸出的定位信息抓取并保留下來就可以快速的進行定位。

責任編輯：武曉燕來源： chinaunix

成人免费xxxxx在线视频软件_久久精品久久久_亚洲国产精品久久久_天天色天天色_亚洲人成一区_欧美一级欧美三级在线观看

Linux內核調試技術——進程D狀態死鎖檢測