A Brief Look at Process Scheduling in Linux

2016-11-22

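While reading about softirqs earlier, I kept running into process scheduling, an area I had never really understood, so I am taking this opportunity to study it properly.

Modern operating systems are all multitasking. Hardware keeps gaining processor cores, but there is still no guarantee of one core per process, so something has to decide which process gets the CPU next. That component is the process scheduler.

The scheduler's job is to distribute CPU time sensibly among the runnable processes and to create the illusion that they all run in parallel. This puts two constraints on it:

1. The CPU time it hands out must not be too long, otherwise other programs respond late and fairness is hard to guarantee.

2. The CPU time it hands out must not be too short either, because every scheduling decision causes a context switch, and context switches are expensive.

The scheduler therefore has two tasks: 1. allocating time to processes; 2. switching contexts.

Put in one sentence, its job is: at the appropriate moment, following a reasonable scheduling algorithm, switch the context from one process to another.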
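The trade-off between the two constraints can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names and the microsecond figures are hypothetical inputs, not values taken from the kernel.

```c
#include <assert.h>

/* Percentage of CPU time lost to context switches when every task runs
 * for slice_us microseconds and each switch costs switch_us microseconds.
 * Both numbers are hypothetical inputs, not measured kernel values. */
static unsigned overhead_percent(unsigned slice_us, unsigned switch_us)
{
    assert(slice_us + switch_us > 0);
    return (100u * switch_us) / (slice_us + switch_us);
}

/* Worst-case wait, in microseconds, before a task runs again when
 * nr_tasks runnable tasks take turns on a single CPU. */
static unsigned long worst_wait_us(unsigned nr_tasks, unsigned slice_us,
                                   unsigned switch_us)
{
    assert(nr_tasks > 0);
    return (unsigned long)(nr_tasks - 1) * (slice_us + switch_us);
}
```

With a 5 us switch cost, a 95 us slice wastes 5% of the CPU on switching, while a 10 ms slice wastes almost nothing but makes each of ten tasks wait about 90 ms for its next turn.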

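Scheduler structure

The Linux scheduler is split into two layers. The part that is called directly from process context is the generic scheduler, also called the core scheduler. It is a component of its own, separate from the rest of process management, and it does not deal with processes directly; instead it manages them through the second layer, the concrete scheduler classes. The architecture is shown in the figure below:

As the figure shows, every process belongs to exactly one scheduler class, and Linux implements different scheduler classes for different needs. The classes are ordered by priority: when the generic scheduler picks a process it starts with the highest-priority class, and only if that class has no runnable process does it move on to the next class, and so on down the chain.

Each CPU maintains a scheduling queue called the run queue. A process appears on at most one run queue, because the same process cannot be picked by two CPUs at once. The run queue's data structure is struct rq. Mirroring the layering above, the generic scheduler works with rq directly, while the class-specific sub-run-queues are what actually interact with processes.

Scheduler classes

The kernel implements a scheduler-class framework that defines the functions a scheduler must provide; every concrete scheduler class has to implement them.

The kernel version discussed here (3.11.1) uses four scheduler classes: stop_sched_class, rt_sched_class, fair_sched_class and idle_sched_class. Newer kernels add one more, dl_sched_class. Because my own ability is limited, and because most processes belong to either the real-time scheduler or the completely fair scheduler, we concentrate on those two.

Here is the definition of a scheduler class: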

 struct sched_class {
     const struct sched_class *next;

     void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
     void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
     void (*yield_task) (struct rq *rq);
     bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

     void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

     struct task_struct * (*pick_next_task) (struct rq *rq);
     void (*put_prev_task) (struct rq *rq, struct task_struct *p);

 #ifdef CONFIG_SMP
     int  (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
     void (*migrate_task_rq)(struct task_struct *p, int next_cpu);

     void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
     void (*post_schedule) (struct rq *this_rq);
     void (*task_waking) (struct task_struct *task);
     void (*task_woken) (struct rq *this_rq, struct task_struct *task);

     void (*set_cpus_allowed)(struct task_struct *p,
                  const struct cpumask *newmask);

     void (*rq_online)(struct rq *rq);
     void (*rq_offline)(struct rq *rq);
 #endif

     void (*set_curr_task) (struct rq *rq);
     void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
     void (*task_fork) (struct task_struct *p);

     void (*switched_from) (struct rq *this_rq, struct task_struct *task);
     void (*switched_to) (struct rq *this_rq, struct task_struct *task);
     void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                  int oldprio);

     unsigned int (*get_rr_interval) (struct rq *rq,
                      struct task_struct *task);

 #ifdef CONFIG_FAIR_GROUP_SCHED
     void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
 };

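enqueue_task adds a process to the run queue. This happens when a process becomes ready (runnable).

dequeue_task is the inverse of enqueue_task. It happens when a process goes from running to blocked.

yield_task is used when a process gives up the CPU voluntarily.

pick_next_task selects the next process to run. It is called by the scheduler whenever a scheduling decision is made.

set_curr_task has to be called when the scheduling policy of a process changes.

task_tick is called by the periodic scheduler each time the periodic scheduler is activated.

task_fork links the fork system call to the scheduler: each time a new process is created, this function is called to notify the scheduler.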
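To make the "table of functions" idea concrete, here is a minimal user-space sketch of a scheduler class with just three hooks, implemented for a toy FIFO policy. Everything named toy_* or fifo_* is hypothetical and heavily simplified; it is not how any kernel class is written.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TASKS 8

struct toy_task { int pid; };

/* Hypothetical per-class queue; the kernel uses cfs_rq / rt_rq instead. */
struct toy_rq {
    struct toy_task *fifo[TOY_MAX_TASKS];
    size_t head, len;
};

/* Cut-down analogue of struct sched_class: a table of function pointers. */
struct toy_sched_class {
    void (*enqueue_task)(struct toy_rq *rq, struct toy_task *p);
    void (*dequeue_task)(struct toy_rq *rq, struct toy_task *p);
    struct toy_task *(*pick_next_task)(struct toy_rq *rq);
};

static void fifo_enqueue(struct toy_rq *rq, struct toy_task *p)
{
    assert(rq->len < TOY_MAX_TASKS);
    rq->fifo[(rq->head + rq->len) % TOY_MAX_TASKS] = p;
    rq->len++;
}

/* Simplification: only the task at the head of the queue can be dequeued. */
static void fifo_dequeue(struct toy_rq *rq, struct toy_task *p)
{
    assert(rq->len > 0 && rq->fifo[rq->head] == p);
    rq->head = (rq->head + 1) % TOY_MAX_TASKS;
    rq->len--;
}

/* Returns the task at the head without removing it, or NULL if empty. */
static struct toy_task *fifo_pick_next(struct toy_rq *rq)
{
    return rq->len ? rq->fifo[rq->head] : NULL;
}

static const struct toy_sched_class toy_fifo_class = {
    .enqueue_task   = fifo_enqueue,
    .dequeue_task   = fifo_dequeue,
    .pick_next_task = fifo_pick_next,
};
```

The caller only ever goes through the table, for example toy_fifo_class.pick_next_task(&rq), which is the same indirection the generic scheduler uses to stay independent of the concrete policy.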

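Run queues

As described above, each CPU maintains one run queue, represented by struct rq. The generic scheduler interacts with rq directly, and rq in turn contains sub-run-queues that belong to the individual scheduler classes; enqueueing and dequeueing a process follows the algorithm of its class.

rq holds the information that is global to one CPU. The structure is large, so it is not listed in full here. Roughly, it contains the number of runnable processes on this CPU's run queue and a measure of the queue's current load, and it embeds the sub-run-queues cfs_rq and rt_rq. All run queues in the system live in the per-CPU runqueues array, one element per CPU.

The kernel also defines a few macros for working with these global queues: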

 #define cpu_rq(cpu)        (&per_cpu(runqueues, (cpu)))
 #define this_rq()        (&__get_cpu_var(runqueues))
 #define task_rq(p)        cpu_rq(task_cpu(p))
 #define cpu_curr(cpu)        (cpu_rq(cpu)->curr)
 #define raw_rq()        (&__raw_get_cpu_var(runqueues))

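Scheduling entities

In Linux the thing being scheduled is not necessarily a single process; it can also be a group of processes. Linux therefore abstracts the object of scheduling into a scheduling entity. Just as many structures embed a list_head so that they can be linked into a list, anything that needs to be scheduled embeds a scheduling entity. The scheduler in fact operates on scheduling entities directly and derives the enclosing structure from the entity when it needs it.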

 struct sched_entity {
     struct load_weight    load;        /* for load-balancing */
     struct rb_node        run_node;
     struct list_head    group_node;
     unsigned int        on_rq;

     u64            exec_start;
     u64            sum_exec_runtime;
     u64            vruntime;
     u64            prev_sum_exec_runtime;

     u64            nr_migrations;

 #ifdef CONFIG_SCHEDSTATS
     struct sched_statistics statistics;
 #endif

 #ifdef CONFIG_FAIR_GROUP_SCHED
     struct sched_entity    *parent;
     /* rq on which this entity is (to be) queued: */
     struct cfs_rq        *cfs_rq;
     /* rq "owned" by this entity/group: */
     struct cfs_rq        *my_q;
 #endif

 #ifdef CONFIG_SMP
     /* Per-entity load-tracking */
     struct sched_avg    avg;
 #endif
 };

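load is used for load balancing and determines each entity's share of the queue's total load. Computing load weights is one of the scheduler's main responsibilities, because choosing the next process depends on them. run_node is a red-black tree node used to insert the entity into the tree. on_rq says whether the entity is on the run queue; 1 means it is. In the normal case (no sleeping or waiting), an entity is taken out of the red-black tree when it gets the CPU and put back when it gives the CPU up; on_rq stays 1 throughout and is cleared only when the entity is really dequeued, for example to sleep. Then come the time-related fields. exec_start records when the entity started running on the CPU. sum_exec_runtime records the total time it has spent on the CPU, and prev_sum_exec_runtime records how much it had already run before the current slice began. Each time the entity is put on the CPU (see set_next_entity further down), sum_exec_runtime is copied into prev_sum_exec_runtime; sum_exec_runtime itself is never reset and keeps growing while the entity runs. vruntime records the time that has elapsed on the virtual clock while the process executed; it is used by the CFS scheduler and is described later. The remaining fields parent, cfs_rq and my_q belong to group scheduling, which we leave aside for now.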
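The embedding mentioned earlier, where the scheduler handles an entity and recovers the structure around it, can be sketched in user space with offsetof. The toy_* names are hypothetical; the kernel's own helpers for this are container_of and, for CFS, task_of.

```c
#include <assert.h>
#include <stddef.h>

/* User-space version of the container_of idea, built on offsetof. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_entity {
    unsigned long long vruntime;
    unsigned int on_rq;
};

struct toy_task {
    int pid;
    struct toy_entity se;   /* embedded entity, like task_struct.se */
};

/* Analogue of task_of(): entity pointer -> enclosing task. */
static struct toy_task *toy_task_of(struct toy_entity *se)
{
    assert(se != NULL);
    return toy_container_of(se, struct toy_task, se);
}
```

Given a struct toy_task t, toy_task_of(&t.se) returns &t, so code that only ever stores entity pointers (in a tree, say) can still reach the task.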

Now the load field:

struct load_weight {
    unsigned long weight, inv_weight;
};

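There are two fields, weight and inv_weight. The former is the weight that corresponds to the entity's priority and is looked up in the prio_to_weight array. The latter exists to speed up the vruntime calculation and comes from the prio_to_wmult array, which has the same size as prio_to_weight and stores 2^32/weight in each slot. Division is expensive in the kernel, so precomputing the inverse is a compromise that replaces the division with a multiplication and a shift. For the vruntime calculation itself, see the calc_delta_mine function.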
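The multiply-and-shift trick can be sketched as follows. This assumes the usual CFS scaling delta_exec * NICE_0_LOAD / weight with NICE_0_LOAD = 1024; the toy_* names are hypothetical, and the sketch ignores the overflow handling that the kernel's calc_delta_mine performs.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024u   /* weight of a nice-0 task */

/* inv_weight = 2^32 / weight, precomputed once per weight. */
static uint32_t toy_inv_weight(uint32_t weight)
{
    assert(weight > 1);   /* for weight 1 the result would not fit in 32 bits */
    return (uint32_t)((UINT64_C(1) << 32) / weight);
}

/* delta_exec * NICE_0_LOAD / weight without a division at run time:
 * multiply by inv_weight, then shift right by 32.
 * Overflows for very large delta_exec; this sketch does not handle that. */
static uint64_t toy_calc_delta(uint64_t delta_exec, uint32_t weight)
{
    uint64_t tmp = delta_exec * TOY_NICE_0_LOAD;
    return (tmp * toy_inv_weight(weight)) >> 32;
}
```

A task with weight 1024 sees virtual time advance at the same rate as real time, a task with weight 2048 at half that rate, so heavier tasks accumulate vruntime more slowly and are picked more often.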

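That covers the basic architecture of the scheduler. Scheduling also requires computing process priorities, which is a fairly involved topic that deserves a section of its own. Below we follow the scheduling process using the CFS class.

When to schedule, and how

Scheduling cannot happen at arbitrary moments. As mentioned earlier, the system has a periodic scheduler that calls scheduler_tick at a fixed frequency; its main purpose is to trigger rescheduling based on how long a process has run. A process that blocks waiting for a resource can also invoke the scheduler explicitly. In addition, on the way back from kernel space to user space the kernel checks whether rescheduling is needed: the thread_info structure of each process has a flags field, and one bit of it, TIF_NEED_RESCHED (the bit number is architecture-specific), serves as the reschedule flag. When it is set, another process should get the CPU, typically one with higher priority, and the scheduler has to run. Finally, current kernels support kernel preemption, so kernel code itself can be preempted at suitable points. Kernel preemption is discussed at the end.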

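As for how scheduling is done, that depends on the scheduler class. Once it has been decided that scheduling is needed, the schedule function is called. Note that the periodic scheduler never switches tasks directly; at most it sets the reschedule bit TIF_NEED_RESCHED of the process, so the actual switch is still performed by the main scheduler, for example on the return to user space. Following schedule, the real work is done by __schedule, so let us look at that function directly: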

 static void __sched __schedule(void)
 {
     struct task_struct *prev, *next;
     unsigned long *switch_count;
     struct rq *rq;
     int cpu;

 need_resched:
     /* disable kernel preemption */
     preempt_disable();
     cpu = smp_processor_id();
     /* get this CPU's run queue */
     rq = cpu_rq(cpu);
     rcu_note_context_switch(cpu);
     /* remember the current task */
     prev = rq->curr;

     schedule_debug(prev);

     if (sched_feat(HRTICK))
         hrtick_clear(rq);

     /*
      * Make sure that signal_pending_state()->signal_pending() below
      * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
      * done by the caller to avoid the race with signal_wake_up().
      */
     smp_mb__before_spinlock();
     raw_spin_lock_irq(&rq->lock);

     switch_count = &prev->nivcsw;
     /* take prev off the run queue only if both conditions hold:
        1. prev is no longer runnable (prev->state != TASK_RUNNING)
        2. prev is not being preempted in kernel mode (PREEMPT_ACTIVE not set) */
     if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
         if (unlikely(signal_pending_state(prev->state, prev))) {
             prev->state = TASK_RUNNING;
         } else {
             deactivate_task(rq, prev, DEQUEUE_SLEEP);
             prev->on_rq = 0;

             /*
              * If a worker went to sleep, notify and ask workqueue
              * whether it wants to wake up a task to maintain
              * concurrency.
              */
             if (prev->flags & PF_WQ_WORKER) {
                 struct task_struct *to_wakeup;

                 to_wakeup = wq_worker_sleeping(prev, cpu);
                 if (to_wakeup)
                     try_to_wake_up_local(to_wakeup);
             }
         }
         switch_count = &prev->nvcsw;
     }

     pre_schedule(rq, prev);

     if (unlikely(!rq->nr_running))
         idle_balance(cpu, rq);

     /* tell the scheduler class that prev is about to be switched out */
     put_prev_task(rq, prev);
     /* pick the next runnable task */
     next = pick_next_task(rq);
     /* clear prev's TIF_NEED_RESCHED flag */
     clear_tsk_need_resched(prev);
     rq->skip_clock_update = 0;

     /* switch only if next differs from the current task */
     if (likely(prev != next)) {
         rq->nr_switches++;
         /* next becomes the run queue's current task */
         rq->curr = next;
         ++*switch_count;

         /* switch the process context */
         context_switch(rq, prev, next); /* unlocks the rq */
         /*
          * The context switch have flipped the stack from under us
          * and restored the local variables which were saved when
          * this task called schedule() in the past. prev == current
          * is still correct, but it can be moved to another cpu/rq.
          */
         cpu = smp_processor_id();
         rq = cpu_rq(cpu);
     } else
         raw_spin_unlock_irq(&rq->lock);

     post_schedule(rq);

     sched_preempt_enable_no_resched();
     if (need_resched())
         goto need_resched;
 }

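Kernel preemption must be disabled while the scheduler runs. In terms of level, the Linux scheduler does not necessarily rank above other code, but it certainly does not rank below ordinary processes, which is why preemption is off while it runs. By comparison, Windows uses interrupt request levels: ordinary threads run at passive level while the scheduler runs at DPC level, where only hardware interrupts can interrupt it.

In the code, preempt_disable is called first; it adjusts preempt_count so that kernel preemption is disabled. The function then fetches the current CPU's run queue structure rq and saves the current task in prev. The test prev->state && !(preempt_count() & PREEMPT_ACTIVE) decides whether prev has to be removed from the run queue: if the current process is no longer runnable (it is about to block) and PREEMPT_ACTIVE is not set, it is a candidate for removal. The code then checks for a pending signal. If there is one, the process is not removed for now and is set back to TASK_RUNNING; otherwise deactivate_task removes it from the queue and prev->on_rq is set to 0 to mark that the process is no longer on the run queue. The inner if checks whether the current process is a workqueue worker; if so, the workqueue is notified and asked whether another worker should be woken up.

After that if, pre_schedule is called. CFS does not implement this hook; the real-time class does. I am not sure exactly what it is for; in the real-time class it appears to give the CPU a chance to pull real-time tasks over from other CPUs.

The next if checks whether this CPU's run queue has any runnable process. If there is none, idle_balance is called to pull some work over from other CPUs. On a busy system this is uncommon.

Then the real work begins. put_prev_task does some preparation by calling the implementation in prev's scheduler class: prev->sched_class->put_prev_task(rq, prev). Its main job is to put the current task back into the run queue. Before that, if the current task is still on the run queue (in other words, still runnable), update_curr is called to bring its runtime accounting, including vruntime, up to date.

The important call is pick_next_task, which uses the scheduler classes to choose the task that will occupy the CPU next. Once the choice is made, clear_tsk_need_resched clears prev's reschedule flag.

The following if performs the switch when prev is not the process that was just chosen. It first increments the run queue's switch counter nr_switches and then sets rq->curr = next, so from the run queue's point of view next is already the current process. Then context_switch is called to switch contexts, which consists of two parts: switching the address space and switching the register state. After that kernel preemption is enabled again, and one process switch is complete. Finally the function checks whether a higher-priority process has been flagged in the meantime, and if so it schedules again.

That is the overall flow. More precisely, the next process starts executing from the EIP that was swapped in, which means the scheduler function begins running another process before it has finished; when the original process later regains the CPU it continues from the state saved earlier. At the end of __schedule the reschedule flag is checked once more; if it has been set, then unluckily the process is scheduled out again, but this is unlikely and is only a safeguard. After leaving the scheduling function the process runs normally.

Analysis of the core functions:

pick_next_task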

 static inline struct task_struct *
 pick_next_task(struct rq *rq)
 {
     const struct sched_class *class;
     struct task_struct *p;

     /*
      * Optimization: we know that if all tasks are in
      * the fair class we can call that function directly:
      */
     if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
         p = fair_sched_class.pick_next_task(rq);
         if (likely(p))
             return p;
     }

     /* walk the classes from the highest priority down, in the order
        stop_sched_class -> rt_sched_class -> fair_sched_class -> idle_sched_class */
     /*
     #define for_each_class(class) \
         for (class = sched_class_highest; class; class = class->next)
     */
     for_each_class(class) {
         p = class->pick_next_task(rq);
         if (p)
             return p;
     }

     BUG(); /* the idle class will always have a runnable task */
 }

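This function still belongs to the main scheduler layer and does not touch the core logic, so it is easy to follow. It first checks whether the number of runnable processes on this CPU's run queue equals the number of runnable processes on the CFS run queue. If they are equal, the run queue holds only CFS tasks, so the CFS class's method can be called directly to pick the next process. Otherwise the classes have to be tried one by one from the highest priority down, which is what for_each_class does: it calls each class's pick function in the order stop_sched_class -> rt_sched_class -> fair_sched_class -> idle_sched_class, and moves on to the next class only when the previous one found no runnable process. Here we look only at the CFS implementation: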
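The class walk and its fast path can be sketched with a linked chain of toy classes. The toy_* names and the nr_running-based test are hypothetical simplifications: the real code calls each class's pick_next_task and checks the returned task instead of a counter.

```c
#include <assert.h>
#include <stddef.h>

struct toy_class {
    const char *name;
    int nr_running;                 /* runnable tasks in this class */
    const struct toy_class *next;   /* next lower-priority class */
};

/* Walk from the highest-priority class down, like for_each_class(). */
static const struct toy_class *toy_pick_class(const struct toy_class *highest)
{
    const struct toy_class *class;

    for (class = highest; class; class = class->next)
        if (class->nr_running > 0)
            return class;
    return NULL;   /* not reached if the last class always has a task */
}

/* Same decision with the fast path: if every runnable task on the queue
 * belongs to the fair class, skip the walk entirely. */
static const struct toy_class *
toy_pick_class_fast(const struct toy_class *highest,
                    const struct toy_class *fair, int rq_nr_running)
{
    assert(highest != NULL && fair != NULL);
    if (fair->nr_running > 0 && rq_nr_running == fair->nr_running)
        return fair;
    return toy_pick_class(highest);
}
```

Because most tasks on a typical system are ordinary CFS tasks, the fast path usually succeeds and the walk is the exception.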

In fair.c the corresponding function is pick_next_task_fair:

 static struct task_struct *pick_next_task_fair(struct rq *rq)
 {
     struct task_struct *p;
     /* get the CFS run queue embedded in this CPU's run queue */
     struct cfs_rq *cfs_rq = &rq->cfs;
     struct sched_entity *se;

     /* nothing runnable in the fair class: return immediately */
     if (!cfs_rq->nr_running)
         return NULL;

     /* with group scheduling the loop descends through the groups;
        otherwise it runs exactly once */
     do {
         /* choose a runnable entity from the fair class */
         se = pick_next_entity(cfs_rq);
         /* take it out of the red-black tree and mark it as cfs_rq->curr */
         set_next_entity(cfs_rq, se);
         cfs_rq = group_cfs_rq(se);
     } while (cfs_rq);

     /* get the task_struct that contains the entity */
     p = task_of(se);
     if (hrtick_enabled(rq))
         hrtick_start_fair(rq, p);

     return p;
 }

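The code is fairly simple. The core is the do loop, which does two things: it calls pick_next_entity to choose a scheduling entity from the CFS run queue, and then calls set_next_entity to make that entity the one that runs next. Because CFS keeps its entities in a red-black tree, this step amounts to adjusting the tree. The loop itself exists for group scheduling, which we ignore for now.

Now pick_next_entity: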

 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
     /* take the leftmost entity of the red-black tree,
        i.e. the one with the smallest vruntime */
     struct sched_entity *se = __pick_first_entity(cfs_rq);
     struct sched_entity *left = se;

     /*
      * Avoid running the skip buddy, if running something else can
      * be done without getting too unfair.
      */
     if (cfs_rq->skip == se) {
         struct sched_entity *second = __pick_next_entity(se);
         if (second && wakeup_preempt_entity(second, left) < 1)
             se = second;
     }

     /*
      * Prefer last buddy, try to return the CPU to a preempted task.
      */
     if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
         se = cfs_rq->last;

     /*
      * Someone really wants this to run. If it's not unfair, run it.
      */
     if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
         se = cfs_rq->next;

     clear_buddies(cfs_rq, se);

     return se;
 }

The core of this function is __pick_first_entity, which is itself very simple; here is the code:

 struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
 {
     struct rb_node *left = cfs_rq->rb_leftmost;

     if (!left)
         return NULL;

     return rb_entry(left, struct sched_entity, run_node);
 }

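Because the next candidate is set up in advance after every selection (the leftmost node is cached), the function only has to read cfs_rq->rb_leftmost. pick_next_entity then performs three if checks, all of which use the same function, wakeup_preempt_entity, and returns the entity that was chosen. last refers to the process that most recently performed a wakeup, and next to the process that was most recently woken.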

 static int
 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 {
     s64 gran, vdiff = curr->vruntime - se->vruntime;

     if (vdiff <= 0)
         return -1;

     gran = wakeup_gran(curr, se);
     if (vdiff > gran)
         return 1;

     return 0;
 }

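This compares the vruntime of two entities. It returns -1 if curr is not ahead of se, 1 if curr->vruntime - se->vruntime exceeds a threshold, and 0 otherwise. The threshold comes from wakeup_gran and is normally based on sysctl_sched_wakeup_granularity.

So the logic is: once an entity has been chosen, check whether it is the one cfs_rq has been told to skip. If it is, take the next entity in the tree and compare its vruntime with that of the skipped one; as long as the gap does not exceed the threshold it is acceptable, and the later entity is chosen instead.

With the initial entity settled, cfs_rq->last is compared with left by vruntime and chosen if the gap is acceptable; then cfs_rq->next is checked in the same way and, if it is also within the acceptable range, becomes the final choice. NEXT_BUDDY means CFS prefers the most recently woken sched_entity when picking the next one, and LAST_BUDDY means it prefers the sched_entity that most recently performed a wakeup. Both policies help the CPU cache hit rate. As the code shows, next takes priority over last.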
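The buddy preference can be sketched as below. This is a simplified model: the granularity is a hypothetical constant instead of the weight-scaled value the kernel computes, the skip buddy is left out, and all toy_* names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entity { int64_t vruntime; };

/* Hypothetical fixed granularity; the kernel derives it from
 * sysctl_sched_wakeup_granularity and the entity's weight. */
#define TOY_WAKEUP_GRAN 1000

/* Same return convention as wakeup_preempt_entity():
 *  -1: curr is not ahead of se
 *   0: curr is ahead by at most the granularity
 *   1: curr is ahead by more than the granularity */
static int toy_wakeup_preempt_entity(const struct toy_entity *curr,
                                     const struct toy_entity *se)
{
    int64_t vdiff = curr->vruntime - se->vruntime;

    if (vdiff <= 0)
        return -1;
    if (vdiff > TOY_WAKEUP_GRAN)
        return 1;
    return 0;
}

/* Start from the leftmost entity, then prefer last, then next, as long
 * as each is not too far ahead of leftmost. next is tested last, so it
 * overrides last when both qualify. */
static const struct toy_entity *toy_pick(const struct toy_entity *left,
                                         const struct toy_entity *last,
                                         const struct toy_entity *next)
{
    const struct toy_entity *se = left;

    assert(left != NULL);
    if (last && toy_wakeup_preempt_entity(last, left) < 1)
        se = last;
    if (next && toy_wakeup_preempt_entity(next, left) < 1)
        se = next;
    return se;
}
```

A buddy is accepted only while its vruntime is within the granularity of the leftmost entity, which bounds how much fairness is traded for cache warmth.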

 static void
 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
     /* 'current' is not kept within the tree. */
     /* only an entity that is on the run queue has to be taken out of the tree */
     if (se->on_rq) {
         /*
          * Any task has to be enqueued before it get to execute on
          * a CPU. So account for the time it spent waiting on the
          * runqueue.
          */
         update_stats_wait_end(cfs_rq, se);
         /* remove se from the red-black tree; the cached leftmost
            node is updated if necessary */
         __dequeue_entity(cfs_rq, se);
     }

     update_stats_curr_start(cfs_rq, se);
     cfs_rq->curr = se;
 #ifdef CONFIG_SCHEDSTATS
     /*
      * Track our maximum slice length, if the CPU's load is at
      * least twice that of our own weight (i.e. dont track it
      * when there are only lesser-weight tasks around):
      */
     if (rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
         se->statistics.slice_max = max(se->statistics.slice_max,
             se->sum_exec_runtime - se->prev_sum_exec_runtime);
     }
 #endif
     se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }

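The function first checks whether the chosen entity is on the run queue; if it is, the entity is removed from the red-black tree through __dequeue_entity. I do not fully understand why the if is needed: a process must be enqueued before it can be scheduled, so the se chosen here should always have on_rq equal to 1, shouldn't it? One possible explanation, as far as I can tell, is that set_next_entity is also reached through set_curr_task_fair while a running task is temporarily dequeued, for instance during a change of scheduling class. After that, se->exec_start is set to record when the entity starts running, cfs_rq->curr is set to se, the remaining time bookkeeping is done, and the function returns. Back in pick_next_task_fair, the task_struct that corresponds to the entity is then returned.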
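The relationship between exec_start, sum_exec_runtime and prev_sum_exec_runtime can be sketched as follows. The toy_* functions are hypothetical stand-ins for what set_next_entity and update_curr do to these three fields; everything else those functions do is omitted.

```c
#include <assert.h>
#include <stdint.h>

struct toy_entity {
    uint64_t exec_start;
    uint64_t sum_exec_runtime;
    uint64_t prev_sum_exec_runtime;
};

/* The entity is put on the CPU at time now: remember the start time
 * and take a snapshot of the total runtime so far. */
static void toy_set_next(struct toy_entity *se, uint64_t now)
{
    se->exec_start = now;
    se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

/* Accounting step (from the tick, for example): add the time since
 * the last update to the total and restart the measurement. */
static void toy_update_curr(struct toy_entity *se, uint64_t now)
{
    assert(now >= se->exec_start);
    se->sum_exec_runtime += now - se->exec_start;
    se->exec_start = now;
}

/* Time consumed in the current slice. */
static uint64_t toy_ran_this_slice(const struct toy_entity *se)
{
    return se->sum_exec_runtime - se->prev_sum_exec_runtime;
}
```

The difference of the two sums is what lets the scheduler decide whether the current slice is used up without ever resetting the running total.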

The context_switch function

 static inline void
 context_switch(struct rq *rq, struct task_struct *prev,
            struct task_struct *next)
 {
     struct mm_struct *mm, *oldmm;

     /* preparation for the switch; it runs with the lock held and interrupts
        off, and is paired with finish_task_switch at the end */
     prepare_task_switch(rq, prev, next);

     mm = next->mm;
     oldmm = prev->active_mm;
     /*
      * For paravirt, this is coupled with an exit in switch_to to
      * combine the page table reload and the switch backend into
      * one hypercall.
      */
     arch_start_context_switch(prev);

     /* next is a kernel thread: borrow prev's address space */
     if (!mm) {
         next->active_mm = oldmm;
         atomic_inc(&oldmm->mm_count);
         enter_lazy_tlb(oldmm, next);
     } else
         switch_mm(oldmm, mm, next);

     /* prev (the task being switched out) is a kernel thread */
     if (!prev->mm) {
         prev->active_mm = NULL;
         rq->prev_mm = oldmm;
     }
     /*
      * Since the runqueue lock will be released by the next
      * task (which is an invalid locking op but in the case
      * of the scheduler it's an obvious special-case), so we
      * do an early lockdep release here:
      */
 #ifndef __ARCH_WANT_UNLOCKED_CTXSW
     spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
 #endif

     context_tracking_task_switch(prev, next);
     /* Here we just switch the register state and the stack. */
     switch_to(prev, next, prev);

     barrier();
     /*
      * this_rq must be evaluated again because prev may have moved
      * CPUs since it called schedule(), thus the 'rq' on its stack
      * frame will be invalid.
      */
     finish_task_switch(this_rq(), prev);
 }

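This function does two main things: it switches the address space, and it switches the register state and stack. The whole switch runs with the run-queue lock held and interrupts disabled. The address space is switched first. mm is next->mm, the mm_struct of the process being switched in, and oldmm is prev->active_mm, the address space in use by the process being switched out. If mm is NULL, next is a kernel thread; kernel threads have no address space of their own, so next simply borrows prev's active_mm. If mm is not NULL, next is a user process and the address space is switched directly with switch_mm. If prev is a kernel thread, it has no address space of its own either, so its active_mm is set to NULL, and the mm it had borrowed is recorded in rq->prev_mm.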
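The borrowing of address spaces by kernel threads can be sketched with two toy structures. This is an assumption-laden model: toy_mm, toy_task and toy_switch_mm are hypothetical, the reference count is a plain int, and the deferred_drop output plays the role that rq->prev_mm plays in the listing above.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm { int mm_count; };

struct toy_task {
    struct toy_mm *mm;          /* NULL for a kernel thread */
    struct toy_mm *active_mm;   /* address space actually in use */
};

/* Returns the mm the CPU uses after the switch. *deferred_drop receives
 * an mm whose borrowed reference the caller must drop once the switch
 * is finished, or NULL if there is nothing to drop. */
static struct toy_mm *toy_switch_mm(struct toy_task *prev,
                                    struct toy_task *next,
                                    struct toy_mm **deferred_drop)
{
    struct toy_mm *mm = next->mm;
    struct toy_mm *oldmm = prev->active_mm;

    assert(oldmm != NULL);
    *deferred_drop = NULL;

    if (!mm) {
        /* next is a kernel thread: borrow prev's address space */
        next->active_mm = oldmm;
        oldmm->mm_count++;
    } else {
        /* user process: its own mm is already its active_mm */
        assert(next->active_mm == mm);
    }

    if (!prev->mm) {
        /* prev was a kernel thread: hand the borrowed mm back later */
        prev->active_mm = NULL;
        *deferred_drop = oldmm;
    }
    return next->active_mm;
}
```

Switching from a user process to a kernel thread therefore leaves the page tables untouched, which is the whole point of the borrowing: no address-space switch is paid for a task that never touches user memory.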

Next, switch_to is called to switch the register state and the stack. This is the final part of a process switch.

 #define switch_to(prev, next, last)                    \
 do {                                    \
     /*                                \
      * Context-switching clobbers all registers, so we clobber    \
      * them explicitly, via unused output variables.        \
      * (EAX and EBP is not listed because EBP is saved/restored    \
      * explicitly for wchan access and EAX is the return value of    \
      * __switch_to())                        \
      */                                \
     unsigned long ebx, ecx, edx, esi, edi;                \
                                     \
     asm volatile("pushfl\n\t"        /* save    flags */    \
              "pushl %%ebp\n\t"        /* save    EBP   */    \
              "movl %%esp,%[prev_sp]\n\t"    /* save    ESP   */ \
              "movl %[next_sp],%%esp\n\t"    /* restore ESP   */ \
              "movl $1f,%[prev_ip]\n\t"    /* save    EIP   */    \
              "pushl %[next_ip]\n\t"    /* restore EIP   */    \
              __switch_canary                    \
              "jmp __switch_to\n"    /* regparm call  */    \
              "1:\t"                        \
              "popl %%ebp\n\t"        /* restore EBP   */    \
              "popfl\n"            /* restore flags */    \
                                     \
              /* output parameters */                \
              : [prev_sp] "=m" (prev->thread.sp),        \
                [prev_ip] "=m" (prev->thread.ip),        \
                "=a" (last),                    \
                                     \
                /* clobbered output registers: */        \
                "=b" (ebx), "=c" (ecx), "=d" (edx),        \
                "=S" (esi), "=D" (edi)                \
                                     \
                __switch_canary_oparam                \
                                     \
                /* input parameters: */                \
              : [next_sp]  "m" (next->thread.sp),        \
                [next_ip]  "m" (next->thread.ip),        \
                                     \
                /* regparm parameters for __switch_to(): */    \
                [prev]     "a" (prev),                \
                [next]     "d" (next)                \
                                     \
                __switch_canary_iparam                \
                                     \
              : /* reloaded segment registers */            \
             "memory");                    \
 } while (0)

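switch_to is in fact a macro made of a long stretch of inline assembly, and it is a little involved. It first pushes the flags register, then pushes EBP, and then saves the current ESP into prev->thread.sp. The following movl %[next_sp],%%esp loads next->thread.sp into ESP, so from this point on the code runs on next's stack; EBP has not been switched yet, though, so prev's local variables are still reachable. Next, movl $1f,%[prev_ip] stores the address of label 1 in prev->thread.ip, which is the address at which prev will resume the next time it is scheduled. At this point prev's state has been saved. pushl %[next_ip] then pushes next's saved EIP onto the stack: the macro is about to enter a function, and when that function executes ret, the address is popped from the stack into EIP, so next resumes from there, which is normally label 1 in this same code. Finally the macro enters __switch_to with jmp rather than call. A call would push the current EIP automatically, the function would return to where it was called, and running next would then require changing EIP by hand once more, which is more trouble.

After the switch, next continues from its own label 1, that is, with popl %%ebp and popfl. Note that the arguments of __switch_to come from the constraints at the bottom of the macro:

[prev] "a" (prev), \
[next] "d" (next)

They are placed in EAX and EDX. Anyone unfamiliar with GCC inline assembly in AT&T syntax can easily miss this; I will not go into it here, see the relevant documentation.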

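At this point the switch is done. The attentive reader may have noticed, however, that switch_to only switches between two processes and yet takes three parameters. Why?

The reason is to let the process being switched in know which process actually ran right before it. Why is that necessary?

Consider the following three switches and the values seen when switch_to executes:

1. A -> B: prev=A, next=B

2. B -> C: prev=B, next=C

3. C -> A: prev=C, next=A

Look at the third switch. A regains the CPU and A's stack is restored, so in A's own view prev=A and next=B. A has no idea that the process that actually ran before it was C, so some way is needed to tell it. That is what the third parameter is for: prev is carried across the switch in EAX and written to last through the "=a" (last) output constraint, so when A resumes, its prev variable holds C.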
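The A, B, C scenario can be replayed with a small simulation. Nothing here switches real stacks: each toy task merely records what its own frame would have contained, and a global variable stands in for the EAX register. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
    const char *name;
    /* Locals of context_switch() as frozen on this task's kernel stack
     * at the moment it was switched out. */
    const struct toy_task *saved_prev, *saved_next;
};

/* Stand-in for the EAX register, which carries prev across the switch. */
static const struct toy_task *toy_eax;

/* Switch from prev to next; returns `last` as next sees it on resume. */
static const struct toy_task *toy_switch_to(struct toy_task *prev,
                                            struct toy_task *next)
{
    assert(prev != NULL && next != NULL);
    prev->saved_prev = prev;   /* prev's stack keeps its own stale view */
    prev->saved_next = next;
    toy_eax = prev;            /* "a" (prev): input placed in EAX */
    /* from here on the code would be running on next's stack */
    return toy_eax;            /* "=a" (last): output read from EAX */
}
```

After A -> B, B -> C and C -> A, task A's saved frame still says prev=A and next=B, while the value returned by the third call is C, which is exactly the information finish_task_switch needs.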

References:

http://blog.chinaunix.net/uid-27767798-id-3548384.html
Linux kernel 3.11.1 source code
