linux内核CFS进程调度策略

一、概述

首先简介一下主要的设计思路，

CFS思路非常easy。就是依据各个进程的权重分配执行时间(权重怎么来的后面再说)。

进程的执行时间计算公式为:

分配给进程的执行时间 = 调度周期 * 进程权重 / 全部进程权重之和 (公式1)

调度周期非常好理解。就是将全部处于TASK_RUNNING态进程都调度一遍的时间，

几乎相同相当于O(1)调度算法中执行队列和过期队列切换一次的时间

（我对O(1)调度算法看得不是非常熟，如有错误还望各位大虾指出）。

举个样例。比方仅仅有两个进程A, B，权重分别为1和2，

调度周期设为30ms，那么分配给A的CPU时间为

30ms * (1/(1+2)) = 10ms

而B的CPU时间为

30ms * (2/(1+2)) = 20ms

那么在这30ms中A将执行10ms。B将执行20ms。

公平怎么体现呢？它们的执行时间并不一样阿？

事实上公平是体如今另外一个量上面。叫做virtual runtime(vruntime)。它记录着进程已经执行的时间，

可是并非直接记录，而是要依据进程的权重将执行时间放大或者缩小一个比例。

我们来看下从实际执行时间到vruntime的换算公式

vruntime = 实际执行时间 * 1024 / 进程权重。 (公式2)

为了不把大家搞晕。这里我直接写1024。实际上它等于nice为0的进程的权重，代码中是NICE_0_LOAD。

也就是说。全部进程都以nice为0的进程的权重1024作为基准。计算自己的vruntime添加速度。

还以上面AB两个进程为例。B的权重是A的2倍，那么B的vruntime添加速度仅仅有A的一半。

如今我们把公式2中的实际执行时间用公式1来替换。能够得到这么一个结果：

vruntime = (调度周期 * 进程权重 / 全部进程总权重) * 1024 / 进程权重=调度周期 * 1024 / 全部进程总权重

看出什么眉目没有？没错，尽管进程的权重不同，可是它们的vruntime增长速度应该是一样的（这里所说的增长速度一样，是从宏观上来看的。从上一篇文章能够看出来。而在上一篇文章中说vruntime的增量不同，是从公式分析得到的，算是局部分析，在公式2中，假设实际执行时间都是一样。非常显然权重小的增长的多。权重大的增长的小，我个人认为正是虚拟时钟的存在。转换了思想。才有了这个CFS,事实上还是依据权重来决定一个进程在一个调用周期内执行了多长时间，可是虚拟时钟决定了怎么调度这个过程，这就是思想），与权重无关。

好，既然全部进程的vruntime增长速度宏观上看应该是同一时候推进的。

那么就能够用这个vruntime来选择执行的进程。谁的vruntime值较小就说明它曾经占用cpu的时间较短，

受到了“不公平”对待，因此下一个执行进程就是它。

这样既能公平选择进程，又能保证高优先级进程

获得较多的执行时间。

这就是CFS的主要思想了。

再补充一下权重的来源，权重跟进程nice值之间有一一相应的关系，能够通过全局数组prio_to_weight来转换，

nice值越大，权重越低

以下来分析代码。网上已经有非常多cfs的文章。因此我打算换一个方式来写，选择几个点来进行情景分析，

包含进程创建时。进程被唤醒，主动调度(schedule)，时钟中断。

介绍代码之前先介绍一下CFS相关的结构

第一个是调度实体sched_entity，它代表一个调度单位。在组调度关闭的时候能够把他等同为进程。

每个task_struct中都有一个sched_entity，进程的vruntime和权重都保存在这个结构中。

那么全部的sched_entity怎么组织在一起呢？红黑树。全部的sched_entity以vruntime为key

(实际上是以vruntime-min_vruntime为单位，难道是防止溢出？反正结果是一样的)插入到红黑树中，

同一时候缓存树的最左側节点。也就是vruntime最小的节点，这样能够迅速选中vruntime最小的进程。

注意仅仅有等待CPU的就绪态进程在这棵树上，睡眠进程和正在执行的进程都不在树上。

我从ibm developer works上偷过来一张图来展示一下它们的关系：

汗。图片上传功能被关闭了。先盗链一个过来。别怪我没品哈。。。

如今開始分情景解析CFS。

二、创建进程

第一个情景选为进程创建时CFS相关变量的初始化。

我们知道。Linux创建进程使用fork或者clone或者vfork等系统调用，终于都会到do_fork。

假设没有设置CLONE_STOPPED，则会进入wake_up_new_task函数，我们看看这个函数的关键部分

/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created context, then puts the task
* on the runqueue and wakes it.
*/
void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
.....
if (!p->sched_class->task_new || !current->se.on_rq) {
activate_task(rq, p, 0);
} else {
/*
* Let the scheduling class do new task startup
* management (if any):
*/
p->sched_class->task_new(rq, p);
inc_nr_running(rq);
}
check_preempt_curr(rq, p, 0);
.....
}

上面那个if语句我不知道什么情况下会为真。我測试了一下。在上面两个分支各加一个计数器，

推断为真的情况仅仅有2次(我毫无依据的推測是idle进程和init进程)，而推断为假的情况有近万次。

因此我们仅仅看以下的分支，假设哪位前辈知道真相的话还望告诉我一声，万分感谢。

再以下就是检測是否可以形成抢占，假设新进程可以抢占当前进程则进行进程切换。

我们一个一个函数来看

p->sched_class->task_new相应的函数是task_new_fair:

/*
* Share the fairness runtime between parent and child, thus the
* total amount of pressure for CPU stays equal - new tasks
* get a chance to run but frequent forkers are not allowed to
* monopolize the CPU. Note: the parent runqueue is locked,
* the child is not running yet.
*/
static void task_new_fair(struct rq *rq, struct task_struct *p)
{
struct cfs_rq *cfs_rq = task_cfs_rq(p);
struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
int this_cpu = smp_processor_id();
sched_info_queued(p);
update_curr(cfs_rq);
place_entity(cfs_rq, se, 1);
/* 'curr' will be NULL if the child belongs to a different group */
if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
curr && curr->vruntime < se->vruntime) {
/*
* Upon rescheduling, sched_class::put_prev_task() will place
* 'current' within the tree based on its new key value.
*/
swap(curr->vruntime, se->vruntime);
resched_task(rq->curr);
}
enqueue_task_fair(rq, p, 0);
}

这里有两个重要的函数，update_curr，place_entity。

当中update_curr在这里能够忽略。它是更新进程的一些随时间变化的信息。我们放到后面再看，

place_entity是更新新进程的vruntime值。以便把他插入红黑树。

新进程的vruntime确定之后有一个推断，满足下面几个条件时，交换父子进程的vruntime：

1.sysctl设置了子进程优先执行

2.fork出的子进程与父进程在同一个cpu上

3.父进程不为空（这个条件为什么会发生暂不明确，难道是fork第一个进程的时候？）

4.父进程的vruntime小于子进程的vruntime

几个条件都还比較好理解，说下第四个，由于CFS总是选择vruntime最小的进程执行，

因此必须保证子进程vruntime比父进程小，作者没有直接把子进程的vruntime设置为较小的值，

而是採用交换的方法，能够防止通过fork新进程来大量占用cpu时间，立即还要讲到。

最后，调用enqueue_task_fair将新进程插入CFS红黑树中

以下我们看下place_entity是怎么计算新进程的vruntime的。

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = cfs_rq->min_vruntime;
/*
* The 'current' period is already promised to the current tasks,
* however the extra weight of the new task will slow them down a
* little, place the new task so that it fits in the slot that
* stays open at the end.
*/
if (initial && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);
if (!initial) {
//先不看这里，
}
se->vruntime = vruntime;
}

这里是计算进程的初始vruntime。

它以cfs队列的min_vruntime为基准，再加上进程在一次调度周期中所增长的vruntime。

这里并非计算进程应该执行的时间。而是先把进程的已经执行时间设为一个较大的值。

可是该进程明明还没有执行过啊，为什么要这样做呢？

如果新进程都能获得最小的vruntime(min_vruntime)，那么新进程会第一个被调度执行。

这样程序猿就能通过不断的fork新进程来让自己的程序一直占领CPU。这显然是不合理的，

这跟曾经使用时间片的内核中父子进程要平分父进程的时间片是一个道理。

再解释下min_vruntime，这是每一个cfs队列一个的变量，它一般小于等于全部就绪态进程

的最小vruntime。也有例外。比方对睡眠进程进行时间补偿会导致vruntime小于min_vruntime。

至于sched_vslice计算细节暂且不细看，大体上说就是把概述中给出的两个公式结合起来例如以下：

sched_vslice = (调度周期 * 进程权重 / 全部进程总权重) * NICE_0_LOAD / 进程权重

也就是算出进程应分配的实际cpu时间，再把它转化为vruntime。

把这个vruntime加在进程上之后，就相当于觉得新进程在这一轮调度中已经执行过了。

好了。到这里又能够回到wake_up_new_task(希望你还没晕，能回溯回去:-))，

看看check_preempt_curr(rq, p, 0);这个函数就直接调用了check_preempt_wakeup

/*
* Preempt the current task with a newly woken task if needed:
*/我略去了一些不太重要的代码
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se; //se是当前进程。pse是新进程
/*
* Only set the backward buddy when the current task is still on the
* rq. This can happen when a wakeup gets interleaved with schedule on
* the ->pre_schedule() or idle_balance() point, either of which can
* drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class, for
* obvious reasons its a bad idea to schedule back to the idle thread.
*/
if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
set_last_buddy(se);
set_next_buddy(pse);
while (se) {
if (wakeup_preempt_entity(se, pse) == 1) {
resched_task(curr);
break;
}
se = parent_entity(se);
pse = parent_entity(pse);
}
}

首先对于last和next两个字段给予说明。

假设这两个字段不为NULL，那么last指向近期被调度出去的进程，next指向被调度上cpu的进程。

比如A正在执行，被B抢占。那么last指向A。next指向B。

这两个指针有什么用呢?

当CFS在调度点选择下一个执行进程时，会优先照应这两个进程。我们后面会看到，这里仅仅要记住。

<

这两个指针仅仅使用一次。就是在上面这个函数退出后，返回用户空间时会触发schedule，

在那里选择下一个调度进程时会优先选择next，次优先选择last。选择完后。就会清空这两个指针。

这样设计的原因是，在上面的函数中检測结果是能够抢占并不代表已经抢占，而仅仅是设置了调度标志，

在最后触发schedule时抢占进程B并不一定是终于被调度的进程(为什么？由于我们推断是否能抢占

的依据是抢占进程B比执行进程A的vruntime小，但红黑树中可能有比抢占进程B的vruntime更小的进程C，

这样在调度时就会选中vruntime最小的C，而不是抢占进程B)。可是我们当然希望优先调度B，

由于我们就是为了执行B才设置了调度标志，所以这里用一个next指针指向B，以便给他个后门走，

假设B实在不争气，vruntime太大。就还是继续执行被抢占进程A比較合理，因此last指向被抢占进程。

这是一个比next小一点的后门，假设next走后门失败，就让被抢占进程A也走一次后门，

假设被抢占进程A也不争气。vruntime也太大，仅仅好从红黑树中挑一个vruntime最小的了。

无论它们走后门是否成功，一旦选出下一个进程，就立马清空这两个指针，不能老开着这个后门吧。

须要注意的是，schedule中清空这两个指针仅仅在2.6.29及之后的内核才有。之前的内核没有那句话。

然后调用wakeup_preempt_entity检測是否满足抢占条件。假设满足（返回值为1）

则对当前进程设置TIF_NEED_RESCHED标志。在退出系统调用时会触发schedule函数进行进程切换,

这个函数后面再说。

我们看看wakeup_preempt_entity(se, pse)。到底怎么推断后者是否可以抢占前者

/*
* Should 'se' preempt 'curr'.
*
* |s1
* |s2
* |s3
* g
* |<--->|c
*
* w(c, s1) = -1
* w(c, s2) = 0
* w(c, s3) = 1
*
*/
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
s64 gran, vdiff = curr->vruntime - se->vruntime;
if (vdiff <= 0)
return -1;
gran = wakeup_gran(curr);
if (vdiff > gran)
return 1;
return 0;
}

这个函数返回-1表示新进程vruntime大于当前进程，当然不能抢占。

返回0表示尽管新进程vruntime比当前进程小。可是没有小到调度粒度，一般也不能抢占

返回1表示新进程vruntime比当前进程小的超过了调度粒度，能够抢占。

调度粒度是什么概念呢？这个也非常好理解，仅仅是须要对前面的概念作出一点调整，

前面说每次都简单选择vruntime最小的进程调度，事实上也不全然是这样。

如果进程A和B的vruntime非常接近。那么A先执行了一个tick。vruntime比B大了，

B又执行一个tick，vruntime又比A大了。又切换到A。这样就会在AB间频繁切换。对性能影响非常大。

因此假设当前进程的时间没实用完，就仅仅有当有进程的vruntime比当前进程小超过调度粒度时。

才干进行进程切换。

函数上面凝视中那个图就是这个意思，我们看下：

横坐标表示vruntime。s1 s2 s3分别表示新进程，c表示当前进程，g表示调度粒度。

s3肯定能抢占c。而s1不可能抢占c。

s2尽管vruntime比c小。可是在调度粒度之内，是否能抢占要看情况，像如今这样的状况就不能抢占。

到这里，创建进程时的调度相关代码就介绍完了。

三、唤醒进程

我们再看看唤醒进程时的CFS动作。看下函数try_to_wake_up。非常长的函数，仅仅留几行代码

/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
* @state: the mask of task states that can be woken
* @sync: do a synchronous wakeup?
*
* Put it on the run-queue if it's not already there. The "current"
* thread is always on the run-queue (except when the actual
* re-schedule is in progress), and as such you're allowed to do
* the simpler "current->state = TASK_RUNNING" to mark yourself
* runnable without the overhead of this.
*
* returns failure only if the task is already active.
*/
static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
{
int cpu, orig_cpu, this_cpu, success = 0;
unsigned long flags;
struct rq *rq;
rq = task_rq_lock(p, &flags);
if (p->se.on_rq)
goto out_running;
update_rq_clock(rq);
activate_task(rq, p, 1);
success = 1;
out_running:
check_preempt_curr(rq, p, sync);
p->state = TASK_RUNNING;
out:
current->se.last_wakeup = current->se.sum_exec_runtime;
task_rq_unlock(rq, &flags);
return success;
}

update_rq_clock就是更新cfs_rq的时钟，保持与系统时间同步。

重点是activate_task，它将进程增加红黑树而且对vruntime做一些调整。

然后用check_preempt_curr检查是否构成抢占条件。假设能够抢占则设置TIF_NEED_RESCHED标识。

由于check_preempt_curr讲过了，我们仅仅顺着以下的顺序走一遍

activate_task

-->enqueue_task

-->enqueue_task_fair

-->enqueue_entity

-->place_entity

static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
{
if (task_contributes_to_load(p))
rq->nr_uninterruptible--;
enqueue_task(rq, p, wakeup);
inc_nr_running(rq); //执行队列上的就绪任务多了一个
}
static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
{
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1; //被唤醒的任务会将on_rq设为1
}
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
for_each_sched_entity(se) {
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
enqueue_entity(cfs_rq, se, wakeup);
wakeup = 1;
}
hrtick_update(rq);
}
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
account_entity_enqueue(cfs_rq, se);
if (wakeup) {
place_entity(cfs_rq, se, 0);
enqueue_sleeper(cfs_rq, se);
}
update_stats_enqueue(cfs_rq, se);
check_spread(cfs_rq, se);
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se); //把进程增加CFS红黑树
}

这里还须要再看一遍place_entity，前面尽管看过一次，可是第三个參数不一样。

当參数3为0的时候走的是还有一条路径，我们看下

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = cfs_rq->min_vruntime;
/*
* The 'current' period is already promised to the current tasks,
* however the extra weight of the new task will slow them down a
* little, place the new task so that it fits in the slot that
* stays open at the end.
*/
if (initial && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);
if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS)) {
unsigned long thresh = sysctl_sched_latency;
/*
* convert the sleeper threshold into virtual time
*/
if (sched_feat(NORMALIZED_SLEEPER))
thresh = calc_delta_fair(thresh, se);
vruntime -= thresh;
}
/* ensure we never gain time by being placed backwards. */
vruntime = max_vruntime(se->vruntime, vruntime);
}
se->vruntime = vruntime;
}

initial不同一时候两条路径有什么不同呢？路径1是新创建任务的时候对其vruntime进行初始化，将它放在红黑树右端。

而以下这条路是唤醒睡眠任务时的代码，我们设想一个任务睡眠了非常长时间，它的vruntime就一直不会更新，

这样当它醒来时vruntime会远远小于执行队列上的不论什么一个任务，于是它会长期占用CPU，这显然是不合理的。

所以这要对唤醒任务的vruntime进行一些调整，我们能够看到。这里是用min_vruntime减去一个thresh,

这个thresh的计算过程就是将sysctl_sched_latency换算成进程的vruntime，而这个sysctl_sched_latency

就是默认的调度周期。单核CPU上一般为20ms。之所以要减去一个值是为了对睡眠进程做一个补偿，

能让它醒来时能够高速的到CPU。

个人感觉这个设计很聪明，曾经O(1)调度器有一个复杂的公式（到如今我也没能记住），

用来区分交互式进程和CPU密集型进程，详情请參考ULK等书。而如今CFS无须再使用那个复杂的公式了。

仅仅要是常睡眠的进程，它被唤醒时一定会有非常小的vruntime。能够立马执行，省却了非常多特殊情况的处理。

同一时候还要注意那句凝视 ensure we never gain time by being placed backwards。

本来这里是给由于长时间睡眠而vruntime远远小于min_vruntime的进程补偿的，可是有些进程仅仅睡眠非常短时间，

这样在它醒来后vruntime还是大于min_vruntime，不能让进程通过睡眠获得额外的执行时间。所以最后选择

计算出的补偿时间与进程原本vruntime中的较大者。

到这里，place_entity就讲完了。

可是我有一个问题，为什么计算thresh要用整个调度周期换算成vruntime？

个人感觉应该用(调度周期 * 进程权重 / 全部进程总权重)再换算成vruntime才合理阿，用整个调度周期

是不是补偿太多了？

四、进程调度schedule

以下看下主动调度代码schedule。

/*
* schedule() is the main scheduler function.
*/
asmlinkage void __sched schedule(void)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
struct rq *rq;
int cpu;
need_resched:
preempt_disable(); //在这里面被抢占可能出现故障。先禁止它！
cpu = smp_processor_id();
rq = cpu_rq(cpu);
rcu_qsctr_inc(cpu);
prev = rq->curr;
switch_count = &prev->nivcsw;
release_kernel_lock(prev);
need_resched_nonpreemptible:
spin_lock_irq(&rq->lock);
update_rq_clock(rq);
clear_tsk_need_resched(prev); //清除须要调度的位
//state==0是TASK_RUNNING，不等于0就是准备睡眠，正常情况下应该将它移出执行队列
//可是还要检查下是否有信号过来。假设有信号而且进程处于可中断睡眠就唤醒它
//注意对于须要睡眠的进程。这里调用deactive_task将其移出队列而且on_rq也被清零
//这个deactivate_task函数就不看了，非常easy
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
if (unlikely(signal_pending_state(prev->state, prev)))
prev->state = TASK_RUNNING;
else
deactivate_task(rq, prev, 1);
switch_count = &prev->nvcsw;
}
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
//这两个函数都是重点，我们以下分析
prev->sched_class->put_prev_task(rq, prev);
next = pick_next_task(rq, prev);
if (likely(prev != next)) {
sched_info_switch(prev, next);
rq->nr_switches++;
rq->curr = next;
++*switch_count;
//完毕进程切换，不讲了，跟CFS没关系
context_switch(rq, prev, next); /* unlocks the rq */
/*
* the context switch might have flipped the stack from under
* us, hence refresh the local variables.
*/
cpu = smp_processor_id();
rq = cpu_rq(cpu);
} else
spin_unlock_irq(&rq->lock);
if (unlikely(reacquire_kernel_lock(current) < 0))
goto need_resched_nonpreemptible;
preempt_enable_no_resched();
//这里新进程也可能有TIF_NEED_RESCHED标志，假设新进程也须要调度则再调度一次
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}

首先看put_prev_task，它等于put_prev_task_fair，后者基本上就是直接调用put_prev_entity

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
/*
* If still on the runqueue then deactivate_task()
* was not called and update_curr() has to be done:
*/
//记得这里的on_rq吗？在schedule函数中假设进程状态不是TASK_RUNNING，
//那么会调用deactivate_task将prev移出执行队列，on_rq清零。因此这里也是仅仅有当
//prev进程仍然在执行态的时候才须要更新vruntime等信息。
//假设prev进程由于被抢占或者由于时间到了而被调度出去则on_rq仍然为1
if (prev->on_rq)
update_curr(cfs_rq);
check_spread(cfs_rq, prev);
//这里也一样，仅仅有当prev进程仍在执行状态的时候才须要更新vruntime信息
//实际上正在cpu上执行的进程是不在红黑树中的，仅仅有在等待CPU的进程才在红黑树
//因此这里将调度出的进程又一次增加红黑树。

on_rq并不代表在红黑树中，而是代表在执行状态
if (prev->on_rq) {
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
//这个函数也不跟进去了，就是把进程以(vruntime-min_vruntime)为key增加到红黑树中
__enqueue_entity(cfs_rq, prev);
}
//没有当前进程了。这个当前进程将在pick_next_task中更新
cfs_rq->curr = NULL;
}

再回到schedule中看看pick_next_task函数。基本上也就是直接调用pick_next_task_fair

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
struct task_struct *p;
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
if (unlikely(!cfs_rq->nr_running))
return NULL;
do {
//这两个函数是重点，选择下一个要运行的任务
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
p = task_of(se);
hrtick_start_fair(rq, p);
return p;
}

主要看下pick_next_entity和set_next_entity

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
//__pick_next_entity就是直接选择红黑树缓存的最左结点，也就是vruntime最小的结点
struct sched_entity *se = __pick_next_entity(cfs_rq);
//以下的wakeup_preempt_entity已经讲过。忘记的同学能够到上面看下
//这里就是前面所说的优先照应next和last进程，仅仅有当__pick_next_entity选出来的进程
//的vruntime比next和last都小超过调度粒度时才轮到它执行，否则就是next或者last
if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
return cfs_rq->next;
if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
return cfs_rq->last;
return se;
}
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/* 'current' is not kept within the tree. */
//这里什么情况下条件会为假？我以为刚唤醒的进程可能不在rq上
//可是回到上面去看了下，唤醒的进程也通过activate_task将on_rq置1了
//新创建的进程on_rq也被置1，这里什么情况会为假，想不出来
//这里我也測试了一下。在条件为真假的路径上各设置了一个计数器
//当条件为真经过了将近五十万次的时候条件为假仅有一次。
//所以我们能够觉得基本上都会直接进入if语句块执行
if (se->on_rq) {
/*这里凝视是不是写错了？dequeued写成了enqueued?
* Any task has to be enqueued before it get to execute on
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
update_stats_wait_end(cfs_rq, se);
//就是把结点从红黑树上取下来。

前面说过，占用CPU的进程不在红黑树上
__dequeue_entity(cfs_rq, se);
}
update_stats_curr_start(cfs_rq, se);
cfs_rq->curr = se; //OK。在put_prev_entity中清空的curr在这里被更新
//将进程执行总时间保存到prev_..中。这样进程本次调度的执行时间能够用以下公式计算：
//进程本次执行已占用CPU时间 = sum_exec_runtime - prev_sum_exec_runtime
//这里sum_exec_runtime会在每次时钟tick中更新
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

到此schedule函数也讲完了。

关于dequeue_task，dequeue_entity和__dequeue_entity三者差别

前两者差不太多。不同的那一部分我也没看明确。。。主要是它们都会将on_rq清零。

我认为是当进程要离开TASK_RUNNING状态时调用。这两个函数能够将进程取下执行队列。

而__dequeue_entity不会将on_rq清零，仅仅是将进程从红黑树上取下，

我认为一般用在进程将获得CPU的情况，这时须要将它从红黑树取下，可是还要保留在rq上。

五、时钟中断

接下来的情景就是时钟中断，时钟中断在time_init_hook中初始化，中断函数为timer_interrupt

依照例如以下路径

timer_interrupt

-->do_timer_interrupt_hook

-->这里有一个回调函数，在我机器上測试调用的是tick_handle_oneshot_broadcast

-->从tick_handle_oneshot_broadcast后面一部分过程怎么走的没搞清楚，有时间用kgdb跟踪一下

-->反正最后是到了tick_handle_periodic

-->tick_periodic

-->update_process_times

-->scheduler_tick 这里面跟CFS相关的主要就是更新了cfs_rq的时钟

-->通过回调函数调到task_tick_fair，没作什么事，直接进入了entity_tick

-->entity_tick这个函数看下

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
//....无关代码
if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
check_preempt_tick(cfs_rq, curr);
}

entity_tick函数就是更新状态信息，然后检測是否满足抢占条件。

前面我们一直忽略update_curr,这里须要看一下了

static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_of(cfs_rq)->clock; //这个clock刚刚在scheduler_tick中更新过
unsigned long delta_exec;
/*
* Get the amount of time the current task was running
* since the last time we changed load (this cannot
* overflow on 32 bits):
*/
//exec_start记录的是上一次调用update_curr的时间，我们用当前时间减去exec_start
//就得到了从上次计算vruntime到如今进程又执行的时间，用这个时间换算成vruntime
//然后加到vruntime上，这一切是在__update_curr中完毕的
delta_exec = (unsigned long)(now - curr->exec_start);
__update_curr(cfs_rq, curr, delta_exec);
curr->exec_start = now;
if (entity_is_task(curr)) {
struct task_struct *curtask = task_of(curr);
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
}
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
unsigned long delta_exec)
{
unsigned long delta_exec_weighted;
//前面说的sum_exec_runtime就是在这里计算的，它等于进程从创建開始占用CPU的总时间
curr->sum_exec_runtime += delta_exec;
//以下变量的weighted表示这个值是从执行时间考虑权重因素换算来的vruntime。再写一遍这个公式
//vruntime(delta_exec_weighted) = 实际执行时间(delta_exe) * 1024 / 进程权重
delta_exec_weighted = calc_delta_fair(delta_exec, curr);
//将进程刚刚执行的时间换算成vruntime后立马加到进程的vruntime上。
curr->vruntime += delta_exec_weighted;
//由于有进程的vruntime变了，因此cfs_rq的min_vruntime可能也要变化，更新它。
//这个函数不难，就不跟进去了，就是先取tmp = min(curr->vruntime,leftmost->vruntime)
//然后cfs_rq->min_vruntime = max(tmp, cfs_rq->min_vruntime)
update_min_vruntime(cfs_rq);
}

OK。更新完CFS状态之后回到entity_tick中，这时须要检測是否满足抢占条件。这里也是CFS的关键之中的一个

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;
//这里sched_slice跟上面讲过的sched_vslice非常象，只是sched_vslice换算成了vruntime，
//而这里这个就是实际时间，没有经过换算，返回值的就是此进程在一个调度周期中应该执行的时间
ideal_runtime = sched_slice(cfs_rq, curr);
//上面提到过这个公式了，计算进程已占用的CPU时间。假设超过了应该占用的时间（ideal_runtime）
//则设置TIF_NEED_RESCHED标志，在退出时钟中断的过程中会调用schedule函数进行进程切换
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime)
resched_task(rq_of(cfs_rq)->curr);
}

附:一个小測试

看prio_to_weight数组，不同nice值的进程权重差距相当大。最大权重是最小权重的6000倍左右，

是不是意味着假设同一时候有两个进程A和B在系统中执行。A的nice为－20，B的为19，那么A分配的执行时间

是B的6000倍呢？没错。我做了一个实验，先执行例如以下程序，

int main()
{
errno = 0;
if(fork()) {
setpriority(PRIO_PROCESS, 0, -20);
} else {
setpriority(PRIO_PROCESS, 0, 19);
}
if(errno) {
printf("failed/n");
exit(EXIT_FAILURE);
}
printf("pid:%d/n", getpid());
while(1);
return 0;
}

然后再插入例如以下模块

#define T1 (第一个进程的pid)
#define T2 (第二个进程的pid)
static int __init sched_test_init(void)
{
struct task_struct *p;
for_each_process(p) {
if(p->pid == T1 || p->pid == T2)
printk("%d runtime:%llu/n", p->pid, p->se.sum_exec_runtime);
}
return -1; //返回-1防止模块真正插入，我们仅仅须要打印出上面的信息就能够了。
}

再dmesg查看结果，我測试过两次。一次设置nice分别为0和6，那么权重之比为

1024 / 272 = 3.7647。结果执行时间之比为 146390068851 / 38892147894 = 3.7640

能够看到结果相当接近，

还有一次设置nice分别为-20和19，权重之比为88761 / 15 = 5917.4000，

结果执行时间之比为187800781308 / 32603290 = 5760.1788。也非常接近。

能够看到，上面的权重与执行时间成正比的结论是成立的。

实际上，当我执行一个nice为-20的程序后，整个系统很卡，差点儿成了幻灯片，

也说明nice值的不同带来的差距很明显。

另外有一点值得一提，尽管整个系统非常卡，可是对鼠标键盘的响应还是非常快。

我打字的时候差点儿不会有什么延迟，这也说明。尽管CFS没有通过复杂的经验公式区分

交互式进程。可是它的设计思路使他天然地对交互式进程的响应可能比O(1)调度还要好。

http://blog.csdn.net/dog250/article/details/5302850

（总结：1、在进程执行时，其vruntime稳定的添加，它在红黑树中总是向右移动。由于越重要的进程vruntime添加越慢，因此他们向右移动的速度也越慢。这样其被调度的机会要大于次要进程，这刚好是我们须要的。2、假设进程进入睡眠，则其vruntime保持不变，由于每一个队列min_vruntime同一时候会添加（由于他们是单调的），那么睡眠进程醒来后，在红黑树中的位置会更靠左。由于其键值变得更小了。注意底层使用的是红黑树结构）

巴特西

linux内核CFS进程调度策略

最新文章

热门文章