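Overview

Swappable pages

Only a few kinds of pages are swapped out to a swap area. Every other page has a backing store on a block device and is simply written back there. The following kinds of pages go to the swap area:

Pages of category MAP_ANONYMOUS, which are not associated with a file, for example a process stack or a memory region mapped anonymously with mmap.
Private mappings of a process, which map files whose modifications are not written back to the underlying block device; these are usually swapped out to the swap area.
All pages that belong to the process heap, including pages allocated with malloc.
Pages that implement an interprocess communication mechanism, for example shared memory pages used to exchange data between processes.

Page thrashing

Page thrashing is intensive data transfer between the swap area and physical memory: pages are swapped out and back in over and over. It occurs when important data is needed again shortly after it has been swapped out, and it becomes more likely as the number of processes in the system grows.

To prevent thrashing, the main problem the kernel has to solve is determining the working set of a process (the pages it uses most frequently) as accurately as possible, so that the least important pages move to the swap area or another backing store while the truly important pages stay resident in memory.

Page-swapping algorithms

Second chance

An improved version of FIFO: a page whose reference bit is set is not evicted when it reaches the head of the queue; the bit is cleared and the page gets a second chance.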
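
The second-chance idea can be sketched in a few lines of C. This is a minimal illustration, not kernel code: struct frame, second_chance_victim, and the circular hand are names assumed for the example, and the frame array is assumed to be non-empty.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical frame descriptor, used only in this sketch. */
struct frame {
    int  page;        /* page number held by the frame */
    bool referenced;  /* set on access, cleared when the frame is spared */
};

/*
 * Pick a victim among n frames kept in FIFO order, starting at *hand.
 * A frame whose referenced bit is set is spared once: the bit is
 * cleared and the hand moves on. Returns the index of the victim.
 * n must be greater than 0.
 */
size_t second_chance_victim(struct frame *frames, size_t n, size_t *hand)
{
    for (;;) {
        size_t idx = *hand;
        struct frame *f = &frames[idx];

        *hand = (idx + 1) % n;
        if (!f->referenced)
            return idx;
        f->referenced = false;  /* second chance used up */
    }
}
```

The loop always terminates: after at most one full pass every referenced bit has been cleared, so the next frame visited is chosen.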

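LRU algorithm

LRU stands for least recently used: the page that has gone unused for the longest time is evicted first.

Page swapping and reclaim in the Linux kernel

Organization of the swap area

Swapped-out pages are kept either in a dedicated partition without a filesystem or in a fixed-length file on an existing filesystem.

Each swap area can also be assigned a priority according to its speed, and the kernel chooses among swap areas by priority.

Each swap area is divided into a number of contiguous slots, each exactly the size of one page frame of the system. On most processors this is 4 KiB, although newer systems often use larger pages.

In principle, any page in the system can be stored in any slot of a swap area. The kernel nevertheless uses a construction called clustering so that the swap area can be accessed as quickly as possible. Consecutive pages of a process memory region (or at least pages swapped out consecutively) are written to disk one after another in clusters of a fixed size (usually 256 pages). If the swap area has no room left for a cluster of that length, the kernel may use free slots at any other position.

If several swap areas have the same priority, the kernel uses a round-robin scheme so that they are used as evenly as possible. If the priorities differ, the kernel uses the higher-priority swap areas first and only then moves gradually to the lower-priority ones.

To track where pages are located in the swap partitions, the kernel maintains data structures that hold this information in memory. The most important member of these structures is a bitmap that tracks the used/free state of each slot in the swap area. Other members hold data that supports choosing the next slot to use and implementing clustering.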

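Checking memory usage

Selecting pages to swap out

The kernel combines the ideas discussed above into a coarse-grained LRU method. It relies on a single hardware feature, an accessed bit that is set after a page has been accessed. This feature is available on all architectures the kernel supports and is also easy to emulate.

In contrast to generic algorithms, the kernel's LRU implementation is based on two linked lists, the active list and the inactive list (every zone in the system has such a pair). As the names suggest, all pages in active use are on one list, while all inactive pages are on the other; inactive pages may still be mapped into one or more processes but are not used often. To distribute pages between the two lists, the kernel periodically performs a balancing operation that uses the accessed bit to decide whether a page is active or inactive, in other words whether applications in the system access it frequently. Pages can move between the lists in both directions. Such a transfer does not happen on every page access, only at longer intervals.

Over time, the least-used pages collect at the tail of the inactive list. When memory runs short, the kernel chooses these pages for swap-out. Because they have been used very little up to that point, swapping them out is, by the LRU principle, the least disruptive to the system.

Handling page faults

All architectures on which Linux runs support page faults. A page fault is triggered when a page in the virtual address space is accessed but is not present in physical memory. The page fault tells the kernel to read the missing data from the swap area or another backing store.

Shrinking kernel caches

The methods for shrinking the various caches are still implemented separately, because the caches differ so much in structure that a single generic shrinking algorithm is hard to apply. The kernel does, however, provide a general framework for managing the individual shrinking methods. Functions that shrink a cache are called shrinkers in the kernel and can be registered dynamically. When memory is short, the kernel calls all registered shrinkers to obtain memory.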

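Managing swap areas

Data structures

The cornerstone of swap-area management is the swap_info array defined in mm/swapfile.c; each entry stores information about one swap area in the system:

static struct swap_info_struct swap_info[MAX_SWAPFILES]; // MAX_SWAPFILES defaults to 32

When the kernel uses the term swap file, it means not only files used for swapping but also swap partitions, so the array covers both types.

Characterization of swap areas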

<linux/swap.h>
struct swap_info_struct {
    unsigned int flags;
    int prio;           /* swap priority */
    struct file *swap_file;
    struct block_device *bdev;
    struct list_head extent_list;
    struct swap_extent *curr_swap_extent;
    unsigned old_block_size;
    unsigned short * swap_map;
    unsigned int lowest_bit;
    unsigned int highest_bit;
    unsigned int cluster_next;
    unsigned int cluster_nr;
    unsigned int pages;
    unsigned int max;
    unsigned int inuse_pages;
    int next;           /* next entry on swap list */
};

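flags describes the state of the swap area. SWP_USED indicates that the entry in the swap array is in use. SWP_WRITEOK specifies that the swap area for this entry is writable. Both flags are set once the swap area has been inserted into the kernel; the combination of the two is abbreviated SWP_ACTIVE.
swap_file points to the file structure associated with the swap area. For a swap partition, it points to the device file of the partition on the block device (/dev/hda5 in our example). For a swap file, it points to the file instance of the file in question (/mnt/swap1 or /tmp/swap2 in the example).
bdev points to the block_device structure of the underlying block device on which the file or partition resides.
The relative priority of the swap area is stored in prio.
pages holds the total number of usable slots in the swap area.
max holds the total number of pages the swap area currently contains. Unlike pages, it counts not only usable slots but also slots that are damaged (for example because of block device faults) or used for management purposes.
swap_map is a pointer to an array of short integers with as many entries as the swap area has slots. The array serves as a usage counter: each entry gives the number of processes that share the corresponding swapped-out page.
next establishes a relative order among the swap areas: it holds the index of the next entry on the swap list.

The kernel also defines the global variable swap_list in mm/swapfile.c. It is an instance of the swap_list_t data type, which is defined specifically for finding the first (highest-priority) swap area.

To reduce the time spent scanning the whole swap area for a free slot, the kernel uses the lowest_bit and highest_bit members to manage the lower and upper bounds of the search region. There are no free slots below lowest_bit or above highest_bit, so searching those regions is pointless.
Two more members, cluster_next and cluster_nr, implement the clustering technique mentioned briefly above. The former is the index of the slot to use next in the swap area (within an existing cluster); cluster_nr is the number of slots still available in the current cluster. Once these free slots are consumed, a new cluster must be established; if there are not enough free slots for a new cluster, the kernel falls back to fine-grained allocation (slots are no longer allocated by cluster).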
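
As a rough illustration of how lowest_bit and highest_bit narrow the search, the following sketch allocates a free slot from a toy swap map. The structure toy_swap_area and the function toy_alloc_slot are hypothetical; they leave out clustering and locking. Slot 0 is assumed to be reserved for the header, so 0 doubles as the failure value.

```c
/* Toy stand-in for the few swap_info_struct members used here. */
struct toy_swap_area {
    unsigned short *swap_map;   /* usage count per slot; 0 means free */
    unsigned int lowest_bit;    /* no free slot below this index */
    unsigned int highest_bit;   /* no free slot above this index; 0 = none */
};

/*
 * Return the index of a newly allocated slot, or 0 if no slot is free.
 * Only the window [lowest_bit, highest_bit] is searched, and the window
 * shrinks whenever a slot on its edge is taken.
 */
unsigned int toy_alloc_slot(struct toy_swap_area *si)
{
    unsigned int off;

    if (si->highest_bit == 0)
        return 0;

    for (off = si->lowest_bit; off <= si->highest_bit; off++) {
        if (si->swap_map[off] != 0)
            continue;           /* slot in use, keep looking */
        si->swap_map[off] = 1;  /* one user from now on */
        if (off == si->lowest_bit)
            si->lowest_bit++;
        if (off == si->highest_bit)
            si->highest_bit--;
        return off;
    }
    return 0;
}
```

The sketch always searches from the lower bound; the kernel additionally prefers the next slot of the current cluster before it falls back to a scan of this kind.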

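Extents for implementing non-contiguous swap areas

When a file is used as a swap area, things are more complicated, because there is no guarantee that all blocks of the file are contiguous on disk.

The file is therefore described by a list of extents, each of which covers one run of blocks that is contiguous on disk.

The extent structure struct swap_extent is defined as follows: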

<linux/swap.h>
struct swap_extent {
    struct list_head list;
    pgoff_t start_page;
    pgoff_t nr_pages;
    sector_t start_block;
};

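list is a list element used to manage the extent structures on a standard doubly linked list.
start_page holds the number of the first slot in the extent.
nr_pages specifies how many pages fit into the extent.
start_block is the block number on disk of the first block of the extent.

An additional member of swap_info_struct, curr_swap_extent, holds a pointer to the swap_extent instance in the extent list that was accessed last. Every new search starts from this instance. Because slots are usually accessed consecutively, the block being searched for normally lies in this extent or the next one.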
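
The lookup described above can be sketched as follows. This is a simplified user-space model under stated assumptions: the extents sit in an array instead of a linked list, *curr plays the role of curr_swap_extent, block numbers are counted in page-sized units, and all toy_ names are hypothetical.

```c
#include <stddef.h>

typedef unsigned long toy_pgoff_t;   /* stand-in for pgoff_t */
typedef unsigned long toy_sector_t;  /* stand-in for sector_t */

struct toy_extent {
    toy_pgoff_t  start_page;   /* first slot covered by the extent */
    toy_pgoff_t  nr_pages;     /* number of slots in the extent */
    toy_sector_t start_block;  /* disk block of the first slot */
};

/*
 * Translate a slot number into a disk block. The search starts at the
 * extent that matched last time (*curr) and wraps around once.
 * Returns 0 on success and -1 if no extent covers the slot.
 */
int toy_map_swap_page(const struct toy_extent *ext, size_t n, size_t *curr,
                      toy_pgoff_t offset, toy_sector_t *block)
{
    size_t i, idx = *curr;

    for (i = 0; i < n; i++) {
        const struct toy_extent *e = &ext[idx];

        if (offset >= e->start_page &&
            offset < e->start_page + e->nr_pages) {
            *curr = idx;  /* remember the hit for the next lookup */
            *block = e->start_block + (offset - e->start_page);
            return 0;
        }
        idx = (idx + 1) % n;
    }
    return -1;
}
```

With mostly sequential slot accesses, the first or second extent inspected is usually the right one, which is the reason for caching the last hit.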

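Creating a swap area

The mkswap command performs the following steps:

The length of the desired swap area is divided by the page size of the machine to determine how many pages it can hold.
The disk blocks of the swap area are checked one by one for read and write errors to identify defective regions. Because the swap area uses the machine's page size as its page length, one bad block reduces the capacity of the swap area by one page.
A list of all bad block addresses is written to the first page of the swap area.
To identify the swap area to the kernel, the SWAPSPACE2 signature is placed at the end of the first page.
The number of usable slots is also stored in the swap area header. The value is obtained by subtracting the number of bad blocks from the total number of slots. A further 1 must be subtracted because the first page holds the state information and the bad block list.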
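
The arithmetic in these steps is small enough to write down directly. The helper below is a hypothetical illustration, not part of mkswap; it assumes that the header page is not itself counted among the bad pages.

```c
/*
 * Number of usable slots in a swap area of area_bytes bytes:
 * total pages, minus the pages lost to bad blocks, minus the
 * header page that holds the signature and the bad block list.
 */
unsigned long toy_usable_slots(unsigned long long area_bytes,
                               unsigned long page_size,
                               unsigned long nr_badpages)
{
    unsigned long long total = area_bytes / page_size;

    if (total < 1ULL + nr_badpages)
        return 0;  /* area too small to hold anything */
    return (unsigned long)(total - nr_badpages - 1);
}
```

For a 1 MiB area with 4 KiB pages and three bad pages, this gives 256 - 3 - 1 = 252 usable slots.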

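Activating a swap area

swapon->sys_swapon:

First, the kernel looks for a free entry in the swap_info array and assigns initial values to it. If a block device partition is used as the swap area, bd_claim is used to claim the associated block_device instance.
Once the swap file (or swap partition) has been opened, the bad block information and the length of the swap area are read from the first page.
setup_swap_extents initializes the extent list.
Finally, the new swap area is added to the list of swap areas according to its priority.

The swap cache

The kernel uses a further cache, the swap cache, which acts as a coordinator between the operation that selects pages to swap out and the mechanism that actually performs the swapping.

The swap cache acts as an agent between the page selection policy and the mechanism that transfers data between memory and the swap area. The two parts interact through the swap cache: input from one side triggers the corresponding action on the other. Note, however, that for pages that need not be swapped out but can be synchronized, the policy routines can interact directly with the writeback routines.

When a page is swapped out, the page selection logic first picks a suitable, rarely used page frame. The page frame is buffered in the page cache and then transferred to the swap cache.

If the swapped-out page is used by several processes at the same time, the kernel must set the corresponding page table entries in the page directories of those processes to point to the relevant position in the swap file. When one of the processes accesses the page's data, the page is swapped in again, and that process's page table entry for the page is set to the page's current memory address. This causes a problem, however. The page table entries of the other processes still point to the position in the swap file, because although the number of processes sharing a page can be determined, it is not possible to determine which processes they are.

When shared pages are swapped in, they therefore remain in the swap cache until all processes have requested the page from the swap area and have learned its new position in memory.

Without the swap cache, the kernel could not determine whether a shared page has already been swapped in, which would inevitably lead to redundant reads of the data.

When a shared page is swapped out, rmap finds all processes that reference the page's data. The relevant page table entries in all of those processes can therefore be updated to point to the corresponding position in the swap area. This means the page's data can be swapped out immediately and need not stay in the swap cache for long.

Identifying swapped-out pages

In the page table entry of a swapped-out page, all CPUs store the following information:

A flag indicating that the page has been swapped out.
The number of the swap area in which the page is located.
The offset of the corresponding slot, used to find the page's slot within the swap area.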

<swap.h>
typedef struct {
    unsigned long val;
}swp_entry_t;

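Because the swap cache is simply a page cache that uses a long as its key, a swapped-out page can be uniquely identified in this way.

Because this may change in the future, the unsigned long value is not used directly but is hidden in a structure. Since the contents of a swp_entry_t value are accessed only through dedicated functions, the swap implementation need not be rewritten even if a future kernel version changes the internal representation.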

<linux/swapops.h>
#define SWP_TYPE_SHIFT(e)   (sizeof(e.val) * 8 - MAX_SWAPFILES_SHIFT) //MAX_SWAPFILES_SHIFT = 5
#define SWP_OFFSET_MASK(e)  ((1UL << SWP_TYPE_SHIFT(e)) - 1)

static inline unsigned swp_type(swp_entry_t entry)
{
    return (entry.val >> SWP_TYPE_SHIFT(entry));
}

static inline pgoff_t swp_offset(swp_entry_t entry)
{
    return entry.val & SWP_OFFSET_MASK(entry);
}

static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)// build a swp_entry_t from the given type and offset
{
    swp_entry_t ret;

    ret.val = (type << SWP_TYPE_SHIFT(ret)) |
            (offset & SWP_OFFSET_MASK(ret));
    return ret;
}
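
To make the bit layout concrete, the following user-space sketch mirrors the macros above with the width fixed at 64 bits and 5 type bits. The fixed width and all toy_ names are assumptions made for the example; the kernel derives the shift from the size of unsigned long.

```c
#include <stdint.h>

#define TOY_SWAPFILES_SHIFT 5
#define TOY_TYPE_SHIFT      (64 - TOY_SWAPFILES_SHIFT)             /* 59 */
#define TOY_OFFSET_MASK     ((UINT64_C(1) << TOY_TYPE_SHIFT) - 1)  /* low 59 bits */

typedef struct {
    uint64_t val;
} toy_swp_entry_t;

/* Pack the swap area number into the top 5 bits, the slot offset below. */
toy_swp_entry_t toy_swp_entry(uint64_t type, uint64_t offset)
{
    toy_swp_entry_t ret;

    ret.val = (type << TOY_TYPE_SHIFT) | (offset & TOY_OFFSET_MASK);
    return ret;
}

/* Recover the swap area number. */
unsigned toy_swp_type(toy_swp_entry_t e)
{
    return (unsigned)(e.val >> TOY_TYPE_SHIFT);
}

/* Recover the slot offset within the swap area. */
uint64_t toy_swp_offset(toy_swp_entry_t e)
{
    return e.val & TOY_OFFSET_MASK;
}
```

With 5 type bits, 32 swap areas can be distinguished, which matches the size of the swap_info array.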

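Structure of the swap cache

In terms of data structures, the swap cache is nothing more than a page cache. The core of its implementation is the swapper_space object.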

<mm/swap_state.c>

static const struct address_space_operations swap_aops = {
    .writepage  = swap_writepage,
    .sync_page  = block_sync_page,
    .set_page_dirty = __set_page_dirty_nobuffers,
    .migratepage    = migrate_page,
};

struct address_space swapper_space = {
    .page_tree  = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
    .tree_lock  = __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
    .a_ops      = &swap_aops,
    .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
    .backing_dev_info = &swap_backing_dev_info,
};

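The kernel provides a set of swap cache access functions that any kernel code involved in memory management may use. They can, for example, add pages to the swap cache or look up pages in it. These functions form the interface between the swap cache and the page replacement logic and can therefore be used to issue commands to swap pages in or out without regard to the technical details of how the data is transferred afterward.

The kernel also provides a set of functions that handle the address space provided by the swap cache. As with other address spaces and the page cache, these functions are collected in an address_space_operations instance that is associated with swapper_space through its a_ops member. They form the downward interface of the swap cache: below the swap cache sits the implementation that transfers data between the system's swap areas and physical memory, and these functions are the interface to it. Unlike the set of functions mentioned earlier, these routines are not concerned with which pages are swapped out or in; they handle the technical details of the data transfer for the selected pages.

swap_writepage synchronizes dirty pages with the underlying block device. Unlike in other page caches, its purpose is not to maintain consistency between physical memory and the block device but to remove pages from the swap cache and transfer their data to the swap area. The function is therefore responsible for the data transfer from physical memory to the swap area on disk.
Pages must be marked dirty in the swap cache without allocating new memory, because memory is certainly already scarce when the swap-out mechanism is in use. One possible way to mark a page dirty would be to create buffers so that the data can be written back block by block. That approach needs extra memory for the buffer_head instances (which hold the required management data). Since the swap cache only ever writes back whole pages, this would be pointless. The kernel therefore uses __set_page_dirty_nobuffers to mark the page dirty; it sets the PG_dirty flag without creating buffers.
Like most page caches, the swap cache uses the kernel's standard implementation (block_sync_page) to synchronize pages with the swap area. The function does nothing more than unplug the corresponding block device queue. For the swap cache, this means that all data transfer requests sent to the block layer are subsequently executed.

Adding new pages

add_to_swap is called when the kernel wants to swap out a page actively, that is, when the policy algorithm has determined that too little memory is available. The routine not only adds the page to the swap cache (where it stays until its data has been written to disk) but also allocates a slot for it in one of the swap areas.
When a page shared by several processes is read in from the swap area (this can be determined from the usage counter in the swap area), the page is kept in both the swap area and the swap cache until it is either swapped out again or swapped in by all processes that share it. The kernel implements this behavior with add_to_swap_cache, which adds a page to the swap cache without operating on the swap area.

Reserving page slots

The kernel delegates this task to get_swap_page, which takes no parameters and returns the identifier of the slot to use next.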

<mm/swapfile.c>

swp_entry_t get_swap_page(void)
{
    struct swap_info_struct *si;
    pgoff_t offset;
    int type, next;
    int wrapped = 0;

    spin_lock(&swap_lock);
    if (nr_swap_pages <= 0)
        goto noswap;
    nr_swap_pages--;

    for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
        si = swap_info + type;
        next = si->next;
        if (next < 0 ||
            (!wrapped && si->prio != swap_info[next].prio)) {// in the second case, restart from the head of the list
            next = swap_list.head;
            wrapped++;
        }

        if (!si->highest_bit)
            continue;
        if (!(si->flags & SWP_WRITEOK))
            continue;

        swap_list.next = next;
        offset = scan_swap_map(si);// scan the slot allocation map
        if (offset) {
            spin_unlock(&swap_lock);
            return swp_entry(type, offset);
        }
        next = swap_list.next;
    }

    nr_swap_pages++;
noswap:
    spin_unlock(&swap_lock);
    return (swp_entry_t) {0};
}

static inline unsigned long scan_swap_map(struct swap_info_struct *si)
{
    unsigned long offset, last_in_cluster;
    int latency_ration = LATENCY_LIMIT;

    /*
     * We try to cluster swap pages by allocating them sequentially
     * in swap.  Once we've allocated SWAPFILE_CLUSTER pages this
     * way, however, we resort to first-free allocation, starting
     * a new cluster.  This prevents us from scattering swap pages
     * all over the entire swap partition, so that we reduce
     * overall disk seek times between swap pages.  -- sct
     * But we do now try to find an empty cluster.  -Andrea
     */
    // cluster_nr: number of free slots left in the current cluster
    // cluster_next: index of the slot to use next

    si->flags += SWP_SCANNING;
    if (unlikely(!si->cluster_nr)) {// no free entries left in the current cluster
        si->cluster_nr = SWAPFILE_CLUSTER - 1;
        if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER)
            goto lowest;
        spin_unlock(&swap_lock);

        offset = si->lowest_bit;
        last_in_cluster = offset + SWAPFILE_CLUSTER - 1;

        /* Locate the first empty (unaligned) cluster */
        for (; last_in_cluster <= si->highest_bit; offset++) {
            if (si->swap_map[offset])
                last_in_cluster = offset + SWAPFILE_CLUSTER;
            else if (offset == last_in_cluster) {
                spin_lock(&swap_lock);
                si->cluster_next = offset-SWAPFILE_CLUSTER+1;
                goto cluster;
            }
            if (unlikely(--latency_ration < 0)) {
                cond_resched();
                latency_ration = LATENCY_LIMIT;
            }
        }
        spin_lock(&swap_lock);
        goto lowest;
    }
    // the current cluster still has free slots
    si->cluster_nr--;
cluster:
    offset = si->cluster_next;
    if (offset > si->highest_bit)
lowest:     offset = si->lowest_bit;
checks: if (!(si->flags & SWP_WRITEOK))
        goto no_page;
    if (!si->highest_bit)
        goto no_page;
    if (!si->swap_map[offset]) {
        if (offset == si->lowest_bit)
            si->lowest_bit++;
        if (offset == si->highest_bit)
            si->highest_bit--;
        si->inuse_pages++;
        if (si->inuse_pages == si->pages) {
            si->lowest_bit = si->max;
            si->highest_bit = 0;
        }
        si->swap_map[offset] = 1;
        si->cluster_next = offset + 1;
        si->flags -= SWP_SCANNING;
        return offset;
    }

    spin_unlock(&swap_lock);
    while (++offset <= si->highest_bit) {
        if (!si->swap_map[offset]) {
            spin_lock(&swap_lock);
            goto checks;
        }
        if (unlikely(--latency_ration < 0)) {
            cond_resched();
            latency_ration = LATENCY_LIMIT;
        }
    }
    spin_lock(&swap_lock);
    goto lowest;

no_page:
    si->flags -= SWP_SCANNING;
    return 0;
}

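Allocating swap space

Once the policy routines have determined which pages to swap out, add_to_swap in mm/swap_state.c comes into play. The function takes a struct page instance as a parameter and forwards the swap-out request to the technical implementation of swapping.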

<mm/swap_state.c>
/**
 * add_to_swap - allocate swap space for a page
 * @page: page we want to move to swap
 *
 * Allocate swap space for the page and add the page to the
 * swap cache.  Caller needs to hold the page lock.
 */
int add_to_swap(struct page * page, gfp_t gfp_mask)
{
    swp_entry_t entry;
    int err;

    BUG_ON(!PageLocked(page));

    for (;;) {
        entry = get_swap_page();// obtain a slot in a swap area
        if (!entry.val)
            return 0;

        /*
         * Radix-tree node allocations from PF_MEMALLOC contexts could
         * completely exhaust the page allocator. __GFP_NOMEMALLOC
         * stops emergency reserves from being allocated.
         *
         * TODO: this could cause a theoretical memory reclaim
         * deadlock in the swap out path.
         */
        /*
         * Add it to the swap cache and mark it dirty
         */
        err = __add_to_swap_cache(page, entry,
                gfp_mask|__GFP_NOMEMALLOC|__GFP_NOWARN);// very similar to add_to_page_cache; the main difference is that it sets PG_swapcache on the page instance and stores the swap identifier swp_entry_t in the private member of the page

        switch (err) {
        case 0:             /* Success */
            SetPageUptodate(page);
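            SetPageDirty(page);// mark the page dirty (a page in a page cache is synchronized with its underlying block device only after it becomes dirty; for swap pages the underlying block device is the swap area, so synchronizing is (almost) equivalent to swapping the page out)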
            INC_CACHE_INFO(add_total);
            return 1;
        case -EEXIST:
            /* Raced with "speculative" read_swap_cache_async */
            INC_CACHE_INFO(exist_race);
            swap_free(entry);
            continue;
        default:
            /* -ENOMEM radix-tree allocation failure */
            swap_free(entry);
            return 0;
        }
    }
}

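As far as the policy routines are concerned, all they need to know is that the kernel takes care of actually writing the page data to the swap area, so from their point of view a page has been freed once add_to_swap has been called.

Caching swap pages

Unlike add_to_swap, add_to_swap_cache adds a page to the swap cache but requires that a slot has already been allocated for it.

Why add a page to the swap cache if it already has a slot? This is needed when pages are swapped in. Suppose a page shared by many processes has been swapped out. When it is swapped in again, its data must be kept in the swap cache after the first process has swapped it in, until all processes have finished swapping it in. Only then can the page be removed from the swap cache, because all user processes concerned now know its new position in memory. The swap cache is used in the same way for readahead of swap pages. In that case the pages read in have not yet been requested by a page fault but probably will be shortly.

add_to_swap_cache is simple. Its basic task is to call __add_to_swap_cache, which adds the page to the swap cache just as in add_to_swap. Before that, however, swap_duplicate must be called to make sure the page already has a swap entry. swap_duplicate also increments the swap map count of the corresponding slot by 1, which indicates that the page has been swapped out from more than one place.

The main difference between add_to_swap and add_to_swap_cache is that the latter sets neither PG_uptodate nor PG_dirty. In essence, this means the kernel does not need to write the page to the swap area; the two copies are currently in sync.

Searching for a page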

<mm/swap_state.c>

struct page * lookup_swap_cache(swp_entry_t entry)
{
    struct page *page;

    page = find_get_page(&swapper_space, entry.val);

    if (page)
        INC_CACHE_INFO(find_success);

    INC_CACHE_INFO(find_total);
    return page;
}

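Writing data back

swap_writepage

Page reclaim

Overview

If the kernel detects an acute shortage of memory during an operation, it calls try_to_free_pages. The function checks all pages in the current zone and frees those used least often.
A background daemon named kswapd checks memory usage periodically and detects impending memory shortage. It can swap out pages as a preventive measure so that the kernel does not run out of memory during another operation.

The two code paths merge soon, in the shrink_zone function. The remaining code is identical for both paths of the page reclaim subsystem.

The kernel tries to sort pages into two LRU lists: one for active pages and one for inactive pages. These lists are managed per zone: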

<mmzone.h>
struct zone {
...
    struct list_head active_list;
    struct list_head inactive_list;
...
}

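The decisions on how many pages to reclaim and which ones are made in the following steps.

shrink_zone is the entry point for removing rarely used pages from memory and is called from the periodic kswapd mechanism. It is responsible for two things: it tries to maintain a balance between the numbers of active and inactive pages in a zone by moving pages between the active and inactive lists (using shrink_active_list), and it controls the release of the selected pages by means of shrink_inactive_list. shrink_zone thus acts as a go-between for the logic that determines how many pages of a zone to swap out and the decision about which pages those are.
shrink_active_list is a comprehensive helper function that the kernel uses to move pages between the active and inactive lists. The function is told how many pages to transfer between the lists and then tries to select the least-used ones. In essence, shrink_active_list therefore decides which pages will be swapped out later and which will be kept. In other words, it implements the policy part of page selection.
shrink_inactive_list removes a selected number of inactive pages from the inactive list of a given zone and passes them to shrink_page_list, which issues requests to the respective backing stores to write the data back so that space is freed in physical memory and the selected pages are reclaimed.

Data structures

Page vectors

A page vector makes it possible to perform an operation on a group of page structures as a whole.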

<linux/pagevec.h>
struct pagevec {
    unsigned long nr;
    unsigned long cold;
    struct page *pages[PAGEVEC_SIZE];
};

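pagevec_release decrements the usage counter of every page in the vector by 1. Pages whose usage counter drops to 0, that is, pages no longer in use, are returned to the buddy system automatically. A page that is on one of the system's LRU lists is removed from that list, whatever the value of its usage counter.
pagevec_free returns the memory occupied by a group of pages to the buddy system. The caller is responsible for making sure that the usage counter of each page is 0 (the page is not used anywhere else) and that the page is not on any LRU list.
pagevec_release_nonlru is another function for releasing pages; it decrements the usage counter of every page in a given page vector by 1. When a counter reaches 0, the memory occupied by the page is returned to the buddy system. Unlike pagevec_release, it assumes that none of the pages in the vector is on an LRU list.
pagevec_add adds a new page to a given page vector pvec.

The LRU cache

The kernel provides another cache, the LRU cache, to speed up adding pages to the system's LRU lists. It uses page vectors to collect page instances and places them on the active or inactive list in batches. The two lists are a hot spot in the kernel and must be protected by a spinlock. To reduce lock contention, new pages are not added to the lists immediately but are first buffered on a per-CPU list.

lru_cache_add first increments the count usage counter of the page instance by 1, because the page is now in the page cache (which counts as a use). It then adds the page to the CPU-specific page vector with pagevec_add.

pagevec_add returns the number of array entries still free in the page vector after the new page has been added. A return value of 0 means the vector is full after the last addition, and __pagevec_lru_add is called. That function adds all pages in the vector to the inactive list of the zone each page belongs to (the pages in a vector may belong to different zones). The PG_lru flag is set on each page because it is now on an LRU list. The contents of the page vector are then deleted to make room for new pages in the cache.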
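
The batching pattern can be modeled in user space as follows. This is a sketch under assumptions: the vector size of 14 is taken as an example value, the LRU list is reduced to two counters, and every toy_ name is hypothetical.

```c
#define TOY_PAGEVEC_SIZE 14  /* assumed batch size for this sketch */

struct toy_pagevec {
    unsigned long nr;                /* entries currently in use */
    void *pages[TOY_PAGEVEC_SIZE];
};

/* Stand-in for an LRU list: counts pages and batch transfers. */
struct toy_lru {
    unsigned long on_list;
    unsigned long drains;
};

/* Add a page and return the number of entries still free. */
unsigned int toy_pagevec_add(struct toy_pagevec *pvec, void *page)
{
    pvec->pages[pvec->nr++] = page;
    return (unsigned int)(TOY_PAGEVEC_SIZE - pvec->nr);
}

/*
 * Buffer a page; once the vector is full, move the whole batch to the
 * list in one step. In the kernel this is the point where the list
 * lock would be taken, once per batch instead of once per page.
 */
void toy_lru_cache_add(struct toy_pagevec *pvec, struct toy_lru *lru,
                       void *page)
{
    if (toy_pagevec_add(pvec, page) == 0) {
        lru->on_list += pvec->nr;
        lru->drains++;
        pvec->nr = 0;  /* empty the vector for the next batch */
    }
}
```

In this model, adding 14 pages triggers exactly one transfer to the list, so the cost of protecting the list is paid once per batch.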

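lru_cache_add_active works exactly like lru_cache_add, except that the former is for active pages and the latter for inactive pages. It uses lru_add_pvecs_active as its buffer. When pages are transferred from the buffer to the active list, not only PG_lru but also PG_active is set.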

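Determining page activity

To assess the importance of a page, the kernel must track not only whether it is used by one or more processes but also how often it is accessed.

Two page flags, referenced and active, are therefore introduced. The corresponding flag bits are PG_referenced and PG_active.

The core idea is that one flag expresses the current degree of activity while the other indicates whether the page has been referenced recently.

If a page is considered active, PG_active is set; otherwise it is not. Whether the flag is set corresponds directly to the LRU list the page is on: the (zone-specific) active list or inactive list.
PG_referenced is set every time the page is accessed. The mark_page_accessed function is responsible for this, and the kernel must make sure it is called where appropriate.
PG_referenced and the information provided by reverse mapping are used to determine the activity of a page. The key point is that the activity of the page is checked every time PG_referenced is cleared. The page_referenced function implements this behavior.
Now consider mark_page_accessed again. If it finds PG_referenced already set when it examines a page, page_referenced has not performed a check in the meantime. Calls to mark_page_accessed must therefore be more frequent than calls to page_referenced, which means the page is accessed often. If the page is currently on the inactive list, it is moved to the active list. In addition, PG_active is set and PG_referenced is cleared.
The reverse transfer is also possible. A page on the active list that receives a lot of attention normally has PG_referenced set. When activity on the page decreases, two page_referenced calls, with no mark_page_accessed call in between, are needed before it is moved to the inactive list.

A page that is not accessed often (and is therefore inactive) has neither PG_active nor PG_referenced set. Two mark_page_accessed calls (with no page_referenced call in between) are then needed to move it from the inactive list to the active list. The converse also holds: a highly active page has both PG_active and PG_referenced set and needs two page_referenced calls (with no mark_page_accessed call in between) to move from the active list to the inactive list.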
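
The two-flag scheme behaves like a small state machine, which the following sketch models with two booleans. It is a simplification under assumptions: toy_reclaim_scan folds the clearing done by page_referenced and the later demotion by the reclaim code into one hypothetical function, and list handling is left out.

```c
#include <stdbool.h>

struct toy_page {
    bool active;      /* models PG_active */
    bool referenced;  /* models PG_referenced */
};

/* Access side: two accesses without a scan in between promote a page. */
void toy_mark_page_accessed(struct toy_page *p)
{
    if (!p->active && p->referenced) {
        p->active = true;       /* inactive,referenced -> active,unreferenced */
        p->referenced = false;
    } else if (!p->referenced) {
        p->referenced = true;   /* remember the first access */
    }
}

/* Reclaim side: two scans without an access in between demote a page. */
void toy_reclaim_scan(struct toy_page *p)
{
    if (p->referenced)
        p->referenced = false;  /* first scan only clears the bit */
    else if (p->active)
        p->active = false;      /* second scan: move to the inactive list */
}
```

Starting from an inactive, unreferenced page, two accesses make it active; from an active, referenced page, two scans make it inactive again.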

The kernel provides several helper functions for moving pages between the two LRU lists:

<linux/mm_inline.h>

void add_page_to_active_list(struct zone *zone, struct page *page)
void add_page_to_inactive_list(struct zone *zone, struct page *page)

void del_page_from_active_list(struct zone *zone, struct page *page)
void del_page_from_inactive_list(struct zone *zone, struct page *page)

void del_page_from_lru(struct zone *zone, struct page *page)

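The only point to note is that del_page_from_lru must be used if the caller does not know which LRU list the page is currently on.

Promoting an inactive page to the active list: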

<mm/swap.c>
void fastcall activate_page(struct page *page)
{
    struct zone *zone = page_zone(page);

    spin_lock_irq(&zone->lru_lock);
    if (PageLRU(page) && !PageActive(page)) {
        del_page_from_inactive_list(zone, page);
        SetPageActive(page);
        add_page_to_active_list(zone, page);
        __count_vm_event(PGACTIVATE);
    }
    spin_unlock_irq(&zone->lru_lock);
}

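Moving pages from the active list to the inactive list is hidden inside a larger function, shrink_active_list, which relies internally on page_referenced. Besides handling the PG_referenced bit as described above, page_referenced also queries how often the page is referenced from page tables, mainly by means of the reverse mapping mechanism. page_referenced takes the parameter is_locked, which indicates whether the page has already been locked by the caller.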

<mm/rmap.c>

int page_referenced(struct page *page, int is_locked)
{
    int referenced = 0;

    if (page_test_and_clear_young(page))
        referenced++;

    if (TestClearPageReferenced(page))
        referenced++;

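    if (page_mapped(page) && page->mapping) {// if the page is mapped into the address space of a process, references to it must be determined from hardware-specific bits in the page tables
        if (PageAnon(page))
            referenced += page_referenced_anon(page);// on IA-32 and AMD64 this is the number of page table entries that point to the page and have _PAGE_BIT_ACCESSED set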
        else if (is_locked)
            referenced += page_referenced_file(page);
        else if (TestSetPageLocked(page))
            referenced++;
        else {
            if (page->mapping)
                referenced += page_referenced_file(page);
            unlock_page(page);
        }
    }
    return referenced;
}

mark_page_accessed:

<mm/swap.c>
/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced    ->  inactive,referenced
 * inactive,referenced      ->  active,unreferenced
 * active,unreferenced      ->  active,referenced
 */
void fastcall mark_page_accessed(struct page *page)
{
    if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
        activate_page(page);
        ClearPageReferenced(page);
    } else if (!PageReferenced(page)) {
        SetPageReferenced(page);
    }
}

Shrinking zones

Controlling the scan

<mm/vmscan.c>

struct scan_control {
    /* Incremented by the number of inactive pages that were scanned */
    unsigned long nr_scanned;

    /* This context's GFP mask */
    gfp_t gfp_mask;

    int may_writepage;

    /* Can pages be swapped as part of reclaim? */
    int may_swap;

    /* This context's SWAP_CLUSTER_MAX. If freeing memory for
     * suspend, we effectively ignore SWAP_CLUSTER_MAX.
     * In this context, it doesn't matter that we scan the
     * whole list at once. */
    int swap_cluster_max;

    int swappiness;

    int all_unreclaimable;

    int order;
};

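nr_scanned reports to the caller the number of inactive pages scanned so far; it is used for communication among the kernel functions involved in page reclaim.
gfp_mask specifies the page allocation flags that are valid in the context from which page reclaim was invoked.
may_writepage specifies whether the kernel is allowed to write pages out to the backing store.
may_swap determines whether swapping is allowed as part of page reclaim.
swap_cluster_max is not really related to swapping. It is a threshold that gives the minimum number of pages per LRU list that are scanned in one page reclaim step. It is usually set to SWAP_CLUSTER_MAX, which is defined as 32 by default.
swappiness controls how aggressively the kernel swaps out pages; the value ranges from 0 to 100. vm_swappiness is used by default. Its standard setting is 60, and it can be adjusted through /proc/sys/vm/swappiness.
all_unreclaimable reports the unfortunate situation in which the memory in all zones is currently completely unreclaimable. This can happen, for example, when all pages are pinned by the mlock system call.
The kernel can actively try to reclaim a group of pages for a given allocation order. order is that allocation order, meaning that 2^order contiguous pages are to be reclaimed.

struct zone contains several fields that are used below: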

<mmzone.h>
struct zone {
...
    unsigned long nr_scan_active;// number of pages to scan on the active list
    unsigned long nr_scan_inactive;// number of pages to scan on the inactive list
    unsigned long pages_scanned;// number of pages scanned during the previous reclaim pass
...
    /* Zone statistics */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
...
};

Implementation

<mm/vmscan.c>

static unsigned long shrink_zone(int priority, struct zone *zone,
                struct scan_control *sc)
{
    unsigned long nr_active;
    unsigned long nr_inactive;
    unsigned long nr_to_scan;
    unsigned long nr_reclaimed = 0;

    /*
     * Add one to `nr_to_scan' just to make sure that the kernel will
     * slowly sift through the active list.
     */
    zone->nr_scan_active +=
        (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
    nr_active = zone->nr_scan_active;
    if (nr_active >= sc->swap_cluster_max)
        zone->nr_scan_active = 0;
    else
        nr_active = 0;

    zone->nr_scan_inactive +=
        (zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
    nr_inactive = zone->nr_scan_inactive;
    if (nr_inactive >= sc->swap_cluster_max)
        zone->nr_scan_inactive = 0;
    else
        nr_inactive = 0;

    while (nr_active || nr_inactive) {
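        if (nr_active) {// when active pages are scanned, the kernel uses shrink_active_list to move pages from the active list to the inactive list; naturally, the least-used active pages are moved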
            nr_to_scan = min(nr_active,
                    (unsigned long)sc->swap_cluster_max);
            nr_active -= nr_to_scan;
            shrink_active_list(nr_to_scan, zone, sc, priority);
        }

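        if (nr_inactive) {// inactive pages can be removed from the caches directly with shrink_inactive_list; it tries to reclaim the required number of pages from the inactive list and returns the number actually reclaimed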
            nr_to_scan = min(nr_inactive,
                    (unsigned long)sc->swap_cluster_max);
            nr_inactive -= nr_to_scan;
            nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
                                sc);
        }
    }

    throttle_vm_writeout(sc->gfp_mask);
    return nr_reclaimed;
}

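On each call of shrink_zone, the numbers of active and inactive pages to scan, nr_scan_active and nr_scan_inactive, are each increased by the current number of active or inactive pages in the zone shifted right by priority bits, plus 1. The amount added is thus roughly the current page count divided by 2^priority. The extra 1 guarantees that the counters grow by at least 1 on every call even if the shift yields 0 for a long time, which can happen under some system loads. In that case, adding 1 also ensures that the inactive list is eventually refilled or the caches are eventually shrunk.

If, after the addition, one of the values is greater than or equal to sc->swap_cluster_max (nominally the maximum number of pages in a swap cluster), the corresponding zone member is set to 0 and the local variable nr_active or nr_inactive keeps its value; otherwise the zone member keeps its value and the local variable is set to 0.

This behavior ensures that the kernel takes no further action unless the number of active or inactive pages to scan has reached the threshold given by sc->swap_cluster_max.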
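
The accumulate-then-threshold logic can be isolated in a small helper. The function below is a hypothetical sketch that handles one of the two counters; the example values for priority and the threshold are assumptions chosen for illustration.

```c
/*
 * Add this round's share to the accumulated scan count and decide
 * whether to act. Returns the number of pages to scan now (and resets
 * the accumulator), or 0 if the threshold has not been reached yet.
 */
unsigned long toy_scan_budget(unsigned long *nr_scan,
                              unsigned long lru_pages,
                              int priority,
                              unsigned long cluster_max)
{
    unsigned long n;

    /* The +1 makes sure the accumulator grows even if the shift is 0. */
    *nr_scan += (lru_pages >> priority) + 1;

    if (*nr_scan < cluster_max)
        return 0;   /* keep accumulating */

    n = *nr_scan;
    *nr_scan = 0;
    return n;
}
```

With 1000 pages on the list, a priority of 12, and a threshold of 32, each call adds only 1, so scanning starts on the 32nd call. At priority 2 a single call adds 251 and scanning starts at once.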

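Isolating LRU pages and lumpy reclaim

The active and inactive pages kept on the lists of a zone must be protected by a spinlock, zone->lru_lock to be precise.

isolate_lru_pages selects a given number of pages from the active or the inactive list. This is not very difficult: starting at the end of the list (this is important because the oldest pages must be scanned first in an LRU algorithm), it loops over the list, takes one page per step, and moves it to a local list until the required number of pages has been reached. PG_lru must be cleared for each page because the page is now no longer on an LRU list. The function also implements what is known as lumpy reclaim.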

<mm/vmscan.c>
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
        struct list_head *src, struct list_head *dst,
        unsigned long *scanned, int order, int mode)
{
    unsigned long nr_taken = 0;
    unsigned long scan;

    for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {// the loop iterates until the required number of pages has been scanned
        struct page *page;
        unsigned long pfn;
        unsigned long end_pfn;
        unsigned long page_pfn;
        int zone_id;

        page = lru_to_page(src);
        prefetchw_prev_lru_page(page, src, flags);

        VM_BUG_ON(!PageLRU(page));

        switch (__isolate_lru_page(page, mode)) {
        case 0:
            list_move(&page->lru, dst);
            nr_taken++;
            break;

        case -EBUSY:
            /* else it is being freed elsewhere */
            list_move(&page->lru, src);
            continue;

        default:
            BUG();
        }

        if (!order)// if no allocation order was requested, go on to the next iteration after isolating one page from the LRU list
            continue;

        /*
         * Attempt to take all pages in the order aligned region
         * surrounding the tag page.  Only take those pages of
         * the same active state as that tag page.  We may safely
         * round the target page pfn down to the requested order
         * as the mem_map is guarenteed valid out to MAX_ORDER,
         * where that page is in a different zone we will detect
         * it from its zone id and abort this block scan.
         */
        zone_id = page_zone_id(page);
        page_pfn = page_to_pfn(page);
        pfn = page_pfn & ~((1 << order) - 1);// compute the page frame interval into which the tag page falls
        end_pfn = pfn + (1 << order);
        for (; pfn < end_pfn; pfn++) {
            struct page *cursor_page;

            /* The target page is in the block, ignore it. */
            if (unlikely(pfn == page_pfn))// the target page is already in the block, skip it
                continue;

            /* Avoid holes within the zone. */
            if (unlikely(!pfn_valid_within(pfn)))
                break;

            cursor_page = pfn_to_page(pfn);
            /* Check that we have not crossed a zone boundary. */
            if (unlikely(page_zone_id(cursor_page) != zone_id))// check that no zone boundary has been crossed
                continue;
            switch (__isolate_lru_page(cursor_page, mode)) {
            case 0:
                list_move(&cursor_page->lru, dst);
                nr_taken++;
                scan++;
                break;

            case -EBUSY:
                /* else it is being freed elsewhere */
                list_move(&cursor_page->lru, src);
            default:
                break;
            }
        }
    }

    *scanned = scan;
    return nr_taken;
}
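
The order-aligned window that the inner loop of isolate_lru_pages walks over can be computed in isolation. The helper below is a hypothetical user-space sketch of that one step; struct toy_pfn_window and toy_order_window are names assumed for the example.

```c
struct toy_pfn_window {
    unsigned long start;  /* first page frame number of the block */
    unsigned long end;    /* one past the last page frame number */
};

/*
 * Round the page frame number of the tag page down to a multiple of
 * 2^order and return the block of 2^order frames that contains it.
 */
struct toy_pfn_window toy_order_window(unsigned long page_pfn,
                                       unsigned int order)
{
    struct toy_pfn_window w;

    w.start = page_pfn & ~((1UL << order) - 1);
    w.end = w.start + (1UL << order);
    return w;
}
```

For page frame 1029 and order 3, the window is 1024 to 1031 (end value 1032): lumpy reclaim tries to isolate the seven neighbors of the tag page so that a contiguous block of eight frames can be freed.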

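Note that __isolate_lru_page takes an additional parameter that controls the activity state of the pages that make up the new cluster. There are three possible choices:

<mm/vmscan.c>
#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
#define ISOLATE_ACTIVE 1   /* Isolate active pages. */
#define ISOLATE_BOTH 2     /* Isolate both active and inactive pages. */

Shrinking the list of active pages

Moving pages from the active list to the inactive list is one of the key operations in the policy implementation of page reclaim, because this is where the importance of the individual pages in the system (or, more precisely, in the zone in question) is evaluated.

shrink_active_list:

isolate_lru_pages moves the required number of pages (given by nr_pages) from the active list to a local temporary list.
The pages are distributed to the active and inactive lists according to their activity.
Unimportant pages are released in a batch.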

<mm/vmscan.c>
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
                struct scan_control *sc, int priority)
{
    unsigned long pgmoved;
    int pgdeactivate = 0;
    unsigned long pgscanned;
    LIST_HEAD(l_hold);  /* The pages which were snipped off */// l_hold holds the pages still to be scanned; their destination is decided only after the scan
    LIST_HEAD(l_inactive);  /* Pages to go onto the inactive_list */
    LIST_HEAD(l_active);    /* Pages to go onto the active_list */// l_active and l_inactive hold the pages that are put back on the zone's active or inactive list at the end of the function
    struct page *page;
    struct pagevec pvec;
    int reclaim_mapped = 0;

    if (sc->may_swap) {
        long mapped_ratio;
        long distress;
        long swap_tendency;
        long imbalance;

        if (zone_is_near_oom(zone))
            goto force_reclaim_mapped;

        /*
         * `distress' is a measure of how much trouble we're having
         * reclaiming pages.  0 -> no problems.  100 -> great trouble.
         */
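        distress = 100 >> min(zone->prev_priority, priority);// distress is the key indicator of how urgently the kernel needs fresh memory. It is computed by shifting the fixed value 100 right by min(prev_priority, priority) bits; prev_priority is the priority with which the zone was scanned during the last try_to_free_pages run. Note that a lower prev_priority value means a higher priority.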

        /*
         * The point of this algorithm is to decide when to start
         * reclaiming mapped memory instead of just pagecache.  Work out
         * how much memory
         * is mapped.
         */
        mapped_ratio = ((global_page_state(NR_FILE_MAPPED) + // mapped_ratio is the percentage of the total available memory that is mapped
                global_page_state(NR_ANON_PAGES)) * 100) /
                    vm_total_pages;

        /*
         * Now decide how much we really want to unmap some pages.  The
         * mapped ratio is downgraded - just because there's a lot of
         * mapped memory doesn't necessarily mean that page reclaim
         * isn't succeeding.
         *
         * The distress ratio is important - we don't want to start
         * going oom.
         *
         * A 100% value of vm_swappiness overrides this algorithm
         * altogether.
         */
        swap_tendency = mapped_ratio / 2 + distress + sc->swappiness;// sc->swappiness is another kernel parameter, normally based on the setting in /proc/sys/vm/swappiness

        /*
         * If there's huge imbalance between active and inactive
         * (think active 100 times larger than inactive) we should
         * become more permissive, or the system will take too much
         * cpu before it start swapping during memory pressure.
         * Distress is about avoiding early-oom, this is about
         * making swappiness graceful despite setting it to low
         * values.
         *
         * Avoid div by zero with nr_inactive+1, and max resulting
         * value is vm_total_pages.
         */
        imbalance  = zone_page_state(zone, NR_ACTIVE);
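        imbalance /= zone_page_state(zone, NR_INACTIVE) + 1;// if the lengths of the active and inactive lists are badly out of balance, the kernel allows swapping and page reclaim to happen more easily so that the lengths even out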

        /*
         * Reduce the effect of imbalance if swappiness is low,
         * this means for a swappiness very low, the imbalance
         * must be much higher than 100 for this logic to make
         * the difference.
         *
         * Max temporary value is vm_total_pages*100.
         */
        imbalance *= (vm_swappiness + 1);
        imbalance /= 100;

        /*
         * If not much of the ram is mapped, makes the imbalance
         * less relevant, it's high priority we refill the inactive
         * list with mapped pages only in presence of high ratio of
         * mapped pages.
         *
         * Max temporary value is vm_total_pages*100.
         */
        imbalance *= mapped_ratio;
        imbalance /= 100;

        /* apply imbalance feedback to swap_tendency */
        swap_tendency += imbalance;

        /*
         * Now use this metric to decide whether to start moving mapped
         * memory onto the inactive list.
         */
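        if (swap_tendency >= 100)// the kernel now condenses all the computed information into one boolean that answers the question: should mapped pages be swapped out? If swap_tendency is 100 or more, mapped pages are swapped out and reclaim_mapped is set to 1; otherwise it keeps its default of 0 and pages are reclaimed only from the page cache
force_reclaim_mapped:
            reclaim_mapped = 1;// because vm_swappiness is added to swap_tendency, the administrator can enable swap-out of mapped pages at any time by setting vm_swappiness to 100, regardless of the other system parameters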
    }

    lru_add_drain();// distribute the data currently held in the LRU cache to the system's LRU lists
    spin_lock_irq(&zone->lru_lock);
    pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
                &l_hold, &pgscanned, sc->order, ISOLATE_ACTIVE);
    zone->pages_scanned += pgscanned;
    __mod_zone_page_state(zone, NR_ACTIVE, -pgmoved);
    spin_unlock_irq(&zone->lru_lock);

    while (!list_empty(&l_hold)) {// this part distributes the pages to the l_active and l_inactive lists
        cond_resched();
        page = lru_to_page(&l_hold);
        list_del(&page->lru);
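        if (page_mapped(page)) {// check whether the page is embedded in the page tables of any process (this is recorded in the _mapcount member of the page instance: 0 if the page is mapped by one process, -1 if it is not mapped)
            if (!reclaim_mapped ||
                (total_swap_pages == 0 && PageAnon(page)) ||
                page_referenced(page, 0)) {// the page stays active if (1) reclaim_mapped is 0, so mapped pages are not reclaimed; (2) the system has no swap area and the page is an anonymous page; or (3) page_referenced reports recent references, from the processes mapping the page or through PG_referenced
                list_add(&page->lru, &l_active);
                continue;
            }
        }
        list_add(&page->lru, &l_inactive);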
    }

    pagevec_init(&pvec, 1);
    pgmoved = 0;
    spin_lock_irq(&zone->lru_lock);
    while (!list_empty(&l_inactive)) {
        page = lru_to_page(&l_inactive);
        prefetchw_prev_lru_page(page, &l_inactive, flags);
        VM_BUG_ON(PageLRU(page));
        SetPageLRU(page);
        VM_BUG_ON(!PageActive(page));
        ClearPageActive(page);

        list_move(&page->lru, &zone->inactive_list);
        pgmoved++;
        if (!pagevec_add(&pvec, page)) { /* add the page instance to a page vector */
            __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
            spin_unlock_irq(&zone->lru_lock);
            pgdeactivate += pgmoved;
            pgmoved = 0;
            if (buffer_heads_over_limit)
                pagevec_strip(&pvec);
            /*
             * Once the page vector is full, __pagevec_release() lowers
             * the usage count of every page in it by 1; a page frame
             * whose count reaches 0 is returned to the buddy system.
             */
            __pagevec_release(&pvec);
            spin_lock_irq(&zone->lru_lock);
        }
    }
    __mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
    pgdeactivate += pgmoved;
    if (buffer_heads_over_limit) {
        spin_unlock_irq(&zone->lru_lock);
        pagevec_strip(&pvec);
        spin_lock_irq(&zone->lru_lock);
    }

    pgmoved = 0;
    while (!list_empty(&l_active)) {
        page = lru_to_page(&l_active);
        prefetchw_prev_lru_page(page, &l_active, flags);
        VM_BUG_ON(PageLRU(page));
        SetPageLRU(page);
        VM_BUG_ON(!PageActive(page));
        list_move(&page->lru, &zone->active_list);
        pgmoved++;
        if (!pagevec_add(&pvec, page)) {
            __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
            pgmoved = 0;
            spin_unlock_irq(&zone->lru_lock);
            __pagevec_release(&pvec);
            spin_lock_irq(&zone->lru_lock);
        }
    }
    __mod_zone_page_state(zone, NR_ACTIVE, pgmoved);

    __count_zone_vm_events(PGREFILL, zone, pgscanned);
    __count_vm_events(PGDEACTIVATE, pgdeactivate);
    spin_unlock_irq(&zone->lru_lock);

    pagevec_release(&pvec);
}
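
The sorting rule in the middle loop of the function above can be restated as a small pure function. The following is a userspace sketch, not kernel code: struct page_model and keep_on_active_list are hypothetical names, and the referenced field simply stands in for a non-zero result of page_referenced().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few page properties the loop inspects. */
struct page_model {
    int  mapcount;    /* mirrors _mapcount: -1 = unmapped, 0 = mapped once */
    bool anon;        /* stands in for PageAnon() */
    bool referenced;  /* stands in for a non-zero page_referenced() */
};

/* Returns true if the page would be put on l_active, false for l_inactive. */
static bool keep_on_active_list(const struct page_model *p,
                                bool reclaim_mapped,
                                unsigned long total_swap_pages)
{
    assert(p->mapcount >= -1);

    if (p->mapcount < 0)            /* unmapped pages are always deactivated */
        return false;

    return !reclaim_mapped ||
           (total_swap_pages == 0 && p->anon) ||
           p->referenced;
}
```

Note that an unmapped page is deactivated even if it was referenced; the three conditions only protect mapped pages.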

Reclaiming inactive pages

Shrinking the inactive list

<mm/vmscan.c>

static unsigned long shrink_inactive_list(unsigned long max_scan,
                struct zone *zone, struct scan_control *sc)
{
    LIST_HEAD(page_list);
    struct pagevec pvec;
    unsigned long nr_scanned = 0;
    unsigned long nr_reclaimed = 0;

    pagevec_init(&pvec, 1);

    lru_add_drain(); /* move the current contents of the LRU cache onto the active or inactive list of each zone */
    spin_lock_irq(&zone->lru_lock);
    do {
        struct page *page;
        unsigned long nr_taken;
        unsigned long nr_scan;
        unsigned long nr_freed;
        unsigned long nr_active;

        nr_taken = isolate_lru_pages(sc->swap_cluster_max,
                 &zone->inactive_list,
                 &page_list, &nr_scan, sc->order,
                 (sc->order > PAGE_ALLOC_COSTLY_ORDER)?
                         ISOLATE_BOTH : ISOLATE_INACTIVE); /* take a batch of pages off the tail of the inactive list */
        nr_active = clear_active_flags(&page_list); /* clear the active flag on the isolated pages */
        __count_vm_events(PGDEACTIVATE, nr_active);

        __mod_zone_page_state(zone, NR_ACTIVE, -nr_active);
        __mod_zone_page_state(zone, NR_INACTIVE,
                        -(nr_taken - nr_active));
        zone->pages_scanned += nr_scan;
        spin_unlock_irq(&zone->lru_lock);

        nr_scanned += nr_scan;
        nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); /* start asynchronous writeback of the pages on the list */

        /*
         * If we are direct reclaiming for contiguous pages and we do
         * not reclaim everything in the list, try again and wait
         * for IO to complete. This will stall high-order allocations
         * but that should be acceptable to the caller
         */
        if (nr_freed < nr_taken && !current_is_kswapd() &&
                    sc->order > PAGE_ALLOC_COSTLY_ORDER) {
            /*
             * nr_freed < nr_taken: not every isolated page was
             * reclaimed. Some pages on the list may be locked and
             * could not be written out in asynchronous mode.
             */
            congestion_wait(WRITE, HZ/10);

            /*
             * The attempt at page out may have made some
             * of the pages active, mark them inactive again.
             */
            nr_active = clear_active_flags(&page_list);
            count_vm_events(PGDEACTIVATE, nr_active);

            nr_freed += shrink_page_list(&page_list, sc,
                            PAGEOUT_IO_SYNC);
        }

        nr_reclaimed += nr_freed;
        local_irq_disable();
        if (current_is_kswapd()) {
            __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan);
            __count_vm_events(KSWAPD_STEAL, nr_freed);
        } else
            __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scan);
        __count_zone_vm_events(PGSTEAL, zone, nr_freed);

        if (nr_taken == 0)
            goto done;

        spin_lock(&zone->lru_lock);
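        /*
         * Put back any unfreeable pages: they must return to the LRU
         * lists. Lumpy reclaim and failed writeout attempts can leave
         * active pages on the local list, so both the active and the
         * inactive list are possible destinations.
         */
        while (!list_empty(&page_list)) {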
            page = lru_to_page(&page_list);
            VM_BUG_ON(PageLRU(page));
            SetPageLRU(page);
            list_del(&page->lru);
            if (PageActive(page))
                add_page_to_active_list(zone, page);
            else
                add_page_to_inactive_list(zone, page);
            if (!pagevec_add(&pvec, page)) {
                spin_unlock_irq(&zone->lru_lock);
                __pagevec_release(&pvec);
                spin_lock_irq(&zone->lru_lock);
            }
        }
    } while (nr_scanned < max_scan);
    spin_unlock(&zone->lru_lock);
done:
    local_irq_enable();
    pagevec_release(&pvec);
    return nr_reclaimed;
}

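Recall that when lumpy reclaim is in use, isolate_lru_pages also picks up page frames adjacent to the pages on the list. If the request that triggered the current reclaim pass has an allocation order above the threshold PAGE_ALLOC_COSTLY_ORDER, the kernel lets lumpy reclaim take both active and inactive pages next to the tagged page; for smaller orders only inactive pages are used. The reasoning is this: restricted to inactive pages, large allocations could usually not be satisfied, because on a busy kernel a large physically contiguous range is very likely to contain active pages. PAGE_ALLOC_COSTLY_ORDER defaults to 3, the order of an 8-page block. Since the code above tests sc->order > PAGE_ALLOC_COSTLY_ORDER, both lists are used only from order 4 upward, that is, for requests of 16 or more contiguous pages.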
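
The relationship between allocation order, page count and isolation mode is easy to check by hand. The sketch below mirrors the ternary expression passed to isolate_lru_pages in shrink_inactive_list; pages_for_order and lumpy_isolate_mode are hypothetical helper names, and the enum values are illustrative rather than the kernel's.

```c
#include <assert.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* default value quoted in the text */

/* Mode names as used in shrink_inactive_list(); numeric values are illustrative. */
enum isolate_mode { ISOLATE_INACTIVE, ISOLATE_BOTH };

/* Number of contiguous pages requested by an allocation of the given order. */
static unsigned long pages_for_order(int order)
{
    assert(order >= 0 && order < 31);
    return 1UL << order;
}

/* Mirrors the strict '>' comparison in shrink_inactive_list(). */
static enum isolate_mode lumpy_isolate_mode(int order)
{
    return order > PAGE_ALLOC_COSTLY_ORDER ? ISOLATE_BOTH : ISOLATE_INACTIVE;
}
```

An order-3 request (8 pages) therefore still isolates inactive pages only, while an order-4 request (16 pages) may also pull in active neighbours.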

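Performing page reclaim

shrink_page_list takes a list of pages selected for reclaim and tries to write each of them back to its backing store.

shrink_page_list thus forms the interface between two parts of the kernel: the policy code that selects pages for reclaim, and the mechanism that writes them back and frees them.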

<mm/vmscan.c>

static unsigned long shrink_page_list(struct list_head *page_list,
                    struct scan_control *sc,
                    enum pageout_io sync_writeback)
{
    LIST_HEAD(ret_pages);
    struct pagevec freed_pvec;
    int pgactivate = 0;
    unsigned long nr_reclaimed = 0;

    cond_resched();

    pagevec_init(&freed_pvec, 1);
    while (!list_empty(page_list)) {
        struct address_space *mapping;
        struct page *page;
        int may_enter_fs;
        int referenced;

        cond_resched();

        page = lru_to_page(page_list);
        list_del(&page->lru);

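        /*
         * If another part of the kernel holds the page lock, the page
         * is not reclaimed. Otherwise this code path now owns the lock
         * and goes on to reclaim the page.
         */
        if (TestSetPageLocked(page))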
            goto keep;

        VM_BUG_ON(PageActive(page));

        sc->nr_scanned++;

        if (!sc->may_swap && page_mapped(page))
            goto keep_locked;

        /* Double the slab pressure for mapped and swapcache pages */
        if (page_mapped(page) || PageSwapCache(page))
            sc->nr_scanned++;

        may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
            (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

        if (PageWriteback(page)) { /* the page is currently being written back */
            /*
             * Synchronous reclaim is performed in two passes,
             * first an asynchronous pass over the list to
             * start parallel writeback, and a second synchronous
             * pass to wait for the IO to complete.  Wait here
             * for any page for which writeback has already
             * started.
             */
            if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
                wait_on_page_writeback(page);
            else
                goto keep_locked;
        }

        /* Case in which a page cannot be reclaimed and has to return to the active LRU list */
        referenced = page_referenced(page, 1);
        /* In active use or really unfreeable?  Activate it. */
        if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
                    referenced && page_mapping_inuse(page))
            goto activate_locked;

#ifdef CONFIG_SWAP
        /*
         * Anonymous process memory has backing store?
         * Try to allocate it some swap space here.
         */
        if (PageAnon(page) && !PageSwapCache(page))
            if (!add_to_swap(page, GFP_ATOMIC)) /* reserve swap space; the data will be written to the swap area */
                goto activate_locked;
#endif /* CONFIG_SWAP */

        mapping = page_mapping(page);

        /*
         * The page is mapped into the page tables of one or more
         * processes. Try to unmap it here.
         */
        if (page_mapped(page) && mapping) {
            switch (try_to_unmap(page, 0)) { /* unmap the page from every process that uses it */
            case SWAP_FAIL:
                goto activate_locked;
            case SWAP_AGAIN:
                goto keep_locked;
            case SWAP_SUCCESS:
                ; /* try to free the page below */
            }
        }

        if (PageDirty(page)) { /* dirty page */
            if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
                goto keep_locked;
            if (!may_enter_fs)
                goto keep_locked;
            if (!sc->may_writepage)
                goto keep_locked;

            /* Page is dirty, try to write it out here */
            switch (pageout(page, mapping, sync_writeback)) { /* calls the writepage address space operation to get the data written back */
            case PAGE_KEEP:
                goto keep_locked;
            case PAGE_ACTIVATE:
                goto activate_locked;
            case PAGE_SUCCESS: /* the write request was submitted to the block layer */
                if (PageWriteback(page) || PageDirty(page))
                    goto keep; /* puts the page on ret_pages, a list local to shrink_page_list */
                /*
                 * A synchronous write - probably a ramdisk.  Go
                 * ahead and try to reclaim the page.
                 */
                if (TestSetPageLocked(page))
                    goto keep;
                if (PageDirty(page) || PageWriteback(page))
                    goto keep_locked;
                mapping = page_mapping(page);
            case PAGE_CLEAN: /* the data is in sync with the backing store; the memory can be reclaimed */
                ; /* try to free the page below */
            }
        }

        /*
         * If the page has buffers, try to free the buffer mappings
         * associated with this page. If we succeed we try to free
         * the page as well.
         *
         * We do this even if the page is PageDirty().
         * try_to_release_page() does not perform I/O, but it is
         * possible for a page to have PageDirty set, but it is actually
         * clean (all its buffers are clean).  This happens if the
         * buffers were written out directly, with submit_bh(). ext3
         * will do this, as well as the blockdev mapping.
         * try_to_release_page() will discover that cleanness and will
         * drop the buffers and mark the page clean - it can be freed.
         *
         * Rarely, pages can have buffers and no ->mapping.  These are
         * the pages which were not successfully invalidated in
         * truncate_complete_page().  We try to drop those buffers here
         * and if that worked, and the page is no longer mapped into
         * process address space (page_count == 1) it can be freed.
         * Otherwise, leave the page on the LRU so it is swappable.
         */
        if (PagePrivate(page)) {
            if (!try_to_release_page(page, sc->gfp_mask))
                goto activate_locked;
            if (!mapping && page_count(page) == 1)
                goto free_it;
        }

        if (!mapping || !remove_mapping(mapping, page)) /* detach the page from its address space */
            goto keep_locked;

free_it:
        unlock_page(page);
        nr_reclaimed++;
        if (!pagevec_add(&freed_pvec, page)) /* a page vector is used to free the physical pages in batches */
            __pagevec_release_nonlru(&freed_pvec);
        continue;

activate_locked:
        SetPageActive(page);
        pgactivate++;
keep_locked:
        unlock_page(page);
keep:
        list_add(&page->lru, &ret_pages);
        VM_BUG_ON(PageLRU(page));
    }
    list_splice(&ret_pages, page_list);
    if (pagevec_count(&freed_pvec))
        __pagevec_release_nonlru(&freed_pvec);
    count_vm_events(PGACTIVATE, pgactivate);
    return nr_reclaimed;
}
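
Each page leaves the loop above through one of three exits: free_it, keep (or keep_locked), and activate_locked. The userspace sketch below models a heavily reduced version of that flow to show how the asynchronous and synchronous passes cooperate on a dirty page. It is not kernel code: struct page_model, struct scan_model and reclaim_one are hypothetical names; swap allocation, unmapping, buffer release and remove_mapping are left out; and a synchronous pageout is assumed to have completed by the time it returns.

```c
#include <assert.h>
#include <stdbool.h>

enum outcome { RECLAIMED, KEPT, ACTIVATED };  /* free_it / keep / activate_locked */
enum pageout_io { PAGEOUT_IO_ASYNC, PAGEOUT_IO_SYNC };

struct page_model {            /* hypothetical, reduced page state */
    bool locked_elsewhere;     /* TestSetPageLocked() would fail */
    bool writeback;            /* stands in for PageWriteback() */
    bool dirty;                /* stands in for PageDirty() */
    bool referenced;           /* non-zero page_referenced() */
    bool mapping_in_use;       /* stands in for page_mapping_inuse() */
};

struct scan_model {            /* hypothetical subset of scan_control */
    bool costly_order;         /* sc->order > PAGE_ALLOC_COSTLY_ORDER */
    bool may_enter_fs;
    bool may_writepage;
};

static enum outcome reclaim_one(struct page_model *p,
                                const struct scan_model *sc,
                                enum pageout_io mode)
{
    if (p->locked_elsewhere)
        return KEPT;

    if (p->writeback) {
        if (mode != PAGEOUT_IO_SYNC || !sc->may_enter_fs)
            return KEPT;
        p->writeback = false;          /* models wait_on_page_writeback() */
    }

    if (!sc->costly_order && p->referenced && p->mapping_in_use)
        return ACTIVATED;

    if (p->dirty) {
        if (!sc->costly_order && p->referenced)
            return KEPT;
        if (!sc->may_enter_fs || !sc->may_writepage)
            return KEPT;
        p->dirty = false;              /* models pageout() queueing the write */
        if (mode == PAGEOUT_IO_ASYNC) {
            p->writeback = true;       /* I/O is still in flight */
            return KEPT;
        }
        /* Assumption: a synchronous pageout has finished when it returns. */
    }
    return RECLAIMED;
}
```

In this model a dirty page is kept on the first asynchronous pass, because the write has only been queued. A later synchronous pass waits for the writeback and then reclaims the page, which matches the two-pass scheme described in the comment inside the function.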

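The swap token

The kernel issues a so-called swap token to one process that is currently swapping pages in; only a single token exists in the whole system. The benefit is that the pages of the process holding the token are not reclaimed, or at least are spared from reclaim as far as possible.

The swap token is implemented as a global pointer to the mm_struct instance of the process that currently holds it: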

<mm/thrash.c>
struct mm_struct *swap_token_mm;
static unsigned int global_faults;

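global_faults counts the calls to grab_swap_token, that is, the faults handled by do_swap_page for which the page was not found in the swap cache.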

<linux/mm_types.h>
struct mm_struct {
...
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;
...
};

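faultstamp holds the value of global_faults at the time the process last tried to obtain the token.
token_priority is a scheduling priority tied to the swap token; it controls access to the token.
last_interval is the length of the interval the process last waited for the swap token, measured in global_faults increments rather than in time.

The swap token is acquired by calling grab_swap_token: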

<mm/thrash.c>
void grab_swap_token(void)
{
    int current_interval;

    global_faults++;

    current_interval = global_faults - current->mm->faultstamp;

    if (!spin_trylock(&swap_token_lock))
        return;

    /* First come first served */
    if (swap_token_mm == NULL) { /* the token is not currently assigned to any process */
        current->mm->token_priority = current->mm->token_priority + 2;
        swap_token_mm = current->mm;
        goto out;
    }

    if (current->mm != swap_token_mm) {
        if (current_interval < current->mm->last_interval)
            current->mm->token_priority++; /* shorter interval than last time: raise the priority */
        else {
            if (likely(current->mm->token_priority > 0))
                current->mm->token_priority--;
        }
        /* Check if we deserve the token */
        if (current->mm->token_priority >
                swap_token_mm->token_priority) {
            /* The requester outranks the holder: take the token away and hand it to the requester. */
            current->mm->token_priority += 2;
            swap_token_mm = current->mm;
        }
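    } else {
        /*
         * The requester already holds the token, which means it is
         * swapping in a large number of pages. Its demand for memory
         * is evidently strong, so the token priority is raised.
         */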
        /* Token holder came in again! */
        current->mm->token_priority += 2;
    }

out:
    current->mm->faultstamp = global_faults;
    current->mm->last_interval = current_interval;
    spin_unlock(&swap_token_lock);
    return;
}
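
To see how the priority rules play out, the decisions made by grab_swap_token can be replayed in userspace. The sketch below assumes a single thread, so the spin_trylock is omitted, and it passes the address space explicitly instead of using current->mm. struct mm_model, token_owner and grab_token_model are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct mm_model {                      /* hypothetical stand-in for mm_struct */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;
};

static struct mm_model *token_owner;   /* plays the role of swap_token_mm */
static unsigned int global_faults;

/* Same decisions as grab_swap_token(), without locking. */
static void grab_token_model(struct mm_model *mm)
{
    unsigned int current_interval;

    assert(mm != NULL);
    global_faults++;
    current_interval = global_faults - mm->faultstamp;

    if (token_owner == NULL) {
        /* First come, first served. */
        mm->token_priority += 2;
        token_owner = mm;
    } else if (mm != token_owner) {
        if (current_interval < mm->last_interval)
            mm->token_priority++;          /* faulting more often than before */
        else if (mm->token_priority > 0)
            mm->token_priority--;
        if (mm->token_priority > token_owner->token_priority) {
            mm->token_priority += 2;       /* requester takes the token */
            token_owner = mm;
        }
    } else {
        mm->token_priority += 2;           /* holder faulted again */
    }

    mm->faultstamp = global_faults;
    mm->last_interval = current_interval;
}
```

For example, if process A faults first, it becomes the owner with priority 2. A process B that starts with priority 2 and a last_interval of 10 and then faults after a shorter interval rises to 3, outranks A, takes the token and ends at priority 5.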

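grab_swap_token is called from exactly one place in the kernel: do_swap_page, the function responsible for swapping pages in. The token is requested when the wanted page cannot be found in the swap cache and has to be read from the swap area.

When the mm_struct that holds the swap token is no longer needed, put_swap_token must be called to release the token.

has_swap_token tests whether a process holds the swap token. This check is also made in only one place in the kernel: when the kernel determines whether a page has been referenced. Recall that this is one of the basic criteria for deciding whether a page will be reclaimed; the check sits in page_referenced_one, a helper of page_referenced, and is made nowhere else.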

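Handling swap page faults

Swapping in pages

Page faults caused by accessing a swapped-out page are handled by do_swap_page in mm/memory.c.

The kernel not only checks whether the requested page is still, or already, in the swap cache. It also uses a simple readahead scheme that reads several pages from the swap area in one go, to forestall page faults that are likely to follow.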

<mm/memory.c>

static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, pte_t *page_table, pmd_t *pmd,
        int write_access, pte_t orig_pte)
{
    spinlock_t *ptl;
    struct page *page;
    swp_entry_t entry;
    pte_t pte;
    int ret = 0;

    if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
        goto out;

    entry = pte_to_swp_entry(orig_pte);
    if (is_migration_entry(entry)) {
        migration_entry_wait(mm, pmd, address);
        goto out;
    }
    delayacct_set_flag(DELAYACCT_PF_SWAPIN);
    page = lookup_swap_cache(entry);
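    if (!page) { /* the page is not in the swap cache */
        grab_swap_token(); /* Contend for token _before_ read-in */
        /*
         * Issue read requests for the slot of the wanted page and for
         * neighbouring slots. This costs little but speeds the system
         * up noticeably, because processes often access memory
         * sequentially; when they do, the next pages have already
         * been brought in by readahead.
         */
        swapin_readahead(entry, address, vma);
        /*
         * The page is locked before the read request is sent to the
         * block layer and unlocked once the block layer has finished
         * the transfer. Calling lock_page() further down in
         * do_swap_page is therefore sufficient: it blocks until the
         * block layer unlocks the page.
         */
        page = read_swap_cache_async(entry, vma, address);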
        if (!page) {
            /*
             * Back out if somebody else faulted in this pte
             * while we released the pte lock.
             */
            page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
            if (likely(pte_same(*page_table, orig_pte)))
                ret = VM_FAULT_OOM;
            delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
            goto unlock;
        }

        /* Had to read the page from swap area: Major fault */
        ret = VM_FAULT_MAJOR;
        count_vm_event(PGMAJFAULT);
    }

    mark_page_accessed(page); /* make the kernel regard the page as accessed */
    lock_page(page);
    delayacct_clear_flag(DELAYACCT_PF_SWAPIN);

    /*
     * Back out if somebody else already faulted in this pte.
     */
    page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (unlikely(!pte_same(*page_table, orig_pte)))
        goto out_nomap;

    if (unlikely(!PageUptodate(page))) {
        ret = VM_FAULT_SIGBUS;
        goto out_nomap;
    }

    /* The page isn't present yet, go ahead with the fault. */

    inc_mm_counter(mm, anon_rss);
    pte = mk_pte(page, vma->vm_page_prot);
    if (write_access && can_share_swap_page(page)) {
        pte = maybe_mkwrite(pte_mkdirty(pte), vma);
        write_access = 0;
    }

    flush_icache_page(vma, page);
    set_pte_at(mm, address, page_table, pte);
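    page_add_anon_rmap(page, vma, address); /* hook the page into the reverse mapping mechanism */

    swap_free(entry); /* check whether the corresponding slot in the swap area can be released */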
    if (vm_swap_full())
        remove_exclusive_swap_page(page);
    unlock_page(page);

    if (write_access) { /* the fault was caused by a write access */
        /* XXX: We could OR the do_wp_page code with this one? */
        if (do_wp_page(mm, vma, address,
                page_table, pmd, ptl, pte) & VM_FAULT_OOM) /* do_wp_page creates a copy of the page */
            ret = VM_FAULT_OOM;
        goto out;
    }

    /* No need to invalidate - it was non-present before */
    update_mmu_cache(vma, address, pte);
unlock:
    pte_unmap_unlock(page_table, ptl);
out:
    return ret;
out_nomap:
    pte_unmap_unlock(page_table, ptl);
    unlock_page(page);
    page_cache_release(page);
    return ret;
}

Reading the data

read_swap_cache_async:

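The function first calls find_get_page to check whether the page is in the swap cache. This can happen because readahead may already have brought the page in. If the page is already in memory, so much the better, since that simplifies matters: the requested page can be returned immediately.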

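If the page is not found, alloc_page_vma (which on non-NUMA systems ultimately boils down to __alloc_pages) must be called to allocate a fresh page to hold the data read from the swap area. Allocation requests made through __alloc_pages have high priority: if not enough free memory is available, for instance, the kernel tries to swap out other pages to make room. Failure of the function (a NULL return) is a serious problem and causes the swap-in to be abandoned at once. In that case higher-level code tells the OOM killer to terminate a comparatively unimportant process that holds a relatively large number of pages, in order to obtain free memory.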

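If the allocation succeeds (the usual case, since few users load the system so heavily that the OOM killer is needed), the kernel adds the page instance to the swap cache with add_to_swap_cache and puts it on the LRU cache for active pages with lru_cache_add_active. The page data is then transferred from the swap area to physical memory by swap_readpage.

Once the necessary prerequisites are met, swap_readpage starts the transfer from disk to physical memory in two short steps: get_swap_bio builds a suitable BIO request, and submit_bio sends that request to the block layer.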

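Two points deserve particular attention:

add_to_swap_cache locks the page automatically.
swap_readpage instructs the block layer to call end_swap_bio_read once the page has been read in completely. If all went well, that function sets the PG_uptodate flag on the page and unlocks it.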
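
The lock handoff described in these two points can be modelled in a few lines of userspace C. This is a sketch only: struct page_model, swap_area, swap_cache, read_swap_cache_model and end_read_model are hypothetical names, the swap cache is reduced to an array indexed by slot, and I/O completion is triggered by hand rather than by a block layer.

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 8

struct page_model {            /* hypothetical, reduced page state */
    bool in_cache;             /* page is present in the swap cache */
    bool locked;               /* page lock, held while I/O is pending */
    bool uptodate;             /* stands in for PG_uptodate */
    unsigned int slot;
    int data;
};

static const int swap_area[NSLOTS] = { 0, 11, 22, 33, 44, 55, 66, 77 };
static struct page_model swap_cache[NSLOTS];

/* Cache hit: return the page. Miss: add it to the cache, locked, with a read pending. */
static struct page_model *read_swap_cache_model(unsigned int slot)
{
    struct page_model *p;

    assert(slot < NSLOTS);
    p = &swap_cache[slot];
    if (p->in_cache)
        return p;              /* models a successful find_get_page() */

    p->in_cache = true;        /* models add_to_swap_cache(), which locks the page */
    p->locked = true;
    p->uptodate = false;
    p->slot = slot;
    return p;                  /* the read has been "submitted" but not completed */
}

/* Models end_swap_bio_read(): the data has arrived, mark it valid and unlock. */
static void end_read_model(struct page_model *p)
{
    p->data = swap_area[p->slot];
    p->uptodate = true;
    p->locked = false;
}
```

Between the two calls the page is in the cache but locked and not yet up to date, which is exactly the window in which lock_page in do_swap_page would block.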

Swap readahead

swapin_readahead
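
These notes do not reproduce the body of swapin_readahead. Purely to illustrate the idea of reading "the wanted slot and its neighbours", the sketch below assumes a window of 2^k slots aligned on a 2^k boundary around the target slot. readahead_window, struct slot_range and cluster_shift are hypothetical names, and the kernel's actual selection logic may differ.

```c
#include <assert.h>

struct slot_range {
    unsigned long first;   /* first slot to read, inclusive */
    unsigned long last;    /* last slot to read, inclusive */
};

/*
 * Assumed policy: read the aligned group of 2^cluster_shift slots that
 * contains the target. Slot 0 is skipped because it holds the swap header.
 */
static struct slot_range readahead_window(unsigned long target,
                                          unsigned int cluster_shift)
{
    struct slot_range r;

    assert(target != 0);           /* slot 0 never holds page data */
    assert(cluster_shift < 16);

    r.first = (target >> cluster_shift) << cluster_shift;
    r.last = r.first + (1UL << cluster_shift) - 1;
    if (r.first == 0)
        r.first = 1;
    return r;
}
```

With a shift of 3, a fault on slot 13 would read slots 8 through 15, and a fault on slot 5 would read slots 1 through 7.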
