MCE symptoms

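Intel implements the machine-check architecture in the Pentium 4, Xeon, and P6 family processors. It provides a mechanism for detecting and reporting hardware (machine) errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. It consists of a set of MSRs (Model-Specific Registers) used to set up machine checking, plus additional banks of MSRs used to record the errors.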

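When the processor detects an uncorrectable machine-check error, it raises a machine-check exception (MCE). The machine-check architecture generally does not allow the processor to be restarted reliably after an MCE, but the MCE handler can collect information about the error from the MSRs.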

CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400

RIP !INEXACT! 10:<ffffffff8010f16e> {mwait_idle+0x5e/0x90}

TSC 1952dbeebcc8

Kernel panic: Machine check

Reconfiguring memory bank information….

This may take a while….

done waiting: 3 cpus not responding

Warning: Non-empty request queue

I/O requests in flight at dump time

CPU 7: Machine Check Exception: 4 Bank 0: f200004040000400

RIP !INEXACT! 10:<ffffffff8011ef69>

How to identify an MCE error

Any kernel crash that prints "Machine Check Exception", or whose kernel stack trace contains the do_machine_check() function, is an MCE problem.
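The rule above can be turned into a quick check over a saved console or dmesg capture. This is only a minimal sketch: the function name `is_mce_log` is made up for illustration, and it does nothing more than look for the two marker strings mentioned above.

```python
def is_mce_log(text: str) -> bool:
    """Return True if a kernel log excerpt contains one of the MCE markers."""
    # The two markers come straight from the rule above; matching is case-sensitive.
    markers = ("Machine Check Exception", "do_machine_check")
    return any(marker in text for marker in markers)


if __name__ == "__main__":
    sample = "CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400"
    print(is_mce_log(sample))
```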

Sources of MCE errors

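  • PCI-E device signal quality or clock problems
  • Damaged CPU chip or CPU design bug

    A damaged CPU cache or some other fault

  • Possible CPU defects

    For example, defects introduced during CPU manufacturing

  • Faulty or poorly seated memory
  • Improper BIOS configuration
  • Bugs in the OS or the MCE handler
  • Environmental factors such as temperature and humidity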

Decoding MCE error codes

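Taking the MCE above as an example, the value printed after "Machine Check Exception" corresponds to the IA32_MCG_STATUS MSR, and the value printed after "Bank 0" (or "Bank 5") corresponds to the IA32_MCi_STATUS MSR of that bank.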

The corresponding register values are:

IA32_MCG_STATUS MSR value: 0000000000000004

IA32_MC0_STATUS MSR value: f200000410000800

IA32_MC5_STATUS MSR value: f200001044100e0f

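With the MSR values in hand, the cause of the MCE can usually be found fairly easily by checking them against the Intel programming manuals and other Intel documentation.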
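As a first step before opening the manual, the architectural flag bits can be pulled out of the raw values. The sketch below assumes the standard layout from the Intel manuals: RIPV/EIPV/MCIP in bits 0 to 2 of IA32_MCG_STATUS, and VAL/OVER/UC/EN/MISCV/ADDRV/PCC in bits 63 down to 57 of IA32_MCi_STATUS, with the MCA error code in bits 15:0 and the model-specific code in bits 31:16. The function and dictionary names are made up for illustration, and what the model-specific bits mean still has to be looked up per CPU model.

```python
# Flag bit positions in the architectural machine-check status MSRs.
MCG_STATUS_BITS = {"RIPV": 0, "EIPV": 1, "MCIP": 2}
MCI_STATUS_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # reporting for this error was enabled
    "MISCV": 59,  # IA32_MCi_MISC holds valid data
    "ADDRV": 58,  # IA32_MCi_ADDR holds valid data
    "PCC": 57,    # processor context may be corrupt
}


def decode_flags(value: int, bits: dict) -> dict:
    """Map each flag name to True/False according to its bit in value."""
    return {name: bool((value >> pos) & 1) for name, pos in bits.items()}


def decode_mci_status(value: int) -> dict:
    """Split an IA32_MCi_STATUS value into its flags and the two error codes."""
    fields = decode_flags(value, MCI_STATUS_BITS)
    fields["mca_error_code"] = value & 0xFFFF
    fields["model_specific_code"] = (value >> 16) & 0xFFFF
    return fields


if __name__ == "__main__":
    print(decode_flags(0x0000000000000004, MCG_STATUS_BITS))
    for status in (0xF200000410000800, 0xF200001044100E0F):
        print(hex(status), decode_mci_status(status))
```

For the MCG_STATUS value 4, only MCIP is set. Both bank values above have UC and PCC set and ADDRV clear, so no faulting address was recorded.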

dmesg output

...

sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
TSC 0 ADDR 67081b300 MISC 2140040486 PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr= 0x67081b300 => socket=0, Channel=3(mask=8), rank=0 ...

Save these four log lines to a file named mlog (here /tmp/mlog)

# mcelog --ascii < /tmp/mlog
WARNING: with --dmi mcelog --ascii must run on the same machine with the
same BIOS/memory configuration as where the machine check occurred.
sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
Wed Sep 2 16:14:36 2015
CPU 0 BANK 5 MISC 2140040486 ADDR 67081b300
STATUS 8c00004000010093 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 45
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
<24> DIMM 1333 Mhz Res13 Width 72 Data Width 64 Size 16 GB
Device Locator: Node0_Channel2_Dimm0
Bank Locator: Node0_Bank0
Manufacturer: Hynix Semiconducto
Serial Number: 40743B5A
Asset Tag: Dimm2_AssetTag
Part Number: HMT42GR7BFR4A-PB
TSC 0 ADDR 67081b300 MISC 2140040486 PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x67081b300 => socket=0, Channel=3(mask=8), rank=0
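The "Err=0001:0093" in the EDAC line is just the model-specific code and the MCA error code taken from STATUS 8c00004000010093. The low 16 bits can be decoded by hand as a memory controller error. The sketch below assumes the compound error code layout from the Intel manuals, 000F 0000 1MMM CCCC, where F is the correction-report filtering bit, MMM the memory transaction type, and CCCC the channel. The helper name `decode_memory_controller_code` is made up for illustration.

```python
# MMM field values for memory controller errors (compound MCA error codes).
MEMORY_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_memory_controller_code(status: int):
    """Decode bits 15:0 of IA32_MCi_STATUS as a memory controller error.

    Returns None when the code does not match 000F 0000 1MMM CCCC.
    """
    code = status & 0xFFFF
    # Ignore bit 12 (the F bit), then require bit 7 set and the other upper bits clear.
    if code & 0xEF80 != 0x0080:
        return None
    operation = (code >> 4) & 0b111
    channel = code & 0b1111
    return {
        "operation": MEMORY_OPS.get(operation, "reserved"),
        # 1111 means the channel was not specified.
        "channel": None if channel == 0b1111 else channel,
    }


if __name__ == "__main__":
    print(decode_memory_controller_code(0x8C00004000010093))
```

For this STATUS value the result is a memory read error on channel 3, which agrees with "memory read" and "ch=3" in the EDAC message.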

Using these two values from the mcelog output:
Part Number: HMT42GR7BFR4A-PB
Serial Number: 40743B5A

find the matching hardware in the lshw output

...

     *-memory:0
          description: System Memory
          physical id: 2d
          slot: System board or motherboard
        *-bank:0
             description: DIMM 1333 MHz (0.8 ns)
             product: HMT42GR7BFR4A-PB
             vendor: Hynix Semiconducto
             physical id: 0
             serial: 905D21AE
             slot: Node0_Channel1_Dimm0
             size: 16GiB
             width: 64 bits
             clock: 1333MHz (0.8ns)
        *-bank:1
             description: DIMM Synchronous [empty]
             product: A1_Dimm1_PartNumber
             vendor: Dimm1_Manufacturer
             physical id: 1
             serial: Dimm1_SerNum
             slot: Node0_Channel1_Dimm1
             width: 64 bits
        *-bank:2
             description: DIMM 1333 MHz (0.8 ns)
             product: HMT42GR7BFR4A-PB
             vendor: Hynix Semiconducto
             physical id: 2
             serial: 40743B5A
             slot: Node0_Channel2_Dimm0
             size: 16GiB
             width: 64 bits
             clock: 1333MHz (0.8ns)
     ...
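On a machine with many DIMMs, searching the lshw text by eye gets tedious. The lookup can be scripted, roughly as in the sketch below. It assumes the plain-text lshw layout shown above (a "*-name" line opens a section, followed by "key: value" lines). The function names and the shortened sample are only for illustration.

```python
def parse_lshw_sections(lshw_text: str) -> list:
    """Split plain-text lshw output into one dict per '*-name' section."""
    sections = []
    for raw in lshw_text.splitlines():
        line = raw.strip()
        if line.startswith("*-"):
            # A new section starts, e.g. '*-bank:2'.
            sections.append({"_node": line[2:]})
        elif sections and ": " in line:
            key, value = line.split(": ", 1)
            sections[-1][key] = value
    return sections


def find_by_serial(lshw_text: str, serial: str):
    """Return the first section whose 'serial' field equals serial, else None."""
    for section in parse_lshw_sections(lshw_text):
        if section.get("serial") == serial:
            return section
    return None


# Shortened copy of the lshw output above.
SAMPLE = """
*-memory:0
description: System Memory
*-bank:0
product: HMT42GR7BFR4A-PB
serial: 905D21AE
slot: Node0_Channel1_Dimm0
*-bank:2
product: HMT42GR7BFR4A-PB
serial: 40743B5A
slot: Node0_Channel2_Dimm0
"""

if __name__ == "__main__":
    hit = find_by_serial(SAMPLE, "40743B5A")
    print(hit["_node"], hit["slot"])
```

The serial number from mcelog lands on bank:2, slot Node0_Channel2_Dimm0, the same name mcelog reported as the Device Locator.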
