A friend ran into a thorny problem. The server log showed the following error:

01/21/2014 11:47:43,spid296,Unknown,Error: 18056, Severity: 20, State: 29.
01/21/2014 11:47:43,spid495,Unknown,
The client was unable to reuse a session with SPID 495,
which had been reset for connection pooling. The failure ID is 29.
This error may have been caused by an earlier operation failing.
Check the error logs for failed operations immediately before this error message.

A Baidu search turned up a related article: http://blog.csdn.net/yangzhawen/article/details/8209167

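On one hand, the developers were asked to investigate from the IIS side. On the other, I started from SQL Server and kept reading the error log, which contained the following entries: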

01/21/2014 11:46:10,spid8s,Unknown,
SQL Server has encountered 3 occurrence(s) of I/O requests taking longer
than 15 seconds to complete on file [H:templog.ldf] in database [tempdb] The OS file handle is 0x0000000000001254.
The offset of the latest long I/O is: 0x000000184a6a00
01/21/2014 11:46:10,spid8s,Unknown,
SQL Server has encountered 3 occurrence(s) of I/O
requests taking longer than 15 seconds to complete
on file [H:\HisData\TXX.mdf] in database [xxx] The OS file handle is 0x0000000000001268.
The offset of the latest long I/O is: 0x00000349a6e000
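
The long-I/O warnings above name two files. As a rough cross-check, the cumulative stall time per file can be read from sys.dm_io_virtual_file_stats. This is only a sketch and was not part of the original investigation. The column aliases are my own, and the figures are averages since the instance last started, so they will not isolate a single 15-second stall.

```sql
-- Sketch: average I/O stall per database file since instance startup.
-- Aliases (AvgReadStallMs, AvgWriteStallMs, ...) are illustrative names.
SELECT
    DB_NAME(vfs.database_id) AS DatabaseName
    ,mf.physical_name AS PhysicalName
    ,vfs.num_of_reads AS NumReads
    ,vfs.num_of_writes AS NumWrites
    -- NULLIF avoids a divide-by-zero for files with no reads or writes yet
    ,CAST(vfs.io_stall_read_ms * 1.0 / NULLIF(vfs.num_of_reads, 0) AS DECIMAL(28, 2)) AS AvgReadStallMs
    ,CAST(vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS DECIMAL(28, 2)) AS AvgWriteStallMs
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
INNER JOIN sys.master_files mf
    ON mf.database_id = vfs.database_id
    AND mf.file_id = vfs.file_id
ORDER BY vfs.io_stall DESC;
```

Files that sit at the top of this list with high average stalls are the ones worth comparing against the file names in the log.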

Use the following code to find the cached queries that consume the most CPU and I/O, along with their execution plans:

-- Top 20 queries by CPU time, with their execution plans
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
WITH TEMP AS (
    SELECT CAST((qs.total_worker_time) / 1000000.0 AS DECIMAL(28, 2)) AS TotalSecondsForCPUTime
    ,CAST(qs.total_worker_time * 100.0 / qs.total_elapsed_time AS DECIMAL(28, 2)) AS CPUPercent
    ,CAST((qs.total_elapsed_time - qs.total_worker_time) * 100.0 / qs.total_elapsed_time AS DECIMAL(28, 2)) AS WaitingPercent
    ,qs.execution_count AS ExecutionCount
    ,CAST((qs.total_worker_time) / 1000000.0 / qs.execution_count AS DECIMAL(28, 2)) AS AvgSecondsForCPUTime
    -- Offsets are in bytes of Unicode text, hence the division by 2
    ,SUBSTRING(qt.text, (qs.statement_start_offset / 2) + 1,
        ((CASE WHEN qs.statement_end_offset = -1
            THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS IndividualQuery
    ,qt.text AS ParentQuery
    ,DB_NAME(qt.dbid) AS DatabaseName
    ,qp.query_plan AS QueryPlan
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
    WHERE qs.total_elapsed_time > 0
)
SELECT TOP (20) * FROM TEMP
ORDER BY TEMP.TotalSecondsForCPUTime DESC;

-- Top 20 queries by logical I/O, with their execution plans
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
WITH TEMP AS (
    SELECT (qs.total_logical_reads + qs.total_logical_writes) AS TotalIO
    ,(qs.total_logical_reads + qs.total_logical_writes) / qs.execution_count AS AvgIO
    ,qs.execution_count AS ExecutionCount
    ,SUBSTRING(qt.text, (qs.statement_start_offset / 2) + 1,
        ((CASE WHEN qs.statement_end_offset = -1
            THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS IndividualQuery
    ,qt.text AS ParentQuery
    ,DB_NAME(qt.dbid) AS DatabaseName
    ,qp.query_plan AS QueryPlan
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
)
SELECT TOP (20) * FROM TEMP
ORDER BY TEMP.TotalIO DESC;

This surfaced a number of long-running, I/O-heavy statements, which were extracted for later analysis and tuning.

Next, use the following code to view the current requests and any blocking:

SELECT
SPID = er.session_id
,STATUS = ses.STATUS
,[LOGIN] = ses.login_name
,HOST = ses.host_name
,BlkBy = er.blocking_session_id
,DBName = DB_NAME(er.database_id)
,CommandType = er.command
,SQLStatement = st.text
,ObjectName = OBJECT_NAME(st.objectid)
,ElapsedMS = er.total_elapsed_time
,CPUTime = er.cpu_time
,IOReads = er.logical_reads + er.reads
,IOWrites = er.writes
,LastWaitType = er.last_wait_type
,StartTime = er.start_time
,Protocol = con.net_transport
,ConnectionWrites = con.num_writes
,ConnectionReads = con.num_reads
,ClientAddress = con.client_net_address
,Authentication = con.auth_scheme
FROM sys.dm_exec_requests er
OUTER APPLY sys.dm_exec_sql_text(er.sql_handle) st
LEFT JOIN sys.dm_exec_sessions ses
ON ses.session_id = er.session_id
LEFT JOIN sys.dm_exec_connections con
ON con.session_id = ses.session_id
WHERE er.session_id > 50
ORDER BY er.blocking_session_id DESC,er.session_id

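This revealed the following problem:

MSDN describes the two columns as follows:

ConnectionWrites (num_writes): number of data packet writes that have occurred over this connection. Nullable.
ConnectionReads (num_reads): number of packet reads that have occurred over this connection. Nullable.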

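On an ordinary OLTP database, ConnectionWrites and ConnectionReads usually range from a few dozen to a few hundred, but on this server they reached about 3.52 million. After checking, 192.168.8.16 turned out to be the server for a new product, and the onset of the server problems roughly coincides with that product's launch, so the new product is very likely the root cause.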
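
To see at a glance which client machine generates the most packets, the per-connection counters can be summed by client address. This is a sketch rather than the query used in the original diagnosis: the aliases are my own, and it assumes the heavy traffic is still visible on connections that are currently open, since sys.dm_exec_connections only lists live connections.

```sql
-- Sketch: total packet reads/writes per client address across open connections.
-- Aliases (ConnectionCount, TotalConnectionReads, ...) are illustrative names.
SELECT
    con.client_net_address AS ClientAddress
    ,COUNT(*) AS ConnectionCount
    -- Cast to BIGINT so large per-connection counters cannot overflow the sum
    ,SUM(CAST(con.num_reads AS BIGINT)) AS TotalConnectionReads
    ,SUM(CAST(con.num_writes AS BIGINT)) AS TotalConnectionWrites
FROM sys.dm_exec_connections con
GROUP BY con.client_net_address
ORDER BY TotalConnectionWrites DESC;
```

An address whose totals are orders of magnitude above the rest would stand out in the first row or two.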

What remains is to analyze why the product needs such a huge number of network packets and to find a fix.
