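Early on a weekend morning I was woken by an alert: the ResourceManager (RM) kept failing over between active and standby.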

A hurried investigation turned up two error logs.

Error 1

ervation <memory:0, vCores:0>
2019-12-21 11:51:57,781 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode.unreserveResource(FSSchedulerNode.java:88)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.unreserve(FSAppAttempt.java:589)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:899)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:846)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1479)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
at java.lang.Thread.run(Thread.java:748)

Error 2

2019-12-21 07:37:45,533 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainerInternal(FairScheduler.java:902)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:564)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:837)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1475)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:804)
at java.lang.Thread.run(Thread.java:748)
2019-12-21 07:37:45,534 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..

Looking at the corresponding source in FairScheduler:

@Override
protected void completedContainerInternal(
    RMContainer rmContainer, ContainerStatus containerStatus,
    RMContainerEventType event) {
  try {
    writeLock.lock();
    Container container = rmContainer.getContainer();

    // Get the application for the finished container
    FSAppAttempt application =
        getCurrentAttemptForContainer(container.getId());
    ApplicationId appId =
        container.getId().getApplicationAttemptId().getApplicationId();
    if (application == null) {
      LOG.info("Container " + container + " of" +
          " finished application " + appId +
          " completed with event " + event);
      return;
    }

    // Get the node on which the container was allocated
    FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());

    if (rmContainer.getState() == RMContainerState.RESERVED) {
      // Releases this container's reservation on the node
      application.unreserve(rmContainer.getReservedPriority(), node);
    } else {
      try {
        application.containerCompleted(rmContainer, containerStatus, event);
        node.releaseContainer(rmContainer.getContainerId(), false);
        updateRootQueueMetrics();
        LOG.info("Application attempt " + application.getApplicationAttemptId()
            + " released container " + container.getId() + " on node: " + node
            + " with event: " + event);
      } catch (Exception e) {
        LOG.error(e.getMessage(), e);
      }
    }
  } finally {
    writeLock.unlock();
  }
}

Stepping into unreserve:

// FSAppAttempt
/**
 * Remove the reservation on {@code node} at the given {@link Priority}.
 * This dispatches SchedulerNode handlers as well.
 */
public void unreserve(Priority priority, FSSchedulerNode node) {
  RMContainer rmContainer = node.getReservedContainer();
  unreserveInternal(priority, node);
  node.unreserveResource(this);
  clearReservation(node);
  getMetrics().unreserveResource(node.getPartition(),
      getUser(), rmContainer.getContainer().getResource());
}

// FSSchedulerNode
@Override
public synchronized void unreserveResource(
    SchedulerApplicationAttempt application) {
  // Cannot unreserve for wrong application...
  // The NPE is thrown here: the attempt id of this container cannot be obtained
  ApplicationAttemptId reservedApplication =
      getReservedContainer().getContainer().getId().getApplicationAttemptId();
  if (!reservedApplication.equals(
      application.getApplicationAttemptId())) {
    throw new IllegalStateException("Trying to unreserve " +
        " for application " + application.getApplicationId() +
        " when currently reserved " +
        " for application " + reservedApplication.getApplicationId() +
        " on node " + this);
  }

  setReservedContainer(null);
  this.reservedAppSchedulable = null;
}
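
To make the failure mode concrete, here is a minimal, self-contained sketch of a double unreserve. It does not use any Hadoop classes: Node, Reserved, and the method names are hypothetical stand-ins for FSSchedulerNode and RMContainer, and it assumes the reservation on a node is simply a nullable field that the first unreserve clears.

```java
// Hypothetical stand-ins, not Hadoop classes: a node holding one nullable reservation.
public class DoubleUnreserveDemo {

    /** Stand-in for a reserved container; it only carries the owning attempt id. */
    static final class Reserved {
        final String attemptId;

        Reserved(String attemptId) {
            this.attemptId = attemptId;
        }
    }

    /** Stand-in for a scheduler node with at most one reservation. */
    static final class Node {
        private Reserved reserved;

        synchronized void reserve(String attemptId) {
            reserved = new Reserved(attemptId);
        }

        /** Same shape as the unguarded code: dereferences the reservation directly. */
        synchronized void unreserveUnguarded(String attemptId) {
            String owner = reserved.attemptId; // NPE if the reservation is already gone
            if (!owner.equals(attemptId)) {
                throw new IllegalStateException("reserved for " + owner);
            }
            reserved = null;
        }

        /** Guarded variant: a repeated unreserve is a no-op and returns false. */
        synchronized boolean unreserveGuarded(String attemptId) {
            if (reserved == null) {
                return false; // already released, nothing to do
            }
            if (!reserved.attemptId.equals(attemptId)) {
                throw new IllegalStateException("reserved for " + reserved.attemptId);
            }
            reserved = null;
            return true;
        }
    }

    /** Returns true if unreserving twice without a guard throws a NullPointerException. */
    static boolean secondUnguardedCallThrowsNpe() {
        Node node = new Node();
        node.reserve("attempt_1");
        node.unreserveUnguarded("attempt_1");
        try {
            node.unreserveUnguarded("attempt_1");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded second call throws NPE: "
            + secondUnguardedCallThrowsNpe());
        Node node = new Node();
        node.reserve("attempt_1");
        System.out.println("guarded first call: " + node.unreserveGuarded("attempt_1"));
        System.out.println("guarded second call: " + node.unreserveGuarded("attempt_1"));
    }
}
```

Note that synchronized on the node does not help here: both calls run one after the other, and the second simply finds the reservation already cleared.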

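The second error:

rmContainer is null. If a call to removeApplicationAttempt and the handling of moveApplication for the same attempt happen in quick succession, the application attempt still holds its queue reference but has already been removed from the queue's application list. The same is true if two calls to removeApplicationAttempt arrive back to back. In both cases the second call has to come in before the removeApplication call has been made.

In effect, the container is released twice: it has already been released on the node, so the two sides end up with inconsistent state. The code here uses a write lock. When one thread has already read the containerId and another thread releases the container, releasing it again throws the exception.

Fix 1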
/**
 * Clean up a completed container.
 */
@Override
protected synchronized void completedContainerInternal(
    RMContainer rmContainer, ContainerStatus containerStatus,
    RMContainerEventType event) {
  try {
    // writeLock.lock(); // write lock commented out; the method is now synchronized (a heavyweight lock)
    Container container = rmContainer.getContainer();

    // Get the application for the finished container
    FSAppAttempt application =
        getCurrentAttemptForContainer(container.getId());
    ApplicationId appId =
        container.getId().getApplicationAttemptId().getApplicationId();
    if (application == null) {
      LOG.info("Container " + container + " of" +
          " finished application " + appId +
          " completed with event " + event);
      return;
    }

Fix 2

// Get the node on which the container was allocated
FSSchedulerNode node = getFSSchedulerNode(container.getNodeId());
try {
  if (rmContainer.getState() == RMContainerState.RESERVED) {
    application.unreserve(rmContainer.getReservedPriority(), node);
  } else {
    // try { // the try was moved up so that it also covers the unreserve call
    application.containerCompleted(rmContainer, containerStatus, event);
    node.releaseContainer(rmContainer.getContainerId(), false);
    updateRootQueueMetrics();
    LOG.info("Application attempt " + application.getApplicationAttemptId()
        + " released container " + container.getId()
        + " on node: " + node + " with event: " + event);
  }
} catch (Exception e) {
  LOG.error(e.getMessage(), e); // handle the exception here instead of letting it propagate
}
 
