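xgboost is an improved algorithm built on the principles of GBDT. It is efficient and supports parallel computation.

During training it can also score each feature, which indicates how important each feature is to the model.

I won't go through the calling code in detail; this post focuses on how the score is computed. The source of the function get_fscore is shown below.

The source comes from the installation package: xgboost/python-package/xgboost/core.py

As the source below shows, a feature's score (with the default importance type 'weight') is the number of times the feature is used to split a node across all decision trees. This differs somewhat from the formula in Section 10.13.1 of The Elements of Statistical Learning: Data Mining, Inference, and Prediction, so take note.

Note: the two approach importance from different angles, so the calculations differ slightly.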
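
To make the difference concrete, here is a sketch of the two quantities. The first line is the relative importance measure of Section 10.13.1 as I understand it: for a single tree T with J terminal nodes, sum the squared improvement of every internal node that splits on feature ℓ, then average over the M trees of the ensemble. The second line is the 'weight' count written in the same notation; the symbol weight_ℓ and the phrase "internal nodes of T_m" are my own shorthand, not notation from the book or from xgboost.

$$
\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2}\, I\big(v(t) = \ell\big),
\qquad
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)
$$

$$
\mathrm{weight}_\ell = \sum_{m=1}^{M} \; \sum_{t \,\in\, \text{internal nodes of } T_m} I\big(v(t) = \ell\big)
$$

Here v(t) is the feature used to split node t, \hat{\imath}_t^{\,2} is the improvement in squared error gained by that split, and I(·) is the indicator function. In other words, 'weight' only counts splits, while the book's measure weights each split by how much it improved the fit.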

    def get_fscore(self, fmap=''):
        """Get feature importance of each feature.

        Parameters
        ----------
        fmap: str (optional)
           The name of feature map file
        """

        return self.get_score(fmap, importance_type='weight')

    def get_score(self, fmap='', importance_type='weight'):
        """Get feature importance of each feature.
        Importance type can be defined as:
            'weight' - the number of times a feature is used to split the data across all trees.
            'gain' - the average gain of the feature when it is used in trees
            'cover' - the average coverage of the feature when it is used in trees

        Parameters
        ----------
        fmap: str (optional)
           The name of feature map file
        """

        if importance_type not in ['weight', 'gain', 'cover']:
            msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
            raise ValueError(msg.format(importance_type))

        # if it's weight, then omap stores the number of missing values
        if importance_type == 'weight':
            # do a simpler tree dump to save time
            trees = self.get_dump(fmap, with_stats=False)

            fmap = {}
            for tree in trees:
                for line in tree.split('\n'):
                    # look for the opening square bracket
                    arr = line.split('[')
                    # if no opening bracket (leaf node), ignore this line
                    if len(arr) == 1:
                        continue

                    # extract feature name from string between []
                    fid = arr[1].split(']')[0].split('<')[0]

                    if fid not in fmap:
                        # if the feature hasn't been seen yet
                        fmap[fid] = 1
                    else:
                        fmap[fid] += 1

            return fmap

        else:
            trees = self.get_dump(fmap, with_stats=True)

            importance_type += '='
            fmap = {}
            gmap = {}
            for tree in trees:
                for line in tree.split('\n'):
                    # look for the opening square bracket
                    arr = line.split('[')
                    # if no opening bracket (leaf node), ignore this line
                    if len(arr) == 1:
                        continue

                    # look for the closing bracket, extract only info within that bracket
                    fid = arr[1].split(']')

                    # extract gain or cover from string after closing bracket
                    g = float(fid[1].split(importance_type)[1].split(',')[0])

                    # extract feature name from string before closing bracket
                    fid = fid[0].split('<')[0]

                    if fid not in fmap:
                        # if the feature hasn't been seen yet
                        fmap[fid] = 1
                        gmap[fid] = g
                    else:
                        fmap[fid] += 1
                        gmap[fid] += g

            # calculate average value (gain/cover) for each feature
            for fid in gmap:
                gmap[fid] = gmap[fid] / fmap[fid]

            return gmap
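
The parsing logic above can be tried out without training a model. The sketch below applies the same string-splitting idea to two hand-written trees in the style of xgboost's text dump. The sample dump, the feature names f0 and f1, all numbers, and the helper name split_stats are made up for illustration; none of them is part of xgboost.

```python
from collections import defaultdict

# Hand-written sample imitating a text dump with statistics.
# Feature names and all numbers are invented for illustration.
SAMPLE_DUMP = [
    "0:[f0<0.5] yes=1,no=2,missing=1,gain=10.0,cover=8.0\n"
    "\t1:[f1<1.5] yes=3,no=4,missing=3,gain=4.0,cover=5.0\n"
    "\t\t3:leaf=0.1,cover=2.0\n"
    "\t\t4:leaf=-0.2,cover=3.0\n"
    "\t2:leaf=0.3,cover=3.0",
    "0:[f0<0.7] yes=1,no=2,missing=1,gain=6.0,cover=8.0\n"
    "\t1:leaf=0.05,cover=4.0\n"
    "\t2:leaf=-0.05,cover=4.0",
]


def split_stats(trees, stat="gain"):
    """Hypothetical helper: return (split counts, average stat) per feature."""
    counts = defaultdict(int)    # 'weight': how often each feature splits a node
    totals = defaultdict(float)  # running sum of the chosen statistic
    key = stat + "="
    for tree in trees:
        for line in tree.split("\n"):
            parts = line.split("[")
            if len(parts) == 1:
                # no opening bracket means a leaf node, which has no split
                continue
            inside, after = parts[1].split("]")
            fid = inside.split("<")[0]  # feature name before the threshold
            value = float(after.split(key)[1].split(",")[0])
            counts[fid] += 1
            totals[fid] += value
    averages = {fid: totals[fid] / counts[fid] for fid in counts}
    return dict(counts), averages


weight, avg_gain = split_stats(SAMPLE_DUMP, "gain")
print(weight)    # {'f0': 2, 'f1': 1}
print(avg_gain)  # {'f0': 8.0, 'f1': 4.0}
```

In this toy example f0 is used for two splits and f1 for one, so f0 has the higher 'weight'. The 'gain' value is an average per split, so a feature that is used rarely but splits well can still rank high under 'gain' while ranking low under 'weight'.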

How GBDT feature scores are computed:

Link: 1. http://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/

For a detailed, code-level explanation, follow the link above to the page below:

http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting
