Assuming that you're using xgboost to fit boosted trees for binary classification, the importance matrix is a data.table object whose first column lists the names of all the features actually used in the boosted trees.

The other columns of the importance data table have the following meaning:

  1. Gain is the relative contribution of the corresponding feature to the model, calculated from the feature's contribution in each tree of the model. A feature with a higher Gain than another feature is more important for generating a prediction.
  2. Cover is the relative number of observations related to the feature. For example, suppose you have 100 observations, 4 features and 3 trees, and feature1 is used to decide the leaf node for 10, 5 and 2 observations in tree1, tree2 and tree3 respectively. The raw cover for feature1 is then 10+5+2 = 17 observations. The same count is made for all 4 features, and the reported Cover for feature1 is 17 expressed as a percentage of the sum of all features' raw cover counts.
  3. Frequence (frequency) is the percentage representing the relative number of times a particular feature occurs in the trees of the model. In the above example, if feature1 occurred in 2 splits, 1 split and 3 splits in tree1, tree2 and tree3 respectively, then the weight for feature1 is 2+1+3 = 6. The frequency for feature1 is its weight expressed as a percentage of the sum of the weights of all features.
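The arithmetic behind Cover and Frequency in the example above can be sketched in a few lines of Python. This is only an illustration of the normalisation step, not xgboost code: the per-tree counts for feature1 are the ones from the example, while the counts for feature2 to feature4 and the helper name `relative_share` are made up to complete the calculation.

```python
# Per-tree counts for feature1 come from the example in the text.
# The counts for feature2..feature4 are hypothetical, chosen only so that
# the percentages can be computed.
cover_counts = {
    "feature1": [10, 5, 2],    # observations decided by feature1 in tree1..tree3
    "feature2": [30, 20, 10],  # hypothetical
    "feature3": [5, 5, 5],     # hypothetical
    "feature4": [4, 3, 1],     # hypothetical
}
split_counts = {
    "feature1": [2, 1, 3],     # splits on feature1 in tree1..tree3
    "feature2": [3, 2, 2],     # hypothetical
    "feature3": [1, 1, 1],     # hypothetical
    "feature4": [1, 0, 1],     # hypothetical
}


def relative_share(per_tree_counts):
    """Sum each feature's counts over all trees, then divide by the grand total."""
    totals = {name: sum(counts) for name, counts in per_tree_counts.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


cover = relative_share(cover_counts)
frequency = relative_share(split_counts)

for name in cover_counts:
    print(f"{name}: cover={cover[name]:.1%}, frequency={frequency[name]:.1%}")
```

With these made-up counts, feature1 ends up with a Cover of 17% (17 out of 100) and a Frequency of about 33% (6 out of 18 splits). Both columns sum to 100% across features, which is why they are only meaningful relative to one another.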

The Gain is the most relevant attribute to interpret the relative importance of each feature.

Gain is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements. After adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).
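The intuition in the previous paragraph can be shown with a toy example that counts misclassified observations before and after a single split. It is a sketch only: the data, the threshold and the function name `misclassified` are all invented, and xgboost itself derives Gain from the reduction in its training loss rather than from raw error counts.

```python
# Toy data: (value of feature X, true class). All values are invented.
observations = [
    (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1), (0.5, 1),
    (0.55, 1), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 0),
]


def misclassified(rows, predicted_class):
    """Count rows whose true class differs from the branch's prediction."""
    return sum(1 for _, label in rows if label != predicted_class)


# Before the split: the single branch predicts its majority class for everyone.
labels = [label for _, label in observations]
majority = max(set(labels), key=labels.count)
errors_before = misclassified(observations, majority)

# After splitting on X: one branch predicts 0, the other predicts 1.
threshold = 0.45  # hypothetical split point
left = [row for row in observations if row[0] < threshold]
right = [row for row in observations if row[0] >= threshold]
errors_after = misclassified(left, 0) + misclassified(right, 1)

print(f"errors before split: {errors_before}, after split: {errors_after}")
```

Here the split on X reduces the number of wrongly classified observations from 4 to 2, and that improvement is what gets credited to feature X.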

Cover measures the relative quantity of observations concerned by a feature.

Frequency is a simpler way to measure the Gain. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
