ID3/C4.5/Gini Index

1 ID3

Select the attribute with the highest information gain.

1.1 Formula

Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
The expected information needed to classify a tuple in D is:
$$Info(D) = -\sum\limits_{i=1}^n p_i log_2 p_i$$
The information needed (after using A to split D into v partitions) to classify D is:
$$Info_A(D) = \sum\limits_{j=1}^v\frac{|D_j|}{|D|}Info(D_j)$$
So, the information gained by branching on attribute A is:
$$Gain(A) = Info(D)-Info_A(D)$$
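As a minimal sketch, the two quantities above can be computed directly from class counts (the function names here are my own, not part of any standard library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a
```

For instance, `entropy(["yes"] * 9 + ["no"] * 5)` yields the 0.940 value used in the worked example below.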

1.2 Example

| age     | income | student | credit_rating | buys_computer |
| <=30    | high   | no      | fair          | no            |
| <=30    | high   | no      | excellent     | no            |
| 31...40 | high   | no      | fair          | yes           |
| >40     | medium | no      | fair          | yes           |
| >40     | low    | yes     | fair          | yes           |
| >40     | low    | yes     | excellent     | no            |
| 31...40 | low    | yes     | excellent     | yes           |
| <=30    | medium | no      | fair          | no            |
| <=30    | low    | yes     | fair          | yes           |
| >40     | medium | yes     | fair          | yes           |
| <=30    | medium | yes     | excellent     | yes           |
| 31...40 | medium | no      | excellent     | yes           |
| 31...40 | high   | yes     | fair          | yes           |
| >40     | medium | no      | excellent     | no            |

Class P: buys_computer = "yes"
Class N: buys_computer = "no"
The number of class P tuples is 9, while the number of class N tuples is 5.

So, $$Info(D) = -\frac{9}{14}log_2\frac{9}{14} - \frac{5}{14}log_2\frac{5}{14} = 0.940$$

For the partition age <= 30, the number of class P tuples is 2, while the number of class N tuples is 3.
So, $$Info(D_{\le 30}) = -\frac{2}{5}log_2\frac{2}{5} - \frac{3}{5}log_2\frac{3}{5} = 0.971$$
Similarly,
$$Info(D_{31...40}) = 0,\quad Info(D_{>40}) = 0.971$$
Then,
$$Info_{age}(D) = \frac{5}{14}Info(D_{\le 30}) + \frac{4}{14}Info(D_{31...40}) + \frac{5}{14}Info(D_{>40}) = 0.694$$
Therefore, $$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
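These gains can be verified mechanically. Below is a self-contained sketch that recomputes all four gains from the 14-tuple table above (function and variable names are my own); the results agree with the figures above to within rounding of intermediate steps:

```python
from collections import Counter
from math import log2

# The 14-tuple training set from the example above:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",    "high",   "no",  "fair",      "no"),
    ("<=30",    "high",   "no",  "excellent", "no"),
    ("31...40", "high",   "no",  "fair",      "yes"),
    (">40",     "medium", "no",  "fair",      "yes"),
    (">40",     "low",    "yes", "fair",      "yes"),
    (">40",     "low",    "yes", "excellent", "no"),
    ("31...40", "low",    "yes", "excellent", "yes"),
    ("<=30",    "medium", "no",  "fair",      "no"),
    ("<=30",    "low",    "yes", "fair",      "yes"),
    (">40",     "medium", "yes", "fair",      "yes"),
    ("<=30",    "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no",  "excellent", "yes"),
    ("31...40", "high",   "yes", "fair",      "yes"),
    (">40",     "medium", "no",  "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(idx):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at column idx."""
    labels = [row[-1] for row in data]
    parts = {}
    for row in data:
        parts.setdefault(row[idx], []).append(row[-1])
    info_a = sum(len(p) / len(data) * entropy(p) for p in parts.values())
    return entropy(labels) - info_a

for i, name in enumerate(attrs):
    print(f"Gain({name}) = {gain(i):.3f}")
# age has the largest gain, so ID3 splits on age first
```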

1.3 Question

What if the attribute's values are continuous? How can we handle them?
1. Find the best split point for A:
+ Sort the values of A in increasing order.
+ Typically, the midpoint between each pair of adjacent values is considered as a possible split point:
- $(a_i + a_{i+1})/2$ is the midpoint between the values $a_i$ and $a_{i+1}$.
+ The point with the minimum expected information requirement for A is selected as the split-point for A.
2. Split:
+ $D_1$ is the set of tuples in D satisfying A <= split-point, and $D_2$ is the set of tuples in D satisfying A > split-point.
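The procedure above can be sketched as follows (a minimal illustration; the function name is my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the midpoint split minimizing Info_A(D) for a continuous A."""
    pairs = sorted(zip(values, labels))            # step 1: sort the values of A
    n = len(pairs)
    best_info, best_mid = float("inf"), None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                               # no midpoint between equal values
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2  # (a_i + a_{i+1}) / 2
        left = [lab for v, lab in pairs if v <= mid]   # D1: A <= split-point
        right = [lab for v, lab in pairs if v > mid]   # D2: A > split-point
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best_info:
            best_info, best_mid = info, mid
    return best_mid
```

For instance, `best_split_point([25, 28, 35, 38, 45], ["no", "no", "yes", "yes", "yes"])` returns 31.5, the midpoint (28+35)/2, which separates the two classes perfectly.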

2 C4.5

C4.5 is a successor of ID3.

2.1 Formula

$$SplitInfo_A(D) = -\sum\limits_{j=1}^v\frac{|D_j|}{|D|}log_2\frac{|D_j|}{|D|}$$
Then the gain ratio is:
$$GainRatio(A) = Gain(A)/SplitInfo_A(D)$$
The attribute with the maximum gain ratio is selected as the splitting attribute.
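As a small sketch, the split information and gain ratio for the income attribute from the example in section 1 (partition sizes 4 high, 6 medium, 4 low, with Gain(income) = 0.029) can be computed like this:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) over the sizes |D_j| of the v partitions."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes if s > 0)

# income partitions in the example table: 4 (high), 6 (medium), 4 (low)
si = split_info([4, 6, 4])
gain_ratio = 0.029 / si
```

This reproduces the usual figures for that example: SplitInfo is about 1.557, giving a gain ratio of about 0.019.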

3 Gini Index

3.1 Formula

If a data set D contains examples from n classes, the gini index gini(D) is defined as
$$gini(D) = 1 - \sum\limits_{j=1}^n p_j^2$$
where $p_j$ is the relative frequency of class j in D.
If data set D is split on attribute A into n partitions, then
$$gini_A(D) = \sum\limits_{i=1}^n\frac{|D_i|}{|D|}gini(D_i)$$
The reduction in impurity is:
$$\Delta gini(A) = gini(D)-gini_A(D)$$
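A minimal sketch of all three formulas, reusing the 14-tuple example from section 1; the binary grouping of income into {low, medium} versus {high} is one possible CART-style binary split, chosen here purely for illustration:

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, with p_j the relative class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(partitions):
    """gini_A(D): weighted gini over the partitions induced by A."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)

labels = ["yes"] * 9 + ["no"] * 5   # the 14-tuple example: gini(D) is about 0.459
d1 = ["yes"] * 7 + ["no"] * 3       # income in {low, medium}: 10 tuples
d2 = ["yes"] * 2 + ["no"] * 2       # income = high: 4 tuples
delta = gini(labels) - gini_split([d1, d2])  # reduction in impurity
```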

4 Summary

ID3 and C4.5 aren't suitable for large training sets, because they have to repeatedly sort and scan the training set many times, which costs more time than other classification algorithms.
The three measures, in general, return good results, but:
1.Information gain:
-biased towards multivalued attributes
2.Gain ratio:
-tends to prefer unbalanced splits in which one partition is much smaller than the other.
3.Gini index:
-biased towards multivalued attributes
-has difficulty when # of classes is large
-tends to favor tests that result in equal-sized partitions and purity in both partitions.

5 Other Attribute Selection Measures

1. CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence.
2. C-SEP: performs better than information gain and gini index in certain cases.
3. G-statistic: has a close approximation to the χ² distribution.
4. MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree.
5. Multivariate splits (partition based on multiple variable combinations): CART finds multivariate splits based on a linear combination of attributes.

Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others.

Author: mlhy

Created: 2015-10-06 二 14:39

Emacs 24.5.1 (Org mode 8.2.10)
