ID3/C4.5/Gini Index
1 ID3
1.1 Idea
Select the attribute with the highest information gain.
1.2 Formula
Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
Expected information needed to classify a tuple in D:
$$Info(D) = -\sum\limits_{i=1}^n{p_i}log_2{p_i}$$
Information needed (after using A to split D into v partitions) to classify D:
$$Info_A(D) = \sum\limits_{j=1}^v\frac{|D_j|}{|D|}Info(D_j)$$
So, information gained by branching on attribute A:
$$Gain(A) = Info(D)-Info_A(D)$$
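The formulas above can be sketched in Python. This is a minimal illustration, not any library's implementation; the function names and the list-based data layout are my own:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), for attribute values paired with class labels."""
    n = len(labels)
    partitions = {}                      # group labels by the value of A
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Info_A(D): weighted entropy of the v partitions induced by A
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a
```

On the buys_computer data from the example below, `entropy` gives 0.940 and `info_gain` on age gives 0.246, matching the worked numbers.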
1.3 Example
age | income | student | credit_rating | buys_computer
---|---|---|---|---
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31…40 | high | no | fair | yes |
>40 | medium | no | fair | yes |
>40 | low | yes | fair | yes |
>40 | low | yes | excellent | no |
31…40 | low | yes | excellent | yes |
<=30 | medium | no | fair | no |
<=30 | low | yes | fair | yes |
>40 | medium | yes | excellent | yes |
<=30 | medium | yes | excellent | yes |
31…40 | medium | no | excellent | yes |
31…40 | high | yes | fair | yes |
>40 | medium | no | excellent | no |
Class P: buys_computer = "yes"
Class N: buys_computer = "no"
The number of class P tuples is 9, while the number of class N tuples is 5.
So, $$Info(D) = -\frac{9}{14}log_2\frac{9}{14} - \frac{5}{14}log_2\frac{5}{14} = 0.940$$
For the age branch "<=30", the number of class P tuples is 2, while the number of class N tuples is 3.
So, $$Info(D_{\le 30}) = -\frac{2}{5}log_2\frac{2}{5} - \frac{3}{5}log_2\frac{3}{5} = 0.971$$
Similarly,
$$Info(D_{31...40}) = 0$$,$$Info(D_{>40}) = 0.971$$
Then, $$Info_{age}(D) = \frac{5}{14}Info(D_{\le 30}) + \frac{4}{14}Info(D_{31…40}) + \frac{5}{14}Info(D_{>40}) = 0.694$$
Therefore, $$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
1.4 Question
What if the attribute's value is continuous? How can we handle it?
1. The best split point for A:
   - Sort the values of A in increasing order.
   - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$.
   - The point with the minimum expected information requirement for A is selected as the split-point for A.
2. Split:
   - $D_1$ is the set of tuples in D satisfying A ≤ split-point, and $D_2$ is the set of tuples in D satisfying A > split-point.
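The split-point search described above can be sketched as follows; this is an illustrative implementation, and the helper names are my own:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Pick the midpoint (a_i + a_{i+1})/2 that minimizes Info_A(D) over D1/D2."""
    pairs = sorted(zip(values, labels))            # sort the values of A
    n = len(pairs)
    best_info, best_point = float("inf"), None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:         # no boundary between equal values
            continue
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        d1 = [y for x, y in pairs if x <= mid]     # D1: A <= split-point
        d2 = [y for x, y in pairs if x > mid]      # D2: A > split-point
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if info < best_info:
            best_info, best_point = info, mid
    return best_point
```

For instance, on values `[1, 2, 3, 10, 11, 12]` with classes `n, n, n, y, y, y`, the chosen split-point is 6.5, the midpoint that separates the two classes perfectly.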
2 C4.5
C4.5 is a successor of ID3.
2.1 Formula
$$SplitInfo_A(D) = -\sum\limits_{j=1}^v\frac{|D_j|}{|D|}log_2\frac{|D_j|}{|D|}$$
Then the gain ratio is
$$GainRatio(A) = Gain(A)/SplitInfo_A(D)$$
The attribute with the maximum gain ratio is selected as the splitting attribute.
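A minimal Python sketch of the gain ratio (the helper names are my own, not C4.5's):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    n = len(labels)
    partitions = {}                      # group labels by the value of A
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_a
    # SplitInfo_A(D) penalizes attributes that split D into many small partitions
    split_info = -sum(len(p) / n * math.log2(len(p) / n)
                      for p in partitions.values())
    return gain / split_info
```

On the buys_computer example, age has $SplitInfo_{age}(D) \approx 1.577$, so its gain ratio is about $0.246/1.577 \approx 0.156$.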
3 Gini Index
3.1 Formula
If a data set D contains examples from n classes, the Gini index gini(D) is defined as
$$gini(D) = 1 - \sum\limits_{j=1}^np_j^2$$
where $p_j$ is the relative frequency of class j in D.
If data set D is split on A into v partitions, then
$$gini_A(D) = \sum\limits_{i=1}^v\frac{|D_i|}{|D|}gini(D_i)$$
Reduction in Impurity
$$\Delta gini(A) = gini(D)-gini_A(D)$$
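These two quantities can be sketched in Python as follows (function names are my own):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels):
    """gini_A(D): weighted Gini index of the partitions induced by attribute A."""
    n = len(labels)
    partitions = {}                      # group labels by the value of A
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum(len(p) / n * gini(p) for p in partitions.values())
```

On the buys_computer data, gini(D) = 1 - (9/14)² - (5/14)² ≈ 0.459; splitting on age gives $gini_{age}(D) \approx 0.343$, so $\Delta gini(age) \approx 0.116$.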
4 Summary
ID3/C4.5 are not suitable for large training sets, because they must repeatedly sort and scan the training set many times, which costs far more time than other classification algorithms.
The three measures, in general, return good results, but:
1.Information gain:
-biased towards multivalued attributes
2.Gain ratio:
-tends to prefer unbalanced splits in which one partition is much smaller than the other.
3.Gini index:
-biased towards multivalued attributes
-has difficulty when # of classes is large
-tends to favor tests that result in equal-sized partitions and purity in both partitions.
5 Other Attribute Selection Measures
1. CHAID: a popular decision tree algorithm; measure based on the χ² test for independence
2. C-SEP: performs better than information gain and Gini index in certain cases
3. G-statistic: has a close approximation to the χ² distribution
4. MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
5. Multivariate splits (partition based on multiple variable combinations): CART finds multivariate splits based on a linear combination of attributes
Which attribute selection measure is the best?
Most give good results; none is significantly superior to the others.