Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain.

Suppose you have a dataset with a single categorical feature, "nationality", with values "UK", "French" and "US". Assume, without loss of generality, that these are encoded as 0, 1 and 2. A linear classifier then has a single weight w for this feature, and it makes some kind of decision based on the constraint w·x + b > 0, or equivalently w·x > −b.

The problem now is that the single weight w cannot encode a three-way choice. The three possible values of w·x are 0, w and 2·w. Either all three lead to the same decision (they are all > −b, or all ≤ −b), or "UK" and "French" lead to the same decision, or "French" and "US" do. There is no way for the model to learn that "UK" and "US" should be given the same label, with "French" the odd one out. (In a binary classification problem, the integer encoding can never place "US" and "UK" alone in one class, while one-hot encoding can accommodate any grouping.)

With one-hot encoding, you effectively blow the feature space up into three features, each of which gets its own weight, so the decision function becomes w[UK]·x[UK] + w[FR]·x[FR] + w[US]·x[US] + b > 0, where all the x's are booleans. In this space, a linear function can express any disjunction of the possibilities (e.g. "UK or US", which might be a predictor for someone speaking English).
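A minimal sketch of the "UK or US" case, with hand-picked weights (the specific values are illustrative assumptions, not learned):

```python
# One weight per one-hot feature; positive weight = votes for "English".
w = {"UK": 1.0, "FR": -1.0, "US": 1.0}
b = 0.0

def predicts_english(onehot):
    # Decision function: sum of w_i * x_i + b > 0
    score = sum(w[k] * v for k, v in onehot.items()) + b
    return score > 0

# One-hot encodings: UK=[1,0,0], French=[0,1,0], US=[0,0,1]
uk = {"UK": 1, "FR": 0, "US": 0}
fr = {"UK": 0, "FR": 1, "US": 0}
us = {"UK": 0, "FR": 0, "US": 1}

print(predicts_english(uk), predicts_english(fr), predicts_english(us))
# True False True -- exactly the grouping the integer encoding cannot express
```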

Similarly, any learner based on standard distance metrics (such as k-nearest neighbors) between samples will get confused without one-hot encoding. With the naive encoding and Euclidean distance, the distance between French and US is 1. The distance between US and UK is 2. But with the one-hot encoding, the pairwise distances between [1, 0, 0], [0, 1, 0] and [0, 0, 1] are all equal to √2.
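The distance claim above is easy to verify numerically; a small sketch:

```python
import math

def euclid(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Naive integer coding: UK=0, French=1, US=2 (distances are arbitrary artifacts)
print(abs(1 - 2))  # French vs US -> 1
print(abs(2 - 0))  # US vs UK     -> 2

# One-hot coding: all pairwise distances are equal, as they should be
uk, fr, us = [1, 0, 0], [0, 1, 0], [0, 0, 1]
print(euclid(uk, fr), euclid(uk, us), euclid(fr, us))  # all sqrt(2) ≈ 1.414
```

With the one-hot coding, no pair of nationalities is spuriously "closer" than any other, so a k-nearest-neighbors classifier treats the three values symmetrically.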

This is not true for all learning algorithms; decision trees and derived models such as random forests, if deep enough, can handle categorical variables without one-hot encoding.

For pandas DataFrames, one-hot encoding is provided by the pandas.get_dummies function.
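A minimal sketch using the nationality example from above (the column name and prefix are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({"nationality": ["UK", "French", "US", "UK"]})

# get_dummies creates one boolean column per category, sorted alphabetically
dummies = pd.get_dummies(df["nationality"], prefix="nat")
print(dummies)
# Columns: nat_French, nat_UK, nat_US; exactly one 1 per row
```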

References:

https://gist.github.com/ramhiser/982ce339d5f8c9a769a0

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.get_dummies.html
