http://scikit-learn.org/stable/modules/feature_extraction.html

Writing this while sick, in an internet cafe. Support appreciated.

1. First, two concepts need to be kept apart: feature extraction and feature selection. Feature extraction is very different from feature selection:

The former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied to these features (that is, picking the better features from those already extracted).

The rest is divided into four parts. The main one is still part 4, text feature extraction.

2. Loading features from dicts

This is done with the class DictVectorizer. One example is enough:

>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
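
The doctest above calls get_feature_names(); in recent scikit-learn releases that method has been replaced by get_feature_names_out(). A minimal script version that should work with either, assuming only that scikit-learn is installed (the hasattr fallback and the variable names are my own, not from the docs), could look like this:

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

# sparse=False returns a dense NumPy array directly, so no .toarray() call is needed.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)

# Newer releases expose get_feature_names_out(); older ones only have get_feature_names().
if hasattr(vec, "get_feature_names_out"):
    feature_names = list(vec.get_feature_names_out())
else:
    feature_names = list(vec.get_feature_names())

# inverse_transform maps each row back to a dict keyed by the generated feature names
# (the one-hot 'city=...' names, not the original 'city' key).
recovered = vec.inverse_transform(X)

print(feature_names)
print(recovered[0])
```

Note that the string-valued 'city' key is one-hot encoded into one column per city, while the numeric 'temperature' value is passed through unchanged.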



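DictVectorizer is also handy for feature windows extracted around a particular word. For example, suppose an existing algorithm has extracted PoS (Part of Speech) features around the word 'sat' in the sentence 'The cat sat on the mat.', as follows: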

>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]

The PoS features above can then be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):

>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
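
The "piped into a TfidfTransformer" remark can be sketched with a pipeline. This is only an illustration: the second window (around the word 'on') and its tags are hypothetical, made up here so that there is more than one sample.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

pos_windows = [
    # window around 'sat', as in the post
    {'word-2': 'the', 'pos-2': 'DT', 'word-1': 'cat',
     'pos-1': 'NN', 'word+1': 'on', 'pos+1': 'PP'},
    # hypothetical window around 'on', added purely for illustration
    {'word-2': 'cat', 'pos-2': 'NN', 'word-1': 'sat',
     'pos-1': 'VBD', 'word+1': 'the', 'pos+1': 'DT'},
]

# DictVectorizer one-hot encodes the string values; TfidfTransformer then
# reweights the columns and L2-normalizes every row (its default norm).
pipeline = make_pipeline(DictVectorizer(), TfidfTransformer())
X = pipeline.fit_transform(pos_windows)

# After normalization each row should have unit Euclidean length.
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(X.shape)
print(row_norms)
```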

3. Feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the "hashing trick".

Because it hashes, it keeps only the integer index of each feature and not the feature's original string name, so there is no inverse_transform method.

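FeatureHasher accepts dicts, (feature, value) pairs, or strings, depending on the constructor parameter input_type. The result is a scipy.sparse matrix. With strings, each occurrence has an implicit value of 1, so ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 2)].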
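
A small sketch of the string-versus-dict behaviour described above. The choices n_features=8 and alternate_sign=False are mine, picked only to keep the output tiny and the counts non-negative; they are not recommended settings.

```python
from sklearn.feature_extraction import FeatureHasher

# n_features=8 is deliberately tiny so the result is easy to print;
# alternate_sign=False keeps the stored counts non-negative.
string_hasher = FeatureHasher(n_features=8, input_type='string',
                              alternate_sign=False)
# Each sample is an iterable of strings; repeated strings are summed.
X_str = string_hasher.transform([['feat1', 'feat2', 'feat2']])

dict_hasher = FeatureHasher(n_features=8, input_type='dict',
                            alternate_sign=False)
# The same sample written with explicit values.
X_dict = dict_hasher.transform([{'feat1': 1, 'feat2': 2}])

print(X_str.toarray())
print(X_dict.toarray())
```

Both calls should produce the same row, since the column index depends only on the hash of the feature name. Which columns are used depends on the hash function, so the exact positions are not something to rely on.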

4. Text feature extraction

There is too much material here, so it is written up separately. See this blog post: http://blog.csdn.net/mmc2015/article/details/46997379

5. Image feature extraction

Patch extraction:

The extract_patches_2d function extracts patches from an image, stored as a two-dimensional array, or three-dimensional with color information along the third axis. To rebuild the original image from all of its patches, use reconstruct_from_patches_2d:

>>> import numpy as np
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...     random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],
<BLANKLINE>
       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])

Reconstruction works as follows:

>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)

The PatchExtractor class works the same way as extract_patches_2d, except that it accepts multiple images as input at once:

>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor(patch_size=(2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)
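
The patch counts above (9 per 4x4 image, 45 for five images) follow from sliding the patch one pixel at a time: an image of height H and width W yields (H - h + 1) * (W - w + 1) patches of size h x w. A quick check on a grayscale array, with sizes chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.feature_extraction import image

height, width = 5, 6      # arbitrary image size for this sketch
patch_h, patch_w = 3, 2   # arbitrary patch size

# A fake grayscale image: a plain 2D array, no color axis.
gray = np.arange(height * width, dtype=float).reshape(height, width)

patches = image.extract_patches_2d(gray, (patch_h, patch_w))

# One patch per valid top-left corner position.
expected_count = (height - patch_h + 1) * (width - patch_w + 1)

# Overlapping patches are averaged back into an image of the original size.
rebuilt = image.reconstruct_from_patches_2d(patches, (height, width))

print(patches.shape, expected_count)
print(np.allclose(rebuilt, gray))
```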

Connectivity graph of an image:

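The idea is to record, as a matrix, which pairs of pixels in the image are connected (neighboring pixels), with the difference between the two pixel values used as the weight of the connection.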

The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given only the shape of these images.
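
A tiny sketch of both functions on a 2x2 image. The pixel values are made up; pixels are numbered in row-major order, so pixel 0 is top-left, 1 is top-right, 2 is bottom-left and 3 is bottom-right.

```python
import numpy as np
from sklearn.feature_extraction import image

# Made-up 2x2 grayscale image.
img = np.array([[0.0, 1.0],
                [5.0, 7.0]])

# grid_to_graph only needs the shape: it marks which pixels are neighbors.
connectivity = image.grid_to_graph(*img.shape).toarray()

# img_to_graph uses the pixel values: edges between neighbors are weighted
# by the absolute difference (gradient) of the two pixel values.
weighted = image.img_to_graph(img).toarray()

print(connectivity)
print(weighted)
```

Pixels 0 and 3 sit on a diagonal, so they are not neighbors and the corresponding entries are zero in both matrices.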

Here is an intuitive example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py

Headache. Off to bed.
