Recently, while writing a feature engineering module, I found two excellent packages -- tsfresh and sklearn.

tsfresh is specialized for time series data. It mainly consists of two modules, feature extraction and feature selection:

 from tsfresh import feature_selection, feature_extraction

To limit the number of irrelevant features, tsfresh deploys the FRESH algorithm. The whole process consists of three steps.

First, the algorithm characterizes each time series with comprehensive and well-established feature mappings. The feature calculators used to derive the features are contained in tsfresh.feature_extraction.feature_calculators.
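As a minimal sketch of this extraction step (the toy long-format frame and its column names below are my own illustration, not data from the original experiment):

import pandas as pd
from tsfresh import extract_features

# toy input: one row per (series id, time step) observation
ts = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 5.0, 4.0, 6.0],
})

# returns one row per id and one column per extracted feature
features = extract_features(ts, column_id="id", column_sort="time")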

In the second step, each extracted feature is individually evaluated with respect to its significance for predicting the target under investigation. These tests are contained in the submodule tsfresh.feature_selection.significance_tests. The result of a significance test is a vector of p-values, quantifying the significance of each feature for predicting the target.
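As far as I know, tsfresh also exposes a convenience wrapper that runs these tests over a whole feature matrix; a hedged sketch, assuming the features frame from the previous step and a hypothetical binary target y aligned by id:

import pandas as pd
from tsfresh.feature_selection.relevance import calculate_relevance_table

# hypothetical binary target, indexed by the same ids as the feature matrix
y = pd.Series([0, 1], index=[1, 2])

# one row per feature, with its p-value and a relevance flag
relevance_table = calculate_relevance_table(features, y)
print(relevance_table[["p_value", "relevant"]])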

Finally, the vector of p-values is evaluated on the basis of the Benjamini-Yekutieli procedure in order to decide which features to keep.
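Steps two and three together are what select_features does; a minimal sketch, again assuming the features frame and target y from the examples above:

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

# replace the NaN/inf values that some feature calculators can produce
impute(features)

# runs the significance tests and the Benjamini-Yekutieli procedure,
# keeping only the features judged relevant for predicting y
selected = select_features(features, y)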

In summary, tsfresh is a scalable and efficient feature engineering tool.

Although tsfresh is powerful, I chose sklearn for this task.

I downloaded the heart disease data set. Its target is binary and it has 13 feature dimensions. I only used MinMaxScaler to transform the continuous columns age, trestbps, chol, and thalach, and trained two models: an ensemble built by AutoSklearnClassifier and a RandomForest. Both models performed poorly.

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier
from numpy import set_printoptions, inf
import pandas as pd

set_printoptions(threshold=inf)

data = pd.read_csv("../data_set/heart.csv")
X = data[data.columns[:data.shape[1] - 1]].values
y = data[data.columns[-1]].values

# scale the continuous columns (age, trestbps, chol, thalach) to [0, 1]
scaled = MinMaxScaler().fit_transform(X[:, [0, 3, 4, 7]])
X[:, [0, 3, 4, 7]] = scaled

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# auto-sklearn ensemble, with no extra preprocessing
model_auto = AutoSklearnClassifier(time_left_for_this_task=120, n_jobs=3,
                                   include_preprocessors=["no_preprocessing"], seed=3)
model_auto.fit(x_train, y_train)
y_pred = model_auto.predict(x_test)
accuracy_score(y_test, y_pred)      # >>> 0.8021978021978022

# plain random forest for comparison
model = RandomForestClassifier(n_estimators=500)
model.fit(x_train, y_train)
y_pred_rf = model.predict(x_test)
accuracy_score(y_test, y_pred_rf)   # >>> 0.8051648351648352

I also uploaded this data set to my personal web site, which provides an AutoML service; it achieved a better score than my code: http://simple-code.cn/
