汇总的函数

方法 说明
count 非NA的值数量
describe 针对Series和DataFrame列计算汇总统计
min、max 计算最小值和最大值
argmin、argmax 计算能够获取到最小值和最大值的索引位置
idxmin、indxmax 计算能够获取到最小值和最大值的索引值
quantile 计算四分位数
sum 值的总和
mean 值的平均数
median 值的算术中位数(第50百分位数)
mad 根据平均值计算平均绝对离差
var 样本值的方差
std 样本值的标准差
skew 样本值的偏度
kurt 样本值的峰度
cumsum 样本值的累计和
cummin、cummax 累计最大值和累计最小值
cumprod 累计积
diff 计算一阶差分(对时间序列很有用)
pct_change 计算百分数变化
import numpy as np
import pandas as pd
from pandas import Series df = pd.DataFrame([[1.4, np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],
index=['a','b','c','d'],
columns = ['one','two']
)
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
# 求每一列和
df.sum() one 9.25
two -5.80
dtype: float64
# 求每一行和
df.sum(axis=1) a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
选项 说明
axis 约简的轴,axis=1代表统计一行的值,默认统计一列的值。
skipna 排除缺失值,默认值为True
level 如果轴是层次化索引的,则根据level分组约简
# 返回最大值最小值的索引
df.idxmin()
one d
two b
dtype: object # 累积列的和
df.cumsum()
### 汇总分析
df.describe() one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000 # 对于非数值型数据,describe会产生另一种汇总统计
obj = Series(['a','a','b','c']*4)
obj.describe() count 16
unique 3
top a
freq 8
dtype: object

相关系数和协方差

# 需要先pip install pandas_datareader
import pandas_datareader.data as web # 远程获取股票的数据
all_data={}
for ticker in ['AAPL','IBM','MSFT']:
all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
all_data {'AAPL': High Low Open Close Volume Adj Close
Date
1999-12-31 3.674107 3.553571 3.604911 3.671875 40952800.0 2.467498
2000-01-03 4.017857 3.631696 3.745536 3.997768 133949200.0 2.686497
2000-01-04 3.950893 3.613839 3.866071 3.660714 128094400.0 2.459998
2000-01-05 3.948661 3.678571 3.705357 3.714286 194580400.0 2.495997
2000-01-06 3.821429 3.392857 3.790179 3.392857 191993200.0 2.279998
2000-01-07 3.607143 3.410714 3.446429 3.553571 115183600.0 2.387998
2000-01-10 3.651786 3.383929 3.642857 3.491071 126266000.0 2.345998
2000-01-11 3.549107 3.232143 3.426339 3.312500 110387200.0 2.225998
2000-01-12 3.410714 3.089286 3.392857 3.113839 244017200.0 2.092498
2000-01-13 3.526786 3.303571 3.374439 3.455357 258171200.0 2.321998
2000-01-14 3.651786 3.549107 3.571429 3.587054 97594000.0 2.410497
2000-01-18 3.785714 3.587054 3.607143 3.712054 114794400.0 2.494497
2000-01-19 3.883929 3.691964 3.772321 3.805804 149410800.0 2.557498
2000-01-20 4.339286 4.053571 4.125000 4.053571 457783200.0 2.723998
2000-01-21 4.080357 3.935268 4.080357 3.975446 123981200.0 2.671498
2000-01-24 4.026786 3.754464 3.872768 3.794643 110219200.0 2.549998
2000-01-25 4.040179 3.656250 3.750000 4.008929 124286400.0 2.693998
2000-01-26 4.078125 3.919643 3.928571 3.935268 91789600.0 2.644498
2000-01-27 4.035714 3.821429 3.886161 3.928571 85036000.0 2.639997
2000-01-28 3.959821 3.593750 3.863839 3.629464 105837200.0 2.438998
2000-01-31 3.709821 3.375000 3.607143 3.705357 175420000.0 2.489997
2000-02-01 3.750000 3.571429 3.714286 3.580357 79508800.0 2.405998
2000-02-02 3.647321 3.464286 3.598214 3.529018 116048800.0 2.371498
2000-02-03 3.723214 3.580357 3.582589 3.689732 118798400.0 2.479497
2000-02-04 3.928571 3.700893 3.712054 3.857143 106330000.0 2.591997
2000-02-07 4.080357 3.783482 3.857143 4.073661 110266800.0 2.737498
2000-02-08 4.147321 3.973214 4.071429 4.102679 102160800.0 2.756998
2000-02-09 4.183036 4.015625 4.075893 4.022321 74841200.0 2.702997
2000-02-10 4.066964 3.928571 4.031250 4.053571 75745600.0 2.723998
2000-02-11 4.075893 3.866071 4.058036 3.883929 53062800.0 2.609998
... ... ... ... ... ... ...
2009-11-18 29.571428 29.142857 29.505714 29.422857 93580200.0 19.772139
2009-11-19 29.230000 28.542856 29.230000 28.644285 135581600.0 19.248940
2009-11-20 28.627142 28.251429 28.450001 28.559999 101666600.0 19.192299
2009-11-23 29.428572 28.992857 29.000000 29.411428 118724200.0 19.764460
2009-11-24 29.411428 28.985714 29.332857 29.205715 79609600.0 19.626223
2009-11-25 29.378571 29.108572 29.342857 29.170000 71613500.0 19.602222
2009-11-27 28.994286 28.338572 28.459999 28.655714 73814300.0 19.256622
2009-11-30 28.811428 28.395714 28.730000 28.558571 106214500.0 19.191339
2009-12-01 28.967142 28.118572 28.891428 28.138571 116440800.0 18.909100
2009-12-02 28.774286 27.964285 28.422857 28.032858 178815000.0 18.838060
2009-12-03 28.425714 28.038572 28.202858 28.068571 112179900.0 18.862062
2009-12-04 28.554285 27.182858 28.528572 27.617144 206721200.0 18.558701
2009-12-07 27.681429 26.954287 27.617144 26.992857 178689700.0 18.139185
2009-12-08 27.478571 26.957144 27.051428 27.124287 172599700.0 18.227499
2009-12-09 28.308571 27.187143 27.325714 28.257143 171195500.0 18.988783
2009-12-10 28.528572 28.017143 28.500000 28.061428 122417400.0 18.857262
2009-12-11 28.285715 27.632856 28.254286 27.809999 107443700.0 18.688299
2009-12-14 28.204287 27.508572 27.910000 28.139999 123947600.0 18.910061
2009-12-15 28.215714 27.610001 27.975714 27.738571 104864900.0 18.640306
2009-12-16 28.071428 27.792856 27.871429 27.861429 88246200.0 18.722862
2009-12-17 27.857143 27.285715 27.751429 27.408571 97209700.0 18.418543
2009-12-18 27.928572 27.514286 27.595715 27.918571 152192600.0 18.761261
2009-12-21 28.535715 27.952858 28.007143 28.318571 152976600.0 19.030060
2009-12-22 28.692858 28.379999 28.491428 28.622858 87378900.0 19.234541
2009-12-23 28.911428 28.687143 28.742857 28.871429 86381400.0 19.401581
2009-12-24 29.907143 29.049999 29.078571 29.862858 125222300.0 20.067825
2009-12-28 30.564285 29.944286 30.245714 30.230000 161141400.0 20.314537
2009-12-29 30.388571 29.818571 30.375713 29.871429 111301400.0 20.073582
2009-12-30 30.285715 29.758572 29.832857 30.234285 103021100.0 20.317423
2009-12-31 30.478571 30.080000 30.447144 30.104286 88102700.0 20.230061 [2516 rows x 6 columns], price = pd.DataFrame({tic:data['Adj Close'] for tic, data in all_data.items()})
volume = pd.DataFrame({tic:data['Volume'] for tic, data in all_data.items()}) # 计算价格的百分位数变化
returns = price.pct_change()
# 结尾取5个,取局部的
returns.tail() AAPL IBM MSFT
Date
2009-12-24 0.034340 0.004385 0.002587
2009-12-28 0.012294 0.013326 0.005484
2009-12-29 -0.011861 -0.003477 0.007058
2009-12-30 0.012147 0.005461 -0.013698
2009-12-31 -0.004300 -0.012597 -0.015504

Series的corr方法用于计算两个Serires重叠的、非NA的、按索引对齐的值的相关系数。COV用于计算协方差。

# 计算MSFT和IBM的相关系数
returns.MSFT.corr(returns.IBM)
0.49253706924390156 # 计算MSFT和IBM的协方差
returns.MSFT.cov(returns.IBM)
0.00021557776646297303 DataFrame的corr和cov方法,可以统计出任意两者之间的相关系数和协方差
returns.corr() AAPL IBM MSFT
AAPL 1.000000 0.412392 0.422852
IBM 0.412392 1.000000 0.492537
MSFT 0.422852 0.492537 1.000000 # 返回单列的IBM与另外三者之间的相关系数
returns.corrwith(returns.IBM)
AAPL 0.412392
IBM 1.000000
MSFT 0.492537
dtype: float64 # 百分比变化和成交量的相关系数 returns.corrwith(volume)
AAPL -0.057665
IBM -0.006592
MSFT -0.016101
dtype: float64 returns.cov() AAPL IBM MSFT
AAPL 0.001030 0.000254 0.000309
IBM 0.000254 0.000369 0.000216
MSFT 0.000309 0.000216 0.000519

唯一值(unique),返回一个Series,其索引为唯一值,其值为频率

obj = Series(['c','a','d','a','a','b','b','c','c'])
uniques = obj.unique()
uniques array(['c', 'a', 'd', 'b'], dtype=object) # 对返回的结果可以进行排序
uniques.sort()
uniques array(['a', 'b', 'c', 'd'], dtype=object) # 统计值出现的次数,为了方便查看按值频率降序排序
obj.value_counts() a 3
c 3
b 2
d 1
dtype: int64 # pandas中需要将一维数组传入进去
pd.value_counts(obj.values) a 3
c 3
b 2
d 1
dtype: int64

成员资格isin 计算一个表示"是否在数组中"布尔型数组

mask = obj.isin(['b','c'])
# 第一步先选出是否在,返回bool
mask 0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool # 选择真值
obj[mask] 0 c
5 b
6 b
7 c
8 c
dtype: object data = pd.DataFrame({'Qu1':[100,300,400,300,400],
'Qu2':[200,300,100,200,300],
'Qu3':[100,500,200,400,400]
})
data Qu1 Qu2 Qu3
0 100 200 100
1 300 300 500
2 400 100 200
3 300 200 400
4 400 300 400 p=data.apply(pd.value_counts).fillna(0)
p Qu1 Qu2 Qu3
100 1.0 1.0 1.0
200 0.0 2.0 1.0
300 2.0 2.0 0.0
400 2.0 0.0 2.0
500 0.0 0.0 1.0

处理缺失数据

NA处理方法

方法 说明
dropna 根据各标签的值中是否存在缺失数据对轴标签进行过滤,可通过阈值调节对缺失值的容忍度
fillna 用指定值或插值方法填充缺失数据
isnull 返回一个含有布尔值的对象,这些布尔值表示哪些值是缺失值
notnull isnull的否定式

过滤掉

from numpy import nan as NA
data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data 0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0 # 默认会把所有的有NA的行删除
data.dropna()
0 1 2
0 1.0 6.5 3.0 # 传入how='all'将只丢弃全为NA的行
data.dropna(how='all') 0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0 data[4]=NA
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN # axis=1 删除全部为NA的列
data.dropna(axis=1,how='all') 0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0 df = pd.DataFrame(np.random.randn(7,5))
df.ix[:4,1] = NA
df.ix[:2,2] = NA 0 1 2 3 4
0 0.758567 NaN NaN -0.064858 -0.385678
1 -0.275586 NaN NaN -0.184934 -0.253343
2 -1.872585 NaN NaN -1.539924 0.794054
3 1.092201 NaN 0.250026 -0.654255 -0.016992
4 0.625871 NaN -1.418505 1.141008 3.188408
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.412723 -0.405043 -0.417018 -1.114318 #这里的thresh函数是选取最少non-NA值个数的行选出来
df.dropna(thresh=5) 0 1 2 3 4
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.412723 -0.405043 -0.417018 -1.114318 #### 填充缺失数据,如果不想过滤的话
df.fillna(0)
# 字典调用fillna,对不同的列填充不同的值
df.fillna({1:'第一列',2:'第二列'}) 0 1 2 3 4
0 0.758567 第一列 第二列 -0.064858 -0.385678
1 -0.275586 第一列 第二列 -0.184934 -0.253343
2 -1.872585 第一列 第二列 -1.539924 0.794054
3 1.092201 第一列 0.250026 -0.654255 -0.016992
4 0.625871 第一列 -1.41851 1.141008 3.188408
5 -0.714581 0.423811 -0.799328 -1.010573 -0.959299
6 0.887836 1.41272 -0.405043 -0.417018 -1.114318 # 上面的都是返回值为新的对象,如果直接向对原数据修改,inplace = True
df.fillna({1:'第一列',2:'第二列'},inplace=True) # 向上,向下填充(bfill,ffill)
df = pd.DataFrame(np.random.randn(6,3)) 0 1 2
0 2.040458 -2.276695 -1.038916
1 0.427078 NaN 0.001678
2 1.798042 NaN -0.839205
3 -0.433214 -0.312427 NaN
4 0.041802 1.356883 NaN
5 -0.904390 -1.030643 1.507598 df.fillna(method='ffill',limit=1) 0 1 2
0 2.040458 -2.276695 -1.038916
1 0.427078 -2.276695 0.001678
2 1.798042 NaN -0.839205
3 -0.433214 -0.312427 -0.839205
4 0.041802 1.356883 NaN
5 -0.904390 -1.030643 1.507598 #还可以传入Series的平均值或中位数
data = Series([1,NA,3.5])
data.fillna(data.mean()) 0 1.00
1 2.25
2 3.50
dtype: float64

最新文章

  1. Oracle备库TNS连接失败的分析
  2. NOIP2013火柴排队[逆序对]
  3. MySQL replace into 使用详解 及 注意事项
  4. 使用Python结合Face++ API识别人脸
  5. Javascript对象创建
  6. Servlet/JSP-03 HttpServlet
  7. DLL项目报错:fatal error lnk1104: cannot open file "...\xxx.dll"
  8. 【avalon】data
  9. Python核心编程--学习笔记--5--数字
  10. 初步窥探Git
  11. No modifications are allowed to a locked ParameterMap
  12. JS异步阻塞的迷思
  13. cf B Three matrices
  14. Xcode7真机测试
  15. 获取xml文件
  16. ToolBox Analysis & Design
  17. python爬虫-抓取acg12动漫壁纸排行设置为桌面壁纸
  18. Orleans例子源码
  19. 使用Eclipse来操作HDFS的文件
  20. zookeeper名字服务

热门文章

  1. RecyclerView通用适配器
  2. Maven项目的生命周期
  3. 使用MagicAJax的AjaxPanel时有时会弹出"Using the AjaxCallHelper write methods outside of an AjaxCall is not allowed."
  4. mongo admin 客户端管理工具安装
  5. MySQL5.7.20 二进制包无ROOT权限下安装, 滴滴云服务器
  6. sonar自定义规则
  7. 采用EntityFramework.Extended 对EF进行扩展
  8. JAVA_Package
  9. CAS单点登录原理解析
  10. [Shell]Shell脚本的执行方式