Note: this series of articles is my study notes on the book *Python for Data Analysis*, organized so that I can review and consolidate the material later.

1 pandas parsing functions for reading files

read_csv Reads delimited data; the default delimiter is a comma

read_table Reads delimited data; the default delimiter is "\t"

read_fwf Reads data in fixed-width column format (no delimiters)

read_clipboard Reads data from the clipboard (handy when converting a web page into a table)
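As a quick comparison of the default delimiters, here is a minimal sketch using in-memory strings (StringIO stands in for a file on disk):

```python
import pandas as pd
from io import StringIO

# read_csv splits on commas by default
csv_df = pd.read_csv(StringIO("a,b,c\n1,2,3\n4,5,6"))

# read_table splits on tabs by default
tsv_df = pd.read_table(StringIO("a\tb\tc\n1\t2\t3"))

print(csv_df.shape)  # (2, 3)
print(tsv_df.shape)  # (1, 3)
```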

1.1 Reading Excel data

import pandas as pd
file = 'D:\\example.xls'  # double the backslash (or use a raw string) in Windows paths
df = pd.read_excel(file)  # don't name the result pd, or it shadows the pandas module
df

Output:

1.1.1 Reading without a header row

df = pd.read_excel(file, header=None)

Output:

1.1.2 Setting the column names

df = pd.read_excel(file, names=['Year', 'Name', 'Math', 'Chinese', 'English', 'Avg'])

Output:

1.1.3 Specifying an index column

df = pd.read_excel(file, index_col='姓名')

Output:
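The header, names, and index_col parameters behave the same way in read_csv, so their effect can be checked with a small in-memory example (the data below is made up for illustration):

```python
import pandas as pd
from io import StringIO

text = "2017,Tom,90,85,88,87.7\n2017,Amy,95,80,92,89.0"
cols = ['Year', 'Name', 'Math', 'Chinese', 'English', 'Avg']

# header=None: treat the first line as data and auto-number the columns
df1 = pd.read_csv(StringIO(text), header=None)
print(list(df1.columns))  # [0, 1, 2, 3, 4, 5]

# names=...: supply the column names yourself
df2 = pd.read_csv(StringIO(text), header=None, names=cols)

# index_col=...: use one of the columns as the row index
df3 = pd.read_csv(StringIO(text), header=None, names=cols, index_col='Name')
print(df3.index.tolist())  # ['Tom', 'Amy']
```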

2 Reading CSV data

import pandas as pd
df = pd.read_csv("d:\\test.csv", engine='python')
df

Output:

import pandas as pd
# with the default '\t' delimiter, each line of a CSV file ends up in a single column
df = pd.read_table("d:\\test.csv", engine='python')
df

Output:

import pandas as pd
# read_fwf treats the file as fixed-width columns and chooses its own parser engine
df = pd.read_fwf("d:\\test.csv")
df

Output:

3 Writing data out to text format

Writing data out to CSV format; the default delimiter is a comma.

import pandas as pd
df = pd.read_csv("d:\\test.csv", engine='python')
df.to_csv("d:\\test1.csv", encoding='gbk')

Output:
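A minimal round trip through to_csv and read_csv (writing to a temporary file rather than d:\\test1.csv) looks like:

```python
import os
import tempfile
import pandas as pd

# a small frame to write out; the comma delimiter is the default
df = pd.DataFrame({'a': [1, 2], 'b': [3.0, None]})
path = os.path.join(tempfile.gettempdir(), 'test1.csv')
df.to_csv(path, index=False)  # index=False drops the row-index column
back = pd.read_csv(path)
print(back['a'].tolist())  # [1, 2]
```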

4 Handling delimited formats manually

For files with a single-character delimiter, the built-in csv module can be used directly.

import pandas as pd
import csv

file = 'D:\\test.csv'
df = pd.read_csv(file, engine='python')
df.to_csv("d:\\test1.csv", encoding='gbk', sep='/')
with open("d:\\test1.csv") as f:
    reader = csv.reader(f, delimiter='/')  # match the '/' used when writing
    for line in reader:
        print(line)

Output:
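When the file uses a non-comma delimiter, csv.reader must be told about it explicitly; a small in-memory sketch:

```python
import csv
from io import StringIO

# a '/'-separated "file"; csv.reader defaults to commas, so pass delimiter='/'
text = "a/b/c\n1/2/3\n"
reader = csv.reader(StringIO(text), delimiter='/')
rows = list(reader)
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```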

4.1 Filling missing values on write

import pandas as pd
import csv

file = 'D:\\test.csv'
df = pd.read_csv(file, engine='python')
df.to_csv("d:\\test1.csv", encoding='gbk', sep='/', na_rep='NULL')
with open("d:\\test1.csv") as f:
    reader = csv.reader(f, delimiter='/')
    for line in reader:
        print(line)

Output:
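The effect of na_rep can be seen without touching the disk, since to_csv returns the CSV text directly when no path is given:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, None]})
# missing values are written as 'NULL' instead of being left empty
out = df.to_csv(na_rep='NULL', index=False)
print(out.splitlines())  # ['a', '1.0', 'NULL']
```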

4.2 JSON

4.2.1 json.loads converts a JSON string into a Python object

import json

obj = """{
  "sucess" : "1",
  "header" : {
    "version" : 0,
    "compress" : false,
    "times" : 0
  },
  "data" : {
    "name" : "BankForQuotaTerrace",
    "attributes" : {
      "queryfound" : "1",
      "numfound" : "1",
      "reffound" : "1"
    },
    "columnmeta" : {
      "a0" : "DATE",
      "a1" : "DOUBLE",
      "a2" : "DOUBLE",
      "a3" : "DOUBLE",
      "a4" : "DOUBLE",
      "a5" : "DOUBLE",
      "a6" : "DATE",
      "a7" : "DOUBLE",
      "a8" : "DOUBLE",
      "a9" : "DOUBLE",
      "b0" : "DOUBLE",
      "b1" : "DOUBLE",
      "b2" : "DOUBLE",
      "b3" : "DOUBLE",
      "b4" : "DOUBLE",
      "b5" : "DOUBLE"
    },
    "rows" : [ [ "2017-10-28", 109.8408691012081, 109.85566362201733, 0.014794520809225841, 1.0, null, "", 5.636678251676443, 5.580869556115291, 37.846934105222246, null, null, null, null, null, 0.061309012867495856 ] ]
  }
}
"""
result = json.loads(obj)
result

Output:

4.2.2 json.dumps converts a Python object into a JSON string

result = json.loads(obj)
asjson = json.dumps(result)
asjson

Output:
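A quick round-trip check that dumps and loads are inverses of each other:

```python
import json

py_obj = {'sucess': '1', 'times': 0, 'compress': False}
as_json = json.dumps(py_obj)      # Python dict -> JSON string
round_trip = json.loads(as_json)  # JSON string -> Python dict
print(round_trip == py_obj)  # True
```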

4.2.3 Converting JSON data to a DataFrame

Reusing the obj string from section 4.2.1 (there is no need to repeat it here):

import json
from pandas import DataFrame

result = json.loads(obj)
# pass the index as a list; a set has no guaranteed order
jsondf = DataFrame(result['data'], columns=['name', 'attributes', 'columnmeta'], index=[1, 2, 3])
jsondf

Output:

Note: attributes and columnmeta contain nested structures; I will come back to this problem later.
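As a small preview of handling that nesting, pandas' json_normalize can flatten nested dicts into dotted column names (the record below is a made-up fragment of the data above):

```python
import pandas as pd

record = {'name': 'BankForQuotaTerrace',
          'attributes': {'queryfound': '1', 'numfound': '1'}}
flat = pd.json_normalize(record)  # one row; nested keys become 'attributes.xxx'
print(sorted(flat.columns))  # ['attributes.numfound', 'attributes.queryfound', 'name']
```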

4.3 XML and HTML

Scrape the list data from a 同花顺 (10jqka) page and convert it into a DataFrame.

I did not scrape the paginated data here; feel free to try that yourself. My main goal was to try converting scraped data into a DataFrame.

The code is as follows:

import requests
import pandas as pd
from pandas import DataFrame
from bs4 import BeautifulSoup

url = 'http://data.10jqka.com.cn/market/longhu/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
s = soup.find_all('div', 'yyb')

# collect the column names for the DataFrame from the table header
def getcol():
    col = []
    for i in s:
        for thead in i.find_all('thead'):
            for th in thead.find_all('th'):
                col.append(th.text.strip('\n'))
    return col

# collect the row values for the DataFrame from the table body
def getvalues():
    rows = []
    for j in s:
        for tbody in j.find_all('tbody'):
            for tr in tbody.find_all('tr'):
                rows.append([td.text for td in tr.find_all('td')])
    return rows

if __name__ == "__main__":
    cols = getcol()
    values = getvalues()
    data = DataFrame(values, columns=cols)
    print(data)

Output:

4.4 Binary data formats

Older pandas versions provided a save method on pandas objects and a load function to read them back into Python; in current versions these are to_pickle and read_pickle.
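A minimal sketch of the modern pickle-based round trip, using a temporary file:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
path = os.path.join(tempfile.gettempdir(), 'frame.pkl')
df.to_pickle(path)           # current replacement for the old save method
back = pd.read_pickle(path)  # current replacement for the old load function
print(back.equals(df))  # True
```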

4.5 The HDF5 format

HDF stands for hierarchical data format. An HDF5 file contains a file-system-like node structure that supports multiple datasets and metadata, and it can be read and written efficiently in chunks. Python has two HDF5 interfaces: PyTables and h5py.

It is worth considering for massive datasets; I have no use for it right now, so I will not dig into it yet.

4.6 Using HTML and Web APIs

import requests
import json
from pandas import DataFrame

url = 'http://t.weather.sojson.com/api/weather/city/101030100'
resp = requests.get(url)
data = json.loads(resp.text)  # data is a dict
# build a one-row DataFrame from the cityInfo sub-dict
jsondf = DataFrame(data['cityInfo'], columns=['city', 'cityId', 'parent', 'updateTime'], index=[1])
jsondf

Output:
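Since the DataFrame construction matters more than the live request, the same step can be exercised offline with a stand-in dict (the values below are hypothetical):

```python
from pandas import DataFrame

# a stand-in for json.loads(resp.text); the payload values are made up
data = {'cityInfo': {'city': '天津市', 'cityId': '101030100',
                     'parent': '天津', 'updateTime': '09:02'}}
jsondf = DataFrame(data['cityInfo'], columns=['city', 'cityId', 'parent', 'updateTime'], index=[1])
print(jsondf.shape)  # (1, 4)
```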

4.7 Using databases

4.7.1 sqlite3

import sqlite3
import pandas as pd

con = sqlite3.connect('test.db')  # connect() needs a database path
pd.read_sql('select * from test', con)  # con is a connection object; read_frame is the obsolete name for read_sql
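A fully self-contained variant using an in-memory SQLite database:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')  # throwaway in-memory database for the demo
con.execute('create table test (a integer, b text)')
con.execute("insert into test values (1, 'x'), (2, 'y')")
df = pd.read_sql('select * from test', con)
print(df.shape)  # (2, 2)
```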

4.7.2 MongoDB

Not installed yet; setting this aside for now.
