• 需求:

    • 导入文件,查看原始数据
    • 将人口数据和各州简称数据进行合并
    • 将合并的数据中重复的abbreviation列进行删除
    • 查看存在缺失数据的列
    • 找到有哪些state/region使得state的值为NaN,进行去重操作
    • 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
    • 合并各州面积数据areas
    • 我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
    • 去除含有缺失数据的行
    • 找出2010年的全民人口数据
    • 计算各州的人口密度
    • 排序,并找出人口密度最高的五个州 df.sort_values()
import numpy as np
from pandas import DataFrame,Series
import pandas as pd
abb = pd.read_csv('./data/state-abbrevs.csv')
abb.head(2)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state abbreviation
0 Alabama AL
1 Alaska AK
pop = pd.read_csv('./data/state-population.csv')
pop.head(2)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
area = pd.read_csv('./data/state-areas.csv')
area.head(2)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
# 将人口数据和各州简称数据进行合并
abb_pop = pd.merge(abb,pop,left_on='abbreviation',right_on='state/region',how='outer')
abb_pop.head(2)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state abbreviation state/region ages year population
0 Alabama AL AL under18 2012 1117489.0
1 Alabama AL AL total 2012 4817528.0
# 将合并的数据中重复的abbreviation列进行删除
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
abb_pop.head(2)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state state/region ages year population
0 Alabama AL under18 2012 1117489.0
1 Alabama AL total 2012 4817528.0
# 查看存在缺失数据的列
abb_pop.isnull().any(axis=0)
state            True
state/region False
ages False
year False
population True
dtype: bool
# 找到有哪些state/region使得state的值为NaN,进行去重操作
# 1.state列中哪些值为空
abb_pop['state'].isnull()
0       False
1 False
2 False
...
2542 True
2543 True
Name: state, Length: 2544, dtype: bool
# 2.可以将step1中空对应的行数据取出(state中的空值对应的行数据)
abb_pop.loc[abb_pop['state'].isnull()]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state state/region ages year population
2448 NaN PR under18 1990 NaN
... ... ... ... ... ...
2543 NaN USA total 2012 313873685.0

96 rows × 5 columns

# 3.将对应的行数据中指定的简称列取出
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()
array(['PR', 'USA'], dtype=object)
# 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
# 1.先将USA对应的state列中的空值定位到
abb_pop['state/region'] == 'USA'
0       False
1 False
2 False
3 False
...
2541 True
2542 True
2543 True
Name: state/region, Length: 2544, dtype: bool
# 2,将布尔值作为原数据的行索引,取出USA简称对应的行数据
abb_pop.loc[abb_pop['state/region'] == 'USA']

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state state/region ages year population
2496 NaN USA under18 1990 64218512.0
... ... ... ... ... ...
2542 NaN USA under18 2012 73708179.0
2543 NaN USA total 2012 313873685.0
# 3.获取符合要求行数据的行索引
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
# 4.将indexs这些行中的state列的值批量赋值成united states
abb_pop.loc[indexs,'state'] = 'United Status'
# 将PR对应的state列中的空批量赋值成 PUERTO RICO
abb_pop['state/region'] == 'PR'
abb_pop.loc[abb_pop['state/region'] == 'PR']
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
abb_pop.loc[indexs,'state'] = 'PUERTO RICO'
# 合并各州面积数据areas
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head(3)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state state/region ages year population area (sq. mi)
0 Alabama AL under18 2012.0 1117489.0 52423.0
1 Alabama AL total 2012.0 4817528.0 52423.0
2 Alabama AL under18 2010.0 1130966.0 52423.0
# 我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
abb_pop_area['area (sq. mi)'].isnull()
# 将空值对应的行数据取出
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs
Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458,
2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469,
2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480,
2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491,
2492, 2493, 2494, 2495, 2496, 2497, 2498, 2499, 2500, 2501, 2502,
2503, 2504, 2505, 2506, 2507, 2508, 2509, 2510, 2511, 2512, 2513,
2514, 2515, 2516, 2517, 2518, 2519, 2520, 2521, 2522, 2523, 2524,
2525, 2526, 2527, 2528, 2529, 2530, 2531, 2532, 2533, 2534, 2535,
2536, 2537, 2538, 2539, 2540, 2541, 2542, 2543],
dtype='int64')
# 去除含有缺失数据的行
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)
# 找出2010年的全民人口数据    条件查询
abb_pop_area.query('year == 2010 & ages == "total"')

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}

...
...

state state/region ages year population area (sq. mi)
3 Alabama AL total 2010.0 4785570.0 52423.0
91 Alaska AK total 2010.0 713868.0 656425.0
101 Arizona AZ total 2010.0 6408790.0 114006.0
189 Arkansas AR total 2010.0 2922280.0 53182.0
197 California CA total 2010.0 37333601.0 163707.0
2405 Wyoming WY total 2010.0 564222.0 97818.0
# 计算各州的人口密度
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head(2)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state state/region ages year population area (sq. mi) midu
0 Alabama AL under18 2012.0 1117489.0 52423.0 21.316769
1 Alabama AL total 2012.0 4817528.0 52423.0 91.897221
# 排序,并找出人口密度最高的五个州   df.sort_values()
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).head(5)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state state/region ages year population area (sq. mi) midu
391 District of Columbia DC total 2013.0 646449.0 68.0 9506.602941
385 District of Columbia DC total 2012.0 633427.0 68.0 9315.102941
387 District of Columbia DC total 2011.0 619624.0 68.0 9112.117647
431 District of Columbia DC total 1990.0 605321.0 68.0 8901.779412
389 District of Columbia DC total 2010.0 605125.0 68.0 8898.897059
abb_pop_area.groupby(by='state')['area (sq. mi)'].max().sort_values(ascending=False).head(5)
state
Alaska 656425.0
Texas 268601.0
California 163707.0
Montana 147046.0
New Mexico 121593.0
Name: area (sq. mi), dtype: float64

最新文章

  1. linux终端 字符界面 显示乱码
  2. .Net JIT
  3. 委托 与 Lambda
  4. 使用vim在Linux下编写C语言程序
  5. grid列的值格式化
  6. PHP工厂模式的研究
  7. 最长递增子序列 O(NlogN)算法
  8. jython学习笔记3
  9. C#计算时间差值
  10. k近邻法的C++实现:kd树
  11. setcookie,getcookie,delcookie,setpostBgPic
  12. SQL Server判断对象是否存在 (if exists (select * from sysobjects )(转)
  13. Swift学习之UI开发初探
  14. jQuery复习:第四章
  15. 怎么配置Jupyter Notebook默认启动目录?
  16. 通俗大白话来理解TCP协议的三次握手和四次断开
  17. jQuery学习- 表单事件
  18. java微信开发之接受消息回复图片或者文本
  19. Python学习---Django的基础操作180116
  20. 后台判断ajax请求的请求后字段

热门文章

  1. vue.js 笔记
  2. Jquery异步上传文件
  3. hdu 6035:Colorful Tree (2017 多校第一场 1003) 【树形dp】
  4. php接受post传值的方法
  5. CSS中浮动属性float及清除浮动
  6. [CSP-S模拟测试]:Simple(数学)
  7. 一个关于 ie 浏览器的 bug 解决过程和思考
  8. 生产环境下,oracle不同用户间的数据迁移。第三部分
  9. HDU 1003 解题报告
  10. Jsoup获取DOM元素