Web Scraping with R, Part 1
1. Why R?
#1. R is a software environment with a primarily statistical focus.
#2. It produces impressive visualizations.
#3. It may offer a complete set of operational procedures, so the whole workflow can stay in one environment.
2. About the basics
We need to prepare ourselves with some basic knowledge of HTML, XML, the logic of regular expressions, and XPath, but all of the operations are executed from within R.
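As a minimal sketch of what "executed from within R" looks like, the snippet below runs an XPath query and a regular expression on a small HTML string. It assumes the XML package is installed; the HTML fragment and the names html_text, doc, links, and dated_links are made up for illustration.

```r
library(XML)

# A tiny, made-up HTML page held in a string (no network access needed)
html_text <- '<html><body>
  <a href="/boxoffice/20161004">Daily box office</a>
  <a href="/about">About</a>
</body></html>'

# Parse the string; asText = TRUE marks it as content rather than a file or URL
doc <- htmlParse(html_text, asText = TRUE)

# XPath: collect the href attribute of every <a> element
links <- xpathSApply(doc, "//a", xmlGetAttr, "href")

# Regular expression: keep only links that end in an 8-digit date
dated_links <- grep("[0-9]{8}$", links, value = TRUE)
print(dated_links)
```

XPath picks nodes by their position in the document tree, while the regular expression filters the extracted strings by their textual pattern.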
3. Recommendation
http://www.r-datacollection.com
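4. A small case study
# Scrape movie box-office information
library(XML)      # provides htmlParse() and readHTMLTable()
library(stringr)  # string helpers; not used in this snippet
library(maps)     # not used in this snippet
# htmlParse() interprets the HTML and creates a parsed document object
movie_parsed <- htmlParse("http://58921.com/boxoffice/wangpiao/20161004",
                          encoding = "UTF-8")
# Next step: extract the tables/data
# readHTMLTable() identifies the tables in the document and reads them out
tables <- readHTMLTable(movie_parsed, stringsAsFactors = FALSE)
# Check what kind of object came back
is.matrix(tables)
is.character(tables)
is.data.frame(tables)
is.list(tables)
# So the result is a list, with one element per table found on the page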
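Since the live page may change or be unavailable, here is a self-contained sketch of the same two steps on a made-up HTML string. The table contents and the names html_text, first_table, and total_tickets are hypothetical; only htmlParse() and readHTMLTable() come from the case study above.

```r
library(XML)

# A made-up page with one small table (hypothetical data, not from the real site)
html_text <- '<html><body>
<table id="boxoffice">
  <tr><th>film</th><th>tickets</th></tr>
  <tr><td>Film A</td><td>1200</td></tr>
  <tr><td>Film B</td><td>850</td></tr>
</table>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# The result is a list; pick out the first table as a data frame
first_table <- tables[[1]]
str(first_table)

# Convert the second column to numbers explicitly before doing arithmetic
tickets <- as.numeric(first_table[[2]])
total_tickets <- sum(tickets)
print(total_tickets)
```

Indexing with `[[1]]` is what turns the list returned by readHTMLTable() into a data frame you can work with.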
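R's support for Chinese text is not very good, so running into some garbled Chinese characters is normal, and we need more advanced text manipulation tools. (In this example, some columns are lost completely because the website serves the data in those columns as .png images.)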
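One common cause of garbled Chinese text is a mismatch between the encoding a page was written in and the encoding R assumes. The sketch below uses base R's iconv() to illustrate the idea. It assumes the platform's iconv implementation supports GBK, which is common but not guaranteed; the sample string is made up.

```r
# "box office" in Chinese, written with Unicode escapes so the script stays ASCII
utf8_text <- "\u7968\u623f"

# Simulate text that arrived in GBK, an encoding often used by Chinese websites
gbk_text <- iconv(utf8_text, from = "UTF-8", to = "GBK")

# Read as UTF-8, these bytes are not valid text, which is how mojibake arises
print(validUTF8(gbk_text))

# Declaring the true source encoding converts the bytes back to readable UTF-8
fixed_text <- iconv(gbk_text, from = "GBK", to = "UTF-8")
print(fixed_text == utf8_text)
```

If iconv() returns NA here, the platform does not know the GBK name; iconvlist() shows which encodings are available.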
5. The ABCs of...
When we browse the Web, a hidden standard behind the scenes structures how information is displayed.
#HTML, the Hypertext Markup Language
HTML is not a dedicated data storage format, but it usually contains the information we are after. In general, HTML is used to shape how information is displayed.
#XML, the Extensible Markup Language
The main purpose of XML is to store data. HTML documents are interpreted and transformed into pretty-looking output by browsers, whereas XML is "just" data wrapped in user-defined tags. The user-defined tags make XML much more flexible for storing data than HTML. Both HTML and XML-style documents offer natural, often hierarchical, structures for data storage.
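To illustrate "data wrapped in user-defined tags", here is a minimal sketch that parses a made-up XML string and pulls its values into a data frame. It assumes the XML package is installed; the tag names (movies, movie, title, year) and the values are invented for this example.

```r
library(XML)

# Made-up XML with user-defined tags
xml_text <- '<movies>
  <movie><title>Film A</title><year>2015</year></movie>
  <movie><title>Film B</title><year>2016</year></movie>
</movies>'

# Parse the string into a document tree
doc <- xmlParse(xml_text, asText = TRUE)

# The hierarchical structure maps directly onto XPath queries
titles <- xpathSApply(doc, "//movie/title", xmlValue)
years <- as.integer(xpathSApply(doc, "//movie/year", xmlValue))

# Assemble the extracted values into a data frame
movies <- data.frame(title = titles, year = years, stringsAsFactors = FALSE)
print(movies)
```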
(unfinished......)
#JSON, or JavaScript Object Notation
A lightweight data-interchange format based on the JavaScript language.
#AJAX, or "Asynchronous JavaScript and XML"
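A minimal sketch of reading JSON in R follows. It assumes the jsonlite package is installed (the post itself does not name a JSON parser), and the JSON text is made up for illustration.

```r
library(jsonlite)

# Made-up JSON: an array of objects that share the same keys
json_text <- '[
  {"title": "Film A", "tickets": 1200},
  {"title": "Film B", "tickets": 850}
]'

# fromJSON() simplifies an array of like-shaped objects into a data frame
movies <- fromJSON(json_text)
str(movies)

total_tickets <- sum(movies$tickets)
print(total_tickets)
```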
____________________________________________________________________________________________
HTTP | R |
XML/HTML | XPath |
JSON | JSON parsers |
AJAX | Selenium |
Plain text | Regular expressions |
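For the last row of the table, here is a minimal sketch of using regular expressions on plain text with base R only. The input lines and variable names are made up for illustration.

```r
# Made-up plain-text lines, as they might look after stripping HTML
lines <- c("Film A: 1,200 tickets on 2016-10-04",
           "Film B: 850 tickets on 2016-10-04")

# Extract the ISO-style date from each line
dates <- regmatches(lines, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", lines))

# Capture the ticket count, then drop the thousands separator before converting
counts_raw <- sub("^.*: ([0-9,]+) tickets.*$", "\\1", lines)
counts <- as.numeric(gsub(",", "", counts_raw))

print(dates)
print(counts)
```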