1. Why R?

#1. R is a software environment with a primarily statistical focus.

#2. It produces excellent visualizations.

#3. It can cover a complete workflow, from collecting the data to analyzing it.

2. About the basics

We need to throw ourselves into the preparation with some basic knowledge of HTML, XML, and the logic of regular expressions and XPath, but all of the operations are executed from within R!
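
As a small taste of what "executed from within R" means, here is a minimal sketch of a regular expression pulling numbers out of plain text. The sample string is made up for illustration, and base R functions are used so that nothing extra has to be installed; stringr offers similar helpers.

```r
# Hypothetical sample text, not taken from any real page
line <- "Film A: 1200 tickets; Film B: 345 tickets"

# gregexpr() locates every run of digits, regmatches() pulls them out
matches <- regmatches(line, gregexpr("[0-9]+", line))[[1]]

# Convert the matched strings to numbers
counts <- as.numeric(matches)
print(counts)
```

The pattern `[0-9]+` simply means "one or more digits in a row"; real pages usually need something more specific.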

3. Recommendation

http://www.r-datacollection.com

4. A little case study

# Scrape movie box-office information
library(stringr)
library(XML)
library(maps)
# htmlParse() interprets (parses) the HTML
# and creates a parsed document object
movie_parsed <- htmlParse("http://58921.com/boxoffice/wangpiao/20161004",
                          encoding = "UTF-8")
# The next step: extract the tables/data
# readHTMLTable() identifies and reads out the tables in the document
tables <- readHTMLTable(movie_parsed, stringsAsFactors = FALSE)
# Check what kind of object came back
is.matrix(tables)      # FALSE
is.character(tables)   # FALSE
is.data.frame(tables)  # FALSE
is.list(tables)        # TRUE
# So we got a list (one element per table found)
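
Since the result is a list, the individual tables still have to be pulled out of it. The sketch below shows one way to do that, using a tiny made-up HTML string instead of the real site so that it runs offline; the table contents and column names are assumptions for illustration only.

```r
library(XML)

# Hypothetical page with a single table (not the real box-office page)
html <- paste0(
  "<html><body><table>",
  "<tr><th>title</th><th>tickets</th></tr>",
  "<tr><td>Film A</td><td>1200</td></tr>",
  "<tr><td>Film B</td><td>345</td></tr>",
  "</table></body></html>"
)

# asText = TRUE tells htmlParse() that we pass HTML content, not a file or URL
doc <- htmlParse(html, asText = TRUE)

# header = TRUE uses the first row as column names
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Each list element is one table; take the first one with [[1]]
first <- tables[[1]]

# Cells arrive as character strings, so convert the numeric column by hand
first$tickets <- as.numeric(first$tickets)
print(first)
```

On a real page the list usually holds several tables, so it helps to look at `length(tables)` and `names(tables)` before picking one.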

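Because R's support for Chinese is not very good, running into garbled Chinese characters is normal, so we need more advanced text manipulation tools. (In this example, some columns lost their information completely because the website serves the data in those columns as .png images.)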
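
One common source of garbled Chinese text is a string whose bytes are fine but whose encoding mark is missing or wrong. The following is only a sketch of that idea in base R, using a made-up two-character string; it is not a diagnosis of what happened on this particular site.

```r
# "box office" in Chinese, written with Unicode escapes so the script stays ASCII
x <- "\u7968\u623f"

# Go down to raw bytes and back: the bytes survive, the encoding mark does not
s <- rawToChar(charToRaw(x))
print(Encoding(s))

# Tell R that these bytes are UTF-8
Encoding(s) <- "UTF-8"
print(Encoding(s))

# Count characters and bytes separately
n_chars <- nchar(s, type = "chars")
n_bytes <- nchar(s, type = "bytes")
print(c(chars = n_chars, bytes = n_bytes))
```

`Encoding()` only changes the label on the string. When the bytes themselves are in a different encoding, base R's `iconv()` is the function that actually converts them.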

5. ABC's of...

For browsing the Web, there is a hidden standard behind the scenes that structures how information is displayed.

#HTML, the Hypertext Markup Language

HTML is not a dedicated data storage format, but it usually contains the information we are after. In general, HTML is used to shape how information is displayed.
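
To make this concrete, here is a minimal sketch of fishing information out of display markup with XPath and the XML package. The HTML snippet and the example.com link are invented for illustration.

```r
library(XML)

# Hypothetical snippet: the tags describe layout, but they also wrap the data
html <- paste0(
  "<html><body><h1>Box office</h1>",
  "<p>Top film: <b>Film A</b> ",
  "(<a href='http://example.com/film-a'>details</a>)</p>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# xmlValue returns the text inside a node
heading <- xpathSApply(doc, "//h1", xmlValue)
top_film <- xpathSApply(doc, "//b", xmlValue)

# xmlGetAttr returns the value of an attribute, here the link target
link <- xpathSApply(doc, "//a", xmlGetAttr, "href")

print(c(heading, top_film, link))
```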

#XML, the Extensible Markup Language

The main purpose of XML is to store data. HTML documents are interpreted and transformed into pretty-looking output by browsers, whereas XML is "just" data wrapped in user-defined tags. The user-defined tags make XML much more flexible for storing data than HTML. Both HTML and XML-style documents offer natural, often hierarchical, structures for data storage.
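
A minimal sketch of what "data wrapped in user-defined tags" looks like in practice, again with the XML package. The tag names (`movies`, `movie`, `title`, `tickets`) and the values are made up for this example.

```r
library(XML)

# Hypothetical XML document with tags we invented ourselves
xml_text <- paste0(
  "<movies>",
  "<movie id='1'><title>Film A</title><tickets>1200</tickets></movie>",
  "<movie id='2'><title>Film B</title><tickets>345</tickets></movie>",
  "</movies>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# Walk the hierarchy with XPath and collect the values
titles <- xpathSApply(doc, "//movie/title", xmlValue)
tickets <- as.numeric(xpathSApply(doc, "//movie/tickets", xmlValue))

# Put the pieces back together as a data frame
movies <- data.frame(title = titles, tickets = tickets,
                     stringsAsFactors = FALSE)
print(movies)
```

The same XPath functions work on HTML and XML documents; only the parser (`htmlParse()` versus `xmlParse()`) differs.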

(unfinished......)

#JSON or JavaScript Object Notation

A lightweight data-interchange format based on the JavaScript language.
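
For comparison, here is a short sketch of reading JSON from within R. The jsonlite package is my own assumption here, since these notes do not name a specific parser, and the JSON string is invented.

```r
library(jsonlite)

# Hypothetical JSON: an array of objects with the same fields
json_text <- '[
  {"title": "Film A", "tickets": 1200},
  {"title": "Film B", "tickets": 345}
]'

# fromJSON() simplifies an array of similar objects into a data frame
movies <- fromJSON(json_text)
print(movies)

total_tickets <- sum(movies$tickets)
print(total_tickets)
```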

#AJAX or "Asynchronous JavaScript and XML"

____________________________________________________________________________________________

Technology    Handled with
HTTP          R
XML/HTML      XPath
JSON          JSON parsers
AJAX          Selenium
Plain text    Regular expressions
