【Question 01】

  When converting Tweet data to a CSV file, commas inside a field (e.g. location: Sydney, NSW) corrupt the file by creating extra columns.

  The solution is to wrap the field in double quotation marks, like this:

fo.write("\"" + str(tweet["user"]["location"]) + "\"")
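  As an alternative to adding the quotation marks by hand, Python's built-in csv module can do the quoting. The following is only a minimal sketch: the sample tweet dict and the in-memory buffer are assumptions for illustration, not part of the original script.

```python
import csv
import io

# Hypothetical sample record shaped like the tweet dict used above.
tweet = {"user": {"location": "Sydney, NSW"}, "text": "hello"}

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerow([tweet["user"]["location"], tweet["text"]])

line = buffer.getvalue()

# Reading the line back gives two columns, not three.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```

  With QUOTE_ALL every field is wrapped in double quotation marks, so the comma in the location stays inside one column.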

【Question 02】

  When a CSV file is opened in Excel, the text sometimes appears garbled, even though it displays correctly in Notepad.

  ref: What are the ways to fix garbled text when opening a CSV file?

  One solution is to open the file with Notepad++.

  Another solution is to write a byte order mark (BOM) at the beginning of the output file, like this:

fo = open(r"D:\Twitter Data\Data\test\tweets.csv", "w", encoding="utf-8")
fo.write("\ufeff")  # byte order mark (BOM), so Excel detects the file as UTF-8
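  The same effect can probably be had without writing the BOM by hand, by opening the file with the "utf-8-sig" encoding. This is a sketch under assumptions: the temporary path and the sample rows are made up for the example.

```python
import csv
import os
import tempfile

# Hypothetical output path; the original writes to a fixed folder instead.
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# "utf-8-sig" writes the BOM automatically at the start of the file.
# newline="" is the setting the csv module expects for files it writes.
with open(path, "w", encoding="utf-8-sig", newline="") as fo:
    writer = csv.writer(fo)
    writer.writerow(["location", "text"])
    writer.writerow(["Sydney, NSW", "你好"])

# Check that the file really starts with the UTF-8 BOM bytes.
with open(path, "rb") as raw:
    first_bytes = raw.read(3)
print(first_bytes)
```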

【Question 03】

  Tweet text can contain line breaks, double quotation marks, and single quotation marks. These characters cause errors when the CSV file is created.

  So we should replace those characters with a space or remove them, like this:

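text = str(tweet["text"])
text = text.replace("\r", " ")  # carriage return -> space
text = text.replace("\n", " ")  # line feed -> space
text = text.replace("\"", "")   # drop double quotation marks
text = text.replace("\'", "")   # drop single quotation marks
fo.write("\"" + text + "\"")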

  Both tweet["user"]["location"] and tweet["text"] are free-form fields: users can write whatever they want, so these two attributes are the most likely to cause errors.
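  If the original characters should be kept rather than stripped, the csv module can escape them instead. The sketch below uses a made-up text value and an in-memory buffer; it is a different approach from the replace() calls above, not the method this post used.

```python
import csv
import io

# Hypothetical tweet text with a line break and both kinds of quotation marks.
text = 'She said "hi"\nand it\'s fine, really'

buffer = io.StringIO()  # stands in for the real output file
csv.writer(buffer).writerow(["Sydney, NSW", text])
encoded = buffer.getvalue()

# The writer doubles the embedded double quotes and quotes the whole field,
# so the reader returns a single row with the text unchanged.
rows = list(csv.reader(io.StringIO(encoded)))
print(rows)
```

  The trade-off is that one record now spans two physical lines in the file, so any tool that reads it must be a real CSV parser and not a simple line splitter.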

【Question 04】

  After converting the Tweets to a CSV file, I can't open the file with pandas.read_csv(). There must be a problem somewhere in the data. Since the CSV file has more than 100,000 rows, how can I locate the bad line?

  The solution is to convert the first 10,000 rows; if there are no errors, convert the next 10,000 rows. When an error occurs, narrow the range: for example, if the error occurs between rows 20,000 and 30,000, change the range to 20,000 to 25,000. Repeating this several times locates the bad line and reveals the real problem. In this specific case, most problems came from content containing line breaks, double quotation marks, etc.

  The code looks like this:

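...

count = 0
for line in tweets_file:
    try:
        count += 1
        # skip everything before the window being tested
        if count < 10000:
            continue
        ...
        # stop once the end of the window is reached
        if count > 20000:
            break
    except Exception:
        continue
...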
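  Instead of narrowing the range by hand, it may be quicker to scan the finished CSV file once and report every record whose number of fields is wrong. This is a different technique from the one above, and only a sketch: find_bad_rows, the expected column count, and the sample data are all hypothetical.

```python
import csv
import io

def find_bad_rows(lines, expected_columns):
    """Return (line_number, field_count) for each record with the wrong field count."""
    bad = []
    reader = csv.reader(lines)
    for row in reader:
        if len(row) != expected_columns:
            # reader.line_num is the physical line on which this record ended
            bad.append((reader.line_num, len(row)))
    return bad

# Hypothetical sample: line 3 has an unquoted comma in the location.
sample = io.StringIO('id,location\n1,"Sydney, NSW"\n2,Sydney, NSW\n3,Perth\n')
bad_rows = find_bad_rows(sample, expected_columns=2)
print(bad_rows)
```

  For a real file, the same function could be given a file object opened with newline="" in place of the in-memory sample.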
