Parquet是面向分析型业务得列式存储格式

编程方式加载数据

代码示例

package wujiadong_sparkSQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext} /**
* Created by Administrator on 2017/2/3.
*/
object ParquetLoadData {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("ParquetLoadData")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val usersDF = sqlContext.read.parquet("hdfs://master:9000/student/2016113012/spark/users.parquet")
usersDF.registerTempTable("t_users")
//查询name
val usersNameDF = sqlContext.sql("select name from t_users")
//转换成RDD并执行相关操作
usersNameDF.rdd.map(row => "Name:"+row(0)).collect().foreach(username => println(username)) } }

运行结果

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.ParquetLoadData  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/02/03 14:36:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/03 14:36:02 INFO Slf4jLogger: Slf4jLogger started
17/02/03 14:36:03 INFO Remoting: Starting remoting
17/02/03 14:36:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:40895]
17/02/03 14:36:07 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
17/02/03 14:36:20 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/02/03 14:36:21 INFO CodecPool: Got brand-new decompressor [.snappy]
Name:Alyssa
Name:Ben
17/02/03 14:36:21 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/02/03 14:36:21 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

自动分区

hadoop@master:~$ hadoop fs -mkdir /student/2016113012/spark/users
hadoop@master:~$ hadoop fs -mkdir /student/2016113012/spark/users/gender=male/
hadoop@master:~$ hadoop fs -mkdir /student/2016113012/spark/users/gender=male/country=us
hadoop@master:~/wujiadong$ hadoop fs -put users.parquet /student/2016113012/spark/users/gender=male/country=us
hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.ParquetPartitionTest  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/02/03 15:13:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/03 15:13:43 INFO Slf4jLogger: Slf4jLogger started
17/02/03 15:13:43 INFO Remoting: Starting remoting
17/02/03 15:13:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:37709]
17/02/03 15:13:46 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
17/02/03 15:13:59 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
17/02/03 15:13:59 INFO CodecPool: Got brand-new decompressor [.snappy]
+------+--------------+----------------+------+-------+
| name|favorite_color|favorite_numbers|gender|country|
+------+--------------+----------------+------+-------+
|Alyssa| null| [3, 9, 15, 20]| male| us|
| Ben| red| []| male| us|
+------+--------------+----------------+------+-------+ 17/02/03 15:14:00 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/02/03 15:14:00 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 自动推断出了性别和国家

合并元数据

1)读取parquet文件时,将数据源的选项mergeSchema,设置为true

2)使用SQLContext.setConf()方法,将spark.sql.parquet.mergeSchema参数设置为true

案例:合并学生的基本信息和成绩的元数据



最新文章

  1. windows计划任务+批处理文件实现oracle数据库的定时备份与恢复
  2. CentOS 6.5 安装Python 3.5
  3. C++中不同数据类型的互相转换
  4. Appium客户端
  5. FastReport使用一——简介
  6. Scrapy简介
  7. POJ-1088 滑雪 (包含部分自用测试数据)
  8. Docker系列(八)Kubernetes介绍
  9. 04747_Java语言程序设计(一)_第4章_数组和字符串
  10. JavaScript 的 作用域
  11. mysql索引优化建议
  12. java基础-01基本概念
  13. Linux基础(一)
  14. Linux安装常见问题
  15. OO第一单元小结
  16. Android线程
  17. JavaScript面向对象之闭包的理解
  18. js 毫秒转换为标准时间
  19. CF700E:Cool Slogans(SAM,线段树合并)
  20. [转]mac上安装android sdk

热门文章

  1. Android 触摸及手势操作GestureDetector
  2. 智力大冲浪(洛谷P1230)
  3. Flutter入门之有状态组件
  4. delphi,数据类型,字符、浮点、整数、数组
  5. 微信公众号非善意访问的限制 php curl 伪造UA
  6. BaseAction 类
  7. 【我的Android进阶之旅】Android插件化开发学习资料
  8. 我的Android进阶之旅------>Java全角半角的转换方法
  9. Tomcat 自定义默认网站目录
  10. Web Service简单demo