一、

Spark Streaming 构建在Spark core API之上,具备可伸缩,高吞吐,可容错的流处理模块。

1)支持多种数据源,如Kafka,Flume,Socket,文件等;

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies.

2)处理完成数据可写入Kafka,Hdfs,本地文件等多种地方;

DStream:

Spark Streaming对持续流入的数据有个高层的抽像:

It represents a continuous stream of data

a DStream is represented by a continuous series of RDDs,Each RDD in a DStream contains data from a certain interval

Any operation applied on a DStream translates to operations on the underlying RDDs.

什么是RDD?

RDD是Resilient Distributed Dataset的缩写,中文译为弹性分布式数据集,是Spark中最重要的概念。

RDD是只读的、分区的,可容错的数据集合。

何为弹性?

RDD可在内存、磁盘之间任意切换

RDD可以转换成其它RDD,可由其它RDD生成

RDD可存储任意类型数据

二、基本概念

1)add dependency

<dependency>

<groupId>org.apache.spark</groupId>

<artifactId>spark-streaming_2.11</artifactId>

<version>2.3.1</version>

</dependency>

其它想关依赖查询:

https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.2.0

2)文件作为DStream源,是如何被监控的?

1)文件格式须一致

2)根据modify time开成流,而非create time

3)处理时,当前文件变更不会在此window处理,即不会reread

4)可以调用 FileSystem.setTimes()来修改文件时间,使其在下个window被处理,即使文件内容未被修改过

三、Transform operation

window operation

Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data.

every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.

在一个时间窗口内的RDD被合并为一个RDD来处理。

Any window operation needs to specify two parameters:

window length: The duration of the window

sliding interval: The interval at which the window operation if performed

四、Output operation

使用foreachRDD

dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently.

CheckPoint概念

Performance Tuning

Fault-tolerance Semantics

最新文章

  1. dg_MeetingRoom 居中显示
  2. Windows下安装Python3.4.2
  3. ThinkPHP - 连贯操作
  4. Qt中文件操作遇到的(变量,容器,结构体)
  5. 微软Visual Studio二十周年:VS2017于3月7日发布
  6. OpenID Connect 是什么?
  7. python3全栈开发-面向对象的三大特性(继承,多态,封装)之继承
  8. .NET redis cluster
  9. document.getElementsByClassName返回的是一个数组
  10. Linux之常用软件-服务
  11. webstorm快捷键 webstorm keymap内置快捷键英文翻译、中英对照说明
  12. 2018.4.27 java容器
  13. bash參考手冊之五(shell变量)续三
  14. hdu5125 树状数组+dp
  15. 虚拟机下Linux系统如何设置IP地址
  16. ELK安装过程
  17. 关于swift语言中导入OC三方类找不到头文件的解决方法
  18. Android O 正式版新功能
  19. 简要谈谈javascript bind 方法
  20. Java-IO读写文件简单操作

热门文章

  1. Connection parameters are correct , SSL not enabled
  2. [contest 781] 9.6
  3. NOIP2003加分二叉树
  4. [LeetCode] 231. Power of Two ☆(是否2 的幂)
  5. Java反序列化修复方案
  6. static与全局变量区别
  7. [IOS微信] Unicode码 转化为字符串
  8. AI工具(星形工具)(光晕工具)(移动复制)(柜子绘制)5.12
  9. AI新建文件可以新建多个画板5.2
  10. linux用户管理 用户和用户组信息