Spark Big Data Platform
2024-09-28 06:23:48
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.

Spark Component | vs. | Hadoop Ecosystem Counterpart |
---|---|---|
Spark Core | <-> | Apache Hadoop MapReduce |
Spark Streaming | <-> | Apache Storm |
Spark SQL | <-> | Apache Hive |
Spark GraphX | <-> | MPI (Taobao) |
Spark MLlib | <-> | Apache Mahout |
BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.
BlinkDB relies on two key ideas:
- An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time.
- A dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy and/or response time requirements.
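
The idea of answering from a sample and attaching an error bar can be sketched in a few lines. This is not BlinkDB code: the object `SampleEstimate`, both of its methods, and the use of a 95% normal-approximation interval are assumptions made purely for illustration.

```scala
// A minimal sketch of approximate aggregation with error bars.
// NOT BlinkDB code: every name here is hypothetical.
object SampleEstimate {
  // Returns the sample mean and the half-width of an approximate 95% confidence interval.
  def meanWithErrorBar(sample: Seq[Double]): (Double, Double) = {
    require(sample.size > 1, "need at least two values to estimate the variance")
    val n = sample.size
    val mean = sample.sum / n
    // Unbiased sample variance (divides by n - 1).
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (n - 1)
    val halfWidth = 1.96 * math.sqrt(variance / n)
    (mean, halfWidth)
  }

  // Picks the smallest available sample size whose predicted error bar meets the target,
  // given an assumed standard deviation of the data.
  def chooseSampleSize(available: Seq[Int], stdDev: Double, targetError: Double): Option[Int] =
    available.sorted.find(n => 1.96 * stdDev / math.sqrt(n.toDouble) <= targetError)

  def main(args: Array[String]): Unit = {
    val (mean, err) = meanWithErrorBar(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
    println(f"estimate = $mean%.2f +/- $err%.2f")
    println(chooseSampleSize(Seq(100, 1000, 10000), stdDev = 10.0, targetError = 1.0))
  }
}
```

A tighter `targetError` pushes `chooseSampleSize` toward a larger sample, which is the accuracy versus response time trade-off described above.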
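Why Spark is fast:
- In-memory computing: intermediate data can be cached in memory and reused across operations instead of being written to disk between steps.
- Directed Acyclic Graph (DAG) engine: the scheduler sees the whole computation graph in advance, so it can optimize execution, for example by pipelining consecutive operations into one stage.
- Delay scheduling: a task waits briefly for a slot on a node that already holds its input data, which improves data locality.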
Resilient Distributed Dataset (RDD): each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
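
The five properties listed above can be modeled on a single machine to see how they fit together. The sketch below is not Spark's `RDD` class: `MiniRDD`, `Split`, `RangeRDD`, and `MappedRDD` are hypothetical names, and the partitioner is reduced to a plain function from key to partition index.

```scala
// A simplified, single-machine model of the five RDD properties.
// All names are hypothetical; this is not Spark's RDD class.
object MiniRddDemo {
  final case class Split(index: Int)

  trait MiniRDD[T] {
    def partitions: Seq[Split]                              // 1. a list of partitions
    def compute(split: Split): Iterator[T]                  // 2. a function for computing each split
    def dependencies: Seq[MiniRDD[_]]                       // 3. dependencies on other RDDs
    def partitioner: Option[Any => Int] = None              // 4. optional partitioner (key -> partition index)
    def preferredLocations(split: Split): Seq[String] = Nil // 5. optional preferred locations
    // Computes every split in order and concatenates the results.
    def collect(): Seq[T] = partitions.flatMap(p => compute(p).toList).toList
  }

  // Source RDD: the integers in [start, end), dealt round-robin into numSplits partitions.
  class RangeRDD(start: Int, end: Int, numSplits: Int) extends MiniRDD[Int] {
    def partitions: Seq[Split] = (0 until numSplits).map(i => Split(i))
    def compute(split: Split): Iterator[Int] =
      (start until end).iterator.filter(x => (x - start) % numSplits == split.index)
    def dependencies: Seq[MiniRDD[_]] = Nil
  }

  // Derived RDD: keeps its parent's partitions and computes a split by mapping the parent's split.
  class MappedRDD[A, B](parent: MiniRDD[A], f: A => B) extends MiniRDD[B] {
    def partitions: Seq[Split] = parent.partitions
    def compute(split: Split): Iterator[B] = parent.compute(split).map(f)
    def dependencies: Seq[MiniRDD[_]] = Seq(parent)
  }

  def main(args: Array[String]): Unit = {
    val doubled = new MappedRDD[Int, Int](new RangeRDD(0, 6, 2), _ * 2)
    println(doubled.collect())
  }
}
```

Because `MappedRDD` only stores its parent and a function, a lost split can be recomputed from the parent, which is the intuition behind the "resilient" part of the name.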
Storage Strategy
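class StorageLevel private(
    private var useDisk_ : Boolean,      // keep blocks on disk
    private var useMemory_ : Boolean,    // keep blocks in memory
    private var deserialized_ : Boolean, // store deserialized objects rather than serialized bytes
    private var replication_ : Int = 1)  // number of replicas

// Defined in the companion object StorageLevel, since the constructor is private.
val MEMORY_ONLY = new StorageLevel(false, true, true)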
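
To make the flag combinations easier to read, here is a small stand-in for the three-flag constructor shown above. `Level` and `describe` are hypothetical names, not Spark API. The flag values follow the corresponding named levels of the three-flag `StorageLevel`, from memory of that version.

```scala
// Simplified stand-in for the three-flag StorageLevel shown above.
// `Level` and `describe` are hypothetical names used only for illustration.
object StorageLevelSketch {
  final case class Level(useDisk: Boolean, useMemory: Boolean,
                         deserialized: Boolean, replication: Int = 1)

  val DISK_ONLY       = Level(useDisk = true,  useMemory = false, deserialized = false)
  val MEMORY_ONLY     = Level(useDisk = false, useMemory = true,  deserialized = true)
  val MEMORY_ONLY_SER = Level(useDisk = false, useMemory = true,  deserialized = false)
  val MEMORY_AND_DISK = Level(useDisk = true,  useMemory = true,  deserialized = true)
  val MEMORY_ONLY_2   = MEMORY_ONLY.copy(replication = 2)

  // Turns the flags into a human-readable summary.
  def describe(l: Level): String = {
    val where = (l.useMemory, l.useDisk) match {
      case (true, true)   => "memory, spilling to disk"
      case (true, false)  => "memory only"
      case (false, true)  => "disk only"
      case (false, false) => "not stored"
    }
    val form = if (l.deserialized) "deserialized objects" else "serialized bytes"
    s"$where, $form, ${l.replication}x replicated"
  }

  def main(args: Array[String]): Unit =
    Seq(DISK_ONLY, MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_ONLY_2)
      .foreach(l => println(describe(l)))
}
```

Serialized storage generally takes less space at the cost of extra CPU work on each read, which is why the `deserialized` flag exists alongside the location flags.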
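RDD operations: transformations and actions
- Transformations (e.g. map, filter) are lazily evaluated: they only record how to derive a new RDD, and nothing is computed until an action (e.g. count, collect) requests a result.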
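
Lazy evaluation can be observed without a cluster. The sketch below uses a plain Scala `Iterator` as a stand-in for an RDD, which is an assumption made for illustration; `LazyDemo` and `run` are hypothetical names. The counter shows that the mapping function is not called until the final "action" consumes the pipeline.

```scala
// Lazy evaluation sketched with a plain Scala Iterator (not a Spark RDD).
object LazyDemo {
  // Returns (function calls before the action, function calls after it, result of the action).
  def run(): (Int, Int, Int) = {
    var calls = 0
    // "Transformations": these only describe the pipeline; nothing runs yet.
    val pipeline = (1 to 5).iterator
      .map { x => calls += 1; x * 10 }
      .filter(_ > 20)
    val before = calls
    // "Action": summing forces the whole pipeline to run.
    val total = pipeline.sum
    (before, calls, total)
  }

  def main(args: Array[String]): Unit = println(run())
}
```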
