Apache Spark is an open source cluster computing system that aims to make data analytics fast, both to run and to write.

BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.

Spark Components vs. Hadoop Components
Spark Core <------> Apache Hadoop MapReduce
Spark Streaming <------> Apache Storm
Spark SQL <------> Apache Hive
Spark GraphX <------> MPI (taobao)
Spark MLlib <------> Apache Mahout

BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.
BlinkDB rests on two key ideas:

  • An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from original data over time.
  • A dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy and/or response time requirements.
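
The idea of answering from a sample and attaching an error bar can be sketched in plain Scala. This is only an illustration under simple assumptions (uniform sampling, a normal approximation for the 95% interval), not BlinkDB's actual implementation; `ApproxQuery`, `sample`, `approxMean` and `answer` are hypothetical names.

```scala
import scala.util.Random

object ApproxQuery {
  /** Estimated mean plus the half-width of an approximate 95% confidence interval. */
  final case class Estimate(mean: Double, errorBar: Double)

  // Draw a uniform sample of n rows without replacement (hypothetical helper).
  def sample(data: Vector[Double], n: Int, rng: Random): Vector[Double] =
    rng.shuffle(data).take(n)

  // Approximate AVG(x) from the sample; 1.96 is the normal quantile for 95%.
  def approxMean(s: Vector[Double]): Estimate = {
    require(s.size > 1, "need at least two sampled rows")
    val mean = s.sum / s.size
    val variance = s.map(x => (x - mean) * (x - mean)).sum / (s.size - 1)
    Estimate(mean, 1.96 * math.sqrt(variance / s.size))
  }

  // Toy "sample selection": try the sample sizes from smallest to largest and
  // return the first estimate whose error bar meets the accuracy target,
  // falling back to the largest sample if none does.
  def answer(data: Vector[Double], sizes: Seq[Int], maxError: Double, rng: Random): Estimate = {
    require(sizes.nonEmpty, "need at least one sample size")
    val estimates = sizes.sorted.map(n => approxMean(sample(data, n, rng)))
    estimates.find(_.errorBar <= maxError).getOrElse(estimates.last)
  }
}
```

A call such as `ApproxQuery.answer(data, Seq(100, 1000, 10000), 0.5, new Random(1))` returns both the estimate and its error bar, which is the kind of annotated result described above.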

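Why Spark is fast:

  • In-memory computing: intermediate results can stay in memory instead of being written to disk between steps.
  • Directed Acyclic Graph (DAG) engine: the scheduler sees the whole computation graph in advance, so it can optimize it as a whole.
  • Delay scheduling: a task briefly waits for a free slot on a node that holds its input data before falling back to a non-local node.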
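
Delay scheduling can be illustrated with a small plain-Scala model. This is a sketch of the general idea, not Spark's scheduler code; `DelayScheduling`, `Offer` and `place` are hypothetical names, and the single wait threshold is an assumption (Spark exposes a comparable setting, `spark.locality.wait`).

```scala
object DelayScheduling {
  // A free slot offered by the cluster at a given time (ms) on a given node.
  final case class Offer(timeMs: Long, node: String)

  /** Returns the first offer the task accepts, or None if it accepts none.
    * The task prefers `preferredNode` and declines non-local offers until it
    * has waited at least `maxWaitMs` since `submittedMs`.
    */
  def place(preferredNode: String,
            submittedMs: Long,
            maxWaitMs: Long,
            offers: Seq[Offer]): Option[Offer] =
    offers.sortBy(_.timeMs).find { o =>
      o.node == preferredNode || (o.timeMs - submittedMs) >= maxWaitMs
    }
}
```

With a 3000 ms wait, a task that prefers node "a" skips an immediate slot on "b" and takes a slot on "a" that frees up 100 ms later. It only falls back to "b" once the wait has expired.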

Resilient Distributed Dataset (RDD)

Internally, each RDD is characterized by five main properties:

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
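
The five properties above can be modelled with a small single-machine sketch. This is not Spark's `RDD` class. `MiniRDD`, `Parallelized` and `Mapped` are hypothetical, simplified names chosen to show how a derived data set needs only its parent and a function to recompute any split.

```scala
object RddSketch {
  // Hypothetical, single-machine stand-in for an RDD.
  abstract class MiniRDD[T] {
    def partitions: Seq[Int]                      // 1. a list of partitions (here: indices)
    def compute(split: Int): Iterator[T]          // 2. a function for computing each split
    def dependencies: Seq[MiniRDD[_]]             // 3. dependencies on other RDDs
    def partitioner: Option[Any => Int] = None    // 4. optional partitioner for key-value data
    def preferredLocations(split: Int): Seq[String] = Nil // 5. optional preferred locations

    // Toy action: compute every split in order and concatenate the results.
    def collect(): Seq[T] = partitions.flatMap(p => compute(p))
  }

  // Source data set: elements are dealt round-robin into `numSlices` partitions.
  class Parallelized[T](data: Seq[T], numSlices: Int) extends MiniRDD[T] {
    def partitions: Seq[Int] = 0 until numSlices
    def compute(split: Int): Iterator[T] =
      data.iterator.zipWithIndex.collect { case (x, i) if i % numSlices == split => x }
    def dependencies: Seq[MiniRDD[_]] = Nil
  }

  // Derived data set: same partitions as the parent, computed from the parent's split.
  class Mapped[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
    def partitions: Seq[Int] = parent.partitions
    def compute(split: Int): Iterator[U] = parent.compute(split).map(f)
    def dependencies: Seq[MiniRDD[_]] = Seq(parent)
  }
}
```

Because `Mapped` stores only its parent and the function `f`, a lost split can be rebuilt by calling `compute` again. That is the sense in which the data set is resilient.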

Storage Strategy

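class StorageLevel private(
    private var useDisk_ : Boolean,
    private var useMemory_ : Boolean,
    private var deserialized_ : Boolean,
    private var replication_ : Int = 1)

object StorageLevel {
  // The constructor is private, so the predefined levels live in the companion object.
  // MEMORY_ONLY: no disk, in memory, deserialized, replication 1.
  val MEMORY_ONLY = new StorageLevel(false, true, true)
}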
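
To make the four flags easier to read, here is a simplified stand-in for the class above. `StorageLevels`, `Level` and `describe` are hypothetical names. The flag combinations are the ones I understand Spark to use for its named levels with this four-argument constructor, so treat them as an assumption and check them against the Spark version in use.

```scala
object StorageLevels {
  // Simplified, public stand-in for StorageLevel (illustration only).
  final case class Level(useDisk: Boolean,
                         useMemory: Boolean,
                         deserialized: Boolean,
                         replication: Int = 1) {
    // Human-readable summary of what the flags mean together.
    def describe: String = {
      val where = (useMemory, useDisk) match {
        case (true, true)   => "memory, spilling to disk"
        case (true, false)  => "memory only"
        case (false, true)  => "disk only"
        case (false, false) => "not stored"
      }
      val form = if (deserialized) "deserialized objects" else "serialized bytes"
      s"$where, $form, ${replication}x replicated"
    }
  }

  val MEMORY_ONLY     = Level(false, true, true)
  val MEMORY_ONLY_SER = Level(false, true, false)
  val MEMORY_AND_DISK = Level(true, true, true)
  val DISK_ONLY       = Level(true, false, false)
  val MEMORY_ONLY_2   = Level(false, true, true, 2)
}
```

For example, `StorageLevels.MEMORY_AND_DISK.describe` spells out that partitions are kept in memory as objects and spilled to disk when they do not fit.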

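RDD, transformations & actions

Lazy evaluation: transformations (such as map and filter) only record how to derive a new RDD; nothing is computed until an action (such as count or collect) asks for a result.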
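
The split between lazy transformations and eager actions can be shown with a toy class in plain Scala. This is a sketch, not Spark's API: `LazyEval`, `LazyData` and `demo` are hypothetical names, and the data lives in a local iterator rather than on a cluster.

```scala
object LazyEval {
  // Toy data set: transformations compose functions, actions run them.
  final class LazyData[T](source: () => Iterator[T]) {
    // Transformations: build a new LazyData, compute nothing yet.
    def map[U](f: T => U): LazyData[U] = new LazyData(() => source().map(f))
    def filter(p: T => Boolean): LazyData[T] = new LazyData(() => source().filter(p))

    // Actions: actually pull the data through the recorded functions.
    def collect(): List[T] = source().toList
    def count(): Int = source().size
  }

  /** Returns (calls to the map function before the action, result, calls after). */
  def demo(): (Int, List[Int], Int) = {
    var calls = 0
    val squares = new LazyData(() => Iterator(1, 2, 3, 4)).map { x => calls += 1; x * x }
    val callsBeforeAction = calls // still 0: map has only been recorded
    val evens = squares.filter(_ % 2 == 0).collect()
    (callsBeforeAction, evens, calls)
  }
}
```

`LazyEval.demo()` returns `(0, List(4, 16), 4)`: the mapping function does not run until `collect()` is called, and then it runs once per element.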
