Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
An RDD is a read-only, partitioned collection of records that can only be created through deterministic operations on either data in stable storage or other RDDs.
RDDs are best suited for batch analytics that apply the same operation to all elements of a dataset.
Actions launch a computation to return a value to the program or write data to external storage.
Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.
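The notes above (read-only datasets derived from stable storage or other RDDs, actions that launch computation, persistence in memory) can be illustrated with a minimal single-process sketch. `ToyRDD` and all of its methods are hypothetical names for illustration, not Spark's API, and the sketch ignores partitioning and disk spilling entirely.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD; not Spark's API."""

    def __init__(self, compute, parent=None):
        self._compute = compute   # zero-argument function yielding the records
        self.parent = parent      # lineage: the RDD this one was derived from
        self._persist = False
        self._cache = None

    @classmethod
    def from_stable_storage(cls, records):
        data = tuple(records)     # immutable snapshot standing in for a file
        return cls(lambda: iter(data))

    # Transformations are lazy: each returns a new RDD and runs nothing.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._records()), parent=self)

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._records() if pred(x)), parent=self)

    def persist(self):
        self._persist = True      # keep records in memory after first use
        return self

    def _records(self):
        if self._cache is not None:
            return iter(self._cache)
        records = self._compute()
        if self._persist:
            self._cache = list(records)
            return iter(self._cache)
        return records

    # Actions launch the computation and return a value to the program.
    def collect(self):
        return list(self._records())

    def count(self):
        return sum(1 for _ in self._records())


evaluated = []

def times_ten(x):
    evaluated.append(x)           # records which elements were actually computed
    return x * 10

base = ToyRDD.from_stable_storage([1, 2, 3, 4])
evens = base.filter(lambda x: x % 2 == 0).map(times_ten).persist()
before_action = list(evaluated)   # still empty: transformations alone run nothing
result = evens.collect()          # first action computes and caches the records
total = evens.count()             # second action is served from the cached records
```

In this sketch `times_ten` runs only during the first action; the second action reuses the persisted records instead of recomputing them.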
- iterator(p, parentIters)
- a set of partitions
- a set of dependencies on parent RDDs
- a function for computing the dataset based on its parents
- metadata about its partitioning scheme and data placement
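The pieces listed above can be put together in a rough sketch of how an RDD might be represented and how a single lost partition could be rebuilt from its parents. `SketchRDD` and its field names are hypothetical, and the sketch assumes only narrow one-to-one dependencies (partition p of a child depends on partition p of each parent).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class SketchRDD:
    partitions: List[int]                                # a set of partitions
    dependencies: List["SketchRDD"]                      # parent RDDs
    compute: Callable[[int, List[Iterator]], Iterator]   # iterator(p, parentIters)
    partitioner: Optional[str] = None                    # partitioning scheme metadata
    preferred_locations: Dict[int, List[str]] = field(default_factory=dict)

    def iterator(self, p: int) -> Iterator:
        # Assumption: one-to-one narrow dependencies, so only partition p
        # of each parent is needed to compute partition p of this RDD.
        parent_iters = [dep.iterator(p) for dep in self.dependencies]
        return self.compute(p, parent_iters)


blocks = [[1, 2], [3, 4], [5, 6]]   # stand-in for blocks in stable storage
source_reads = []

def read_block(p, _parent_iters):
    source_reads.append(p)           # record which blocks get re-read
    return iter(blocks[p])

source = SketchRDD([0, 1, 2], [], read_block,
                   preferred_locations={0: ["node-a"], 1: ["node-b"], 2: ["node-c"]})
squares = SketchRDD(source.partitions, [source],
                    lambda p, parent_iters: (x * x for x in parent_iters[0]))

# Rebuilding "lost" partition 1 touches only partition 1 of the parent.
recovered = list(squares.iterator(1))
```

Because the dependency is narrow, recovering one partition re-reads a single source block rather than the whole dataset.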
Compared to Distributed Shared Memory
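- coarse-grained (bulk) writes instead of fine-grained updates
- efficient fault tolerance: lost partitions are recomputed from lineage, without checkpointing overhead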
- immutable nature lets a system mitigate slow nodes by running backup copies of slow tasks
- bulk operations can be scheduled based on data locality
- degrade gracefully when memory is short: partitions that do not fit in RAM can be stored on disk
- assign tasks to nodes based on data locality using delay scheduling
- materialize intermediate records on the nodes holding parent partitions for wide dependencies
- class shipping
- modified code generation
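The delay scheduling item above can be sketched roughly as follows: when a node has a free slot, prefer a task whose data lives on that node, and only fall back to a non-local task once it has waited past a threshold. The function `pick_task`, the task dictionary layout, and the `max_wait` parameter are assumptions made for illustration; this is a simplification, not Spark's scheduler.

```python
def pick_task(pending, free_node, now, max_wait):
    """Choose a task for a free slot on free_node, or None to keep waiting.

    pending: list of dicts with keys 'id', 'preferred' (set of node names
    holding the task's input) and 'submitted' (submission time in seconds).
    """
    # First choice: a task whose input data is on the node offering the slot.
    for task in pending:
        if free_node in task["preferred"]:
            return task
    # Otherwise run a task non-locally only if it has waited long enough.
    for task in pending:
        if now - task["submitted"] >= max_wait:
            return task
    # Skip this offer; a local slot may free up shortly.
    return None


pending = [
    {"id": "t1", "preferred": {"node-a"}, "submitted": 0.0},
    {"id": "t2", "preferred": {"node-b"}, "submitted": 0.0},
]
local_choice = pick_task(pending, "node-b", now=1.0, max_wait=3.0)
skipped = pick_task(pending, "node-c", now=1.0, max_wait=3.0)
fallback = pick_task(pending, "node-c", now=3.0, max_wait=3.0)
```

The short wait trades a little latency for a better chance of running each task next to its data.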
Support for checkpointing
- REPLICATE flag passed to persist
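One reason checkpointing helps is that it cuts long lineage chains: once the records are saved, recovery no longer needs to replay every ancestor. The sketch below illustrates that idea only; `LineageNode` and its methods are hypothetical names, and an in-memory tuple stands in for stable storage.

```python
class LineageNode:
    """Hypothetical dataset node that remembers how it was derived."""

    def __init__(self, compute, parent=None):
        self.compute = compute    # function from parent records to own records
        self.parent = parent
        self.saved = None         # stand-in for a copy in stable storage

    def records(self):
        if self.saved is not None:
            return list(self.saved)
        parent_records = self.parent.records() if self.parent else []
        return self.compute(parent_records)

    def checkpoint(self):
        # Save the records, then drop the parent link: the lineage is truncated.
        self.saved = tuple(self.records())
        self.parent = None

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth


node = LineageNode(lambda _: [1, 2, 3])
for _ in range(50):               # an iterative job builds a long chain
    node = LineageNode(lambda rs: [r + 1 for r in rs], parent=node)

depth_before = node.lineage_depth()
node.checkpoint()
depth_after = node.lineage_depth()
final_records = node.records()
```

After the checkpoint, reading the records no longer walks the 50 ancestor steps.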