Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

An RDD is a read-only, partitioned collection of records that can only be created through deterministic operations on either data in stable storage or other RDDs.

RDDs are best suited for batch analytics that apply the same operation to all elements of a dataset.


Transformations: lazy operations that define a new RDD

  • map
  • filter
  • flatMap
  • sample
  • groupByKey
  • reduceByKey
  • union
  • join
  • cogroup
  • crossProduct
  • mapValues
  • sort
  • partitionBy
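
The laziness of these operations can be illustrated with a minimal pure-Python sketch. It does not use Spark; `LazyDataset` and `traced_square` are hypothetical names introduced only for illustration. Each transformation records what to do and returns a new dataset, and nothing is computed until `collect` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe, not computed data."""

    def __init__(self, source, ops=()):
        self._source = source      # any re-iterable collection, e.g. a list
        self._ops = tuple(ops)     # recorded transformations, in order

    def map(self, fn):
        # Returns a new dataset; the original is never modified.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        """Only here are the recorded transformations actually applied."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []

def traced_square(x):
    calls.append(x)                # record that the function really ran
    return x * x

squares = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: map and filter only recorded work
result = squares.collect()         # the computation happens here
```

Because every transformation returns a new object, the recorded chain of operations also serves as a description of how the data was derived.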


Actions: launch a computation to return a value to the program or write data to external storage

  • count
  • collect
  • save
  • reduce
  • lookup
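
A rough sketch of how some of these actions behave over partitioned data, in plain Python rather than Spark. The `partitions` list and the helper functions are assumptions made for illustration; `reduce_values` assumes the operator is associative and commutative, so it can be applied inside each partition first.

```python
from functools import reduce as fold

# Hypothetical in-memory stand-in for a partitioned dataset of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2)],
    [("c", 3)],
    [("a", 4), ("d", 5)],
]

def count(parts):
    """Sum the per-partition record counts."""
    return sum(len(p) for p in parts)

def collect(parts):
    """Bring every record back to the caller as one list."""
    return [rec for p in parts for rec in p]

def reduce_values(parts, op):
    """Reduce inside each non-empty partition, then combine the partial results."""
    partials = [fold(op, (v for _, v in p)) for p in parts if p]
    return fold(op, partials)

def lookup(parts, key):
    """Return all values stored under the given key."""
    return [v for p in parts for k, v in p if k == key]

total_records = count(partitions)
all_records = collect(partitions)
value_sum = reduce_values(partitions, lambda x, y: x + y)
values_for_a = lookup(partitions, "a")
```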


Users can also control:

  • persistence
  • partitioning
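
To make the value of controlled partitioning concrete, here is a small sketch of hash partitioning in plain Python. The function names and sample records are hypothetical, and `zlib.crc32` is used only as a stable hash for the example. When two datasets are partitioned the same way, a join can match partition i of one side against partition i of the other without moving records between partitions.

```python
import zlib

def hash_partition(key, num_partitions):
    """Map a string key to a partition index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_by(records, num_partitions):
    """Place each (key, value) record in the partition chosen by its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash_partition(key, num_partitions)].append((key, value))
    return parts

def local_join(left_parts, right_parts):
    """Join partition i of the left side only with partition i of the right side.

    Assumes both sides used the same partitioner and that keys on the
    right side are unique.
    """
    out = []
    for left, right in zip(left_parts, right_parts):
        names = dict(right)
        out.extend((k, (v, names[k])) for k, v in left if k in names)
    return out

visits = partition_by([("u1", "/home"), ("u2", "/cart"), ("u1", "/pay")], 4)
profiles = partition_by([("u1", "Ann"), ("u2", "Bob")], 4)
joined = local_join(visits, profiles)
```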

Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.


Each RDD exposes the following interface:

  • partitions()
  • preferredLocations(p)
  • dependencies()
  • iterator(p, parentIters)
  • partitioner()


Each RDD is represented by:

  • a set of partitions
  • a set of dependencies on parent RDDs
    • narrow
    • wide
  • a function for computing the dataset based on its parents
  • metadata about its partitioning scheme and data placement
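
The representation above can be sketched as a small class hierarchy. This is a toy model in Python, not Spark's actual code: `ToyRDD`, `SourceRDD`, `MappedRDD`, `Dependency`, and `compute` are hypothetical names, and `compute` only handles one-to-one narrow dependencies.

```python
from collections import namedtuple

# "narrow": each parent partition feeds at most one child partition.
# "wide": a parent partition may feed many child partitions (needs a shuffle).
Dependency = namedtuple("Dependency", ["parent", "kind"])


class ToyRDD:
    def partitions(self):
        raise NotImplementedError

    def preferred_locations(self, p):
        return []                  # no placement hint by default

    def dependencies(self):
        return []                  # no parents by default

    def iterator(self, p, parent_iters):
        raise NotImplementedError

    def partitioner(self):
        return None                # no known partitioning scheme by default


class SourceRDD(ToyRDD):
    """Stands in for data read from stable storage."""

    def __init__(self, chunks, hosts):
        self._chunks, self._hosts = chunks, hosts

    def partitions(self):
        return list(range(len(self._chunks)))

    def preferred_locations(self, p):
        return [self._hosts[p]]    # the node that holds this chunk

    def iterator(self, p, parent_iters):
        return iter(self._chunks[p])


class MappedRDD(ToyRDD):
    """Result of map: same partitions as the parent, narrow dependency."""

    def __init__(self, parent, fn):
        self._parent, self._fn = parent, fn

    def partitions(self):
        return self._parent.partitions()

    def preferred_locations(self, p):
        return self._parent.preferred_locations(p)

    def dependencies(self):
        return [Dependency(self._parent, "narrow")]

    def iterator(self, p, parent_iters):
        return (self._fn(x) for x in parent_iters[0])


def compute(rdd, p):
    """Compute partition p by first computing partition p of each parent."""
    parent_iters = [compute(dep.parent, p) for dep in rdd.dependencies()]
    return rdd.iterator(p, parent_iters)


source = SourceRDD([[1, 2], [3, 4]], hosts=["node-a", "node-b"])
doubled = MappedRDD(source, lambda x: x * 2)
second_partition = list(compute(doubled, 1))
```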

Comparison with Distributed Shared Memory


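  • RDDs are restricted to bulk (coarse-grained) writes, whereas distributed shared memory allows fine-grained reads and writes to each memory location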


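  • efficient fault tolerance: lost partitions are recomputed from lineage, so there is no checkpointing overhead
  • immutable nature lets a system mitigate slow nodes by running backup copies of slow tasks
  • bulk operations can be scheduled based on data locality
  • performance degrades gracefully when memory runs short: partitions that do not fit in RAM are stored on disk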
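
Lineage-based recovery can be sketched in a few lines of plain Python. The `source` data, the `lineage` list, and the `cache` dictionary are assumptions made for illustration; the point is that a lost partition is rebuilt by replaying the recorded transformations on its input, while the surviving partitions are left untouched.

```python
source = [[1, 2, 3], [4, 5, 6]]                  # stands in for stable storage
lineage = [lambda x: x + 1, lambda x: x * 10]    # recorded transformations

recomputed = []                                  # which partitions were rebuilt

def compute_partition(p):
    """Replay the lineage on the input of partition p."""
    records = source[p]
    for fn in lineage:
        records = [fn(x) for x in records]
    return records

# Materialize every partition once, as if the dataset had been persisted.
cache = {p: compute_partition(p) for p in range(len(source))}

del cache[1]                                     # simulate losing partition 1

def get_partition(p):
    if p not in cache:
        recomputed.append(p)
        cache[p] = compute_partition(p)          # rebuild only the lost partition
    return cache[p]

recovered = get_partition(1)
untouched = get_partition(0)
```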


Job scheduling

  • build a DAG of stages from the RDD's lineage graph
  • assign tasks to nodes based on data locality using delay scheduling
  • materialize intermediate records on the nodes holding parent partitions for wide dependencies
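
As a simplified sketch of how a lineage graph might be cut into stages, the following Python groups RDDs connected by narrow dependencies and starts a new stage at every wide dependency. The `lineage` graph and `build_stages` are hypothetical; the sketch ignores cached partitions, data locality, and delay scheduling, and assumes each RDD belongs to a single stage.

```python
# Hypothetical lineage graph: RDD name -> list of (parent name, dependency kind).
lineage = {
    "lines": [],
    "pairs": [("lines", "narrow")],
    "counts": [("pairs", "wide")],    # e.g. a reduceByKey that needs a shuffle
    "top": [("counts", "narrow")],
}

def build_stages(graph, final):
    """Return stages in execution order; each stage is a set of RDD names."""
    stages = []
    seen = set()

    def visit(rdd):
        if rdd in seen:
            return
        stage = set()
        pending = [rdd]
        boundaries = []               # parents reached through wide dependencies
        while pending:
            cur = pending.pop()
            if cur in seen:
                continue
            seen.add(cur)
            stage.add(cur)
            for parent, kind in graph[cur]:
                if kind == "narrow":
                    pending.append(parent)     # pipelined within the same stage
                else:
                    boundaries.append(parent)  # belongs to an earlier stage
        for parent in boundaries:
            visit(parent)             # parent stages must run first
        stages.append(stage)

    visit(final)
    return stages

stages = build_stages(lineage, "top")
```

Here the narrow dependencies are pipelined inside one stage, and the wide dependency between `pairs` and `counts` is the only stage boundary.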

Interpreter integration

  • class shipping
  • modified code generation

Memory management

  • LRU eviction policy at the level of RDDs
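
A minimal sketch of LRU eviction applied per RDD instead of per partition, in plain Python. `RddLruCache` is a hypothetical class, capacity is counted in partitions for simplicity, and the rule of never evicting from the RDD that is currently being written is an assumption of this sketch, intended to avoid cycling partitions of one RDD in and out of memory.

```python
from collections import OrderedDict

class RddLruCache:
    def __init__(self, capacity):
        self.capacity = capacity       # maximum number of cached partitions
        self.by_rdd = OrderedDict()    # rdd id -> {partition: data}, LRU first

    def _size(self):
        return sum(len(parts) for parts in self.by_rdd.values())

    def get(self, rdd_id, p):
        parts = self.by_rdd.get(rdd_id)
        if parts is None or p not in parts:
            return None
        self.by_rdd.move_to_end(rdd_id)    # mark the RDD as most recently used
        return parts[p]

    def put(self, rdd_id, p, data):
        """Cache a partition; return False if it could not be stored."""
        if rdd_id in self.by_rdd:
            self.by_rdd.move_to_end(rdd_id)
        while self._size() >= self.capacity:
            victim = next(iter(self.by_rdd), None)
            if victim is None or victim == rdd_id:
                return False               # never evict from the RDD being written
            parts = self.by_rdd[victim]
            del parts[next(iter(parts))]   # drop one partition of the LRU RDD
            if not parts:
                del self.by_rdd[victim]
        self.by_rdd.setdefault(rdd_id, {})[p] = data
        self.by_rdd.move_to_end(rdd_id)
        return True


cache = RddLruCache(capacity=2)
cache.put("A", 0, [1])
cache.put("B", 0, [2])
cache.get("A", 0)                          # A becomes most recently used
stored_c = cache.put("C", 0, [3])          # evicts a partition of B

single = RddLruCache(capacity=1)
single.put("A", 0, [1])
stored_same_rdd = single.put("A", 1, [2])  # refused: would evict A's own data
```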

Support for checkpointing

  • REPLICATE flag to persist; the choice of which data to checkpoint is left to the user