If you haven't read the Hadoop page, please refer to it before continuing.
Why is it needed?
Iterative jobs require a lot of I/O operations, and I/O operations are slow. In MapReduce you are forced to stop at each iteration and write the results to disk before the next iteration can read them back. One way to speed this up is to store intermediate results in memory instead of on disk. Spark sets itself apart by providing in-memory storage in addition to disk storage.
Spark also has the advantage of offering transformations beyond the Map and Reduce structure. This makes it possible to support interactive and streaming workloads instead of just batch operations.
Another feature Spark allows for is lazy evaluation: the transformations you write are not executed immediately; they are only recorded and run once a result is actually needed.
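A minimal sketch of lazy evaluation, assuming a local PySpark installation (the app name "lazy-demo" is just illustrative): nothing below executes until the final action is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")

rdd = sc.parallelize(range(1_000_000))

# Nothing has executed yet: map and filter are only recorded in the RDD's lineage.
pipeline = rdd.map(lambda x: x * 2).filter(lambda x: x > 10)

# The action count() forces the whole pipeline to actually run.
print(pipeline.count())
```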
RDD
The key concept in Apache Spark is the RDD, or Resilient Distributed Dataset, which is an immutable distributed collection. The way it works is that you load your data from a Python collection, and Spark splits the data into partitions, the same way Map would in Hadoop. This means the more partitions, the more parallelism the code will have. Then you apply transformations to the RDD.
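Here is a minimal sketch, again assuming local PySpark (the app name and numbers are illustrative), of loading a Python collection into an RDD and controlling its partition count:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Split a Python collection into 4 partitions; more partitions -> more parallelism.
numbers = sc.parallelize(range(100), numSlices=4)
print(numbers.getNumPartitions())  # 4

# A transformation applied to the RDD runs independently on each partition.
print(numbers.map(lambda x: x + 1).take(5))  # [1, 2, 3, 4, 5]
```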
Four basic Transformations:
- Map: takes each element of an RDD and transforms it
- Filter: takes out elements that do not satisfy a given condition
- Distinct: finds the unique values and reduces the dataset down to those values
- FlatMap: similar to map, but it flattens its output into one seamless list

GroupBy transformation: groups elements by a key so aggregates can be computed.
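A sketch of all five operations on a small RDD, assuming local PySpark (the sample data and app name are made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-demo")

rdd = sc.parallelize([1, 2, 2, 3, 4, 5])

squared   = rdd.map(lambda x: x * x)          # map: transform each element
evens     = rdd.filter(lambda x: x % 2 == 0)  # filter: keep only matching elements
unique    = rdd.distinct()                    # distinct: drop duplicate values
expanded  = rdd.flatMap(lambda x: [x, -x])    # flatMap: one input -> many outputs, flattened
by_parity = rdd.groupBy(lambda x: x % 2)      # groupBy: group elements by a key

print(squared.take(6))    # [1, 4, 4, 9, 16, 25]
print(expanded.take(4))   # [1, -1, 2, -2]
```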
Where does the code run?
- Locally in the driver for most instructions, but depending on the instruction it can run in the driver or on the executors.
- Distributed work happens only on the executors. When code runs in parallel, it is better to run it on the executors, since together they hold more memory and can handle huge partitions of data.

(Remember: transformations happen on the executors, whereas most of the Python code happens in the driver. So be aware of where things are running when performing certain functions.)
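A small illustrative sketch of this split, assuming local PySpark: the lambda passed to map is shipped to the executors, while take brings results back to the driver.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "driver-vs-executor")

data = sc.parallelize(range(10))

# The lambda below runs on the executors, once per element.
doubled = data.map(lambda x: x * 2)

# take() pulls a small sample back to the driver; everything after this line
# is plain Python running in the driver process.
sample = doubled.take(3)
print(sample)  # [0, 2, 4]
```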
Random syntax
- outer: works with the distinct items in the tables (as in an outer join)
- +: works as a union operation on two RDDs
- collect: do not use on large data; it pulls every element back to the driver and can overload it
- take: use instead of collect to retrieve a small sample
- cache: use for RDDs or DataFrames that are used frequently
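A sketch putting several of these together, assuming local PySpark (the sample data is made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "syntax-demo")

a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4, 5])

union_ab = a + b           # + acts as a union of the two RDDs
print(union_ab.take(4))    # take: safe, only a few elements come back to the driver

# union_ab.collect()       # collect would pull EVERYTHING to the driver; avoid on big data

union_ab.cache()           # cache: keep a frequently reused RDD in memory
print(union_ab.distinct().count())  # 5 distinct values: 1, 2, 3, 4, 5
```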