Goals While Moore’s Law has held true for several decades, computer scientists are starting to see the end of its reign. The amount of data that needs to be processed for self-driving cars and other artificial intelligence projects to flourish can no longer be handled simply by shrinking the transistors on a chip, nor by adding more cores to a single machine. Instead, programmers need to start looking into parallel programming. The goal of the programmer is therefore to maximize computation on each core while keeping the system stable, which makes synchronization the biggest issue.
Purpose of Hadoop Hadoop’s purpose is to allow very large amounts of information to be processed by designing algorithms that let a cluster of commodity computers work in parallel.
How is data distributed? In a cluster, data remains in the nodes of the cluster as it is loaded in. HDFS splits large data files into smaller blocks, which can then be managed by different nodes in the cluster.
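As a rough sketch of how this looks from the Java side, the HDFS FileSystem API can report the block size and replication factor a file would get (the path below is hypothetical; the default block size is commonly 128 MB):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask HDFS what block size and replication a file would use.
public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/data/input/bigfile.txt");   // hypothetical path
        System.out.println("Default block size: " + fs.getDefaultBlockSize(p) + " bytes");
        System.out.println("Default replication: " + fs.getDefaultReplication(p));
    }
}
```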
What is the Hadoop Distributed File System?
MAPREDUCE: Hadoop limits the amount of communication between processes, which makes for a more stable framework. Programs must be written in the MapReduce format in order for Hadoop to run them.
Introduction to Hadoop
It is a solution for big data challenges, since too much time would otherwise be spent computing over the sheer amount of data being passed through. Hadoop preprocesses the data by partitioning it, which makes processing faster. The master program decides how to split the data among the worker nodes. During execution, the worker nodes can only communicate with the master node, not with each other. The master also prepares the schedule and determines when all processes have completed. HDFS is a filesystem written on top of the native filesystem, and files are write-once rather than randomly writable.
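A minimal sketch of the write-once model, using the standard HDFS FileSystem API (the file name is made up): a file is created, written, and closed, and after that it can only be read, not rewritten in place.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of HDFS's write-once model: create, write, close.
// There is no API for seeking backwards and overwriting bytes in place.
public class WriteOnce {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/data/notes/example.txt");     // hypothetical path
        try (FSDataOutputStream stream = fs.create(out)) {  // create(Path) overwrites an existing file by default
            stream.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        // From here on the file can only be read (or appended to, where enabled), not modified in place.
    }
}
```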
The master node keeps track of which blocks make up a file and where they are located; this node is referred to as the NameNode, and the information it keeps is the metadata. Flume, a separate tool, collects the data and loads it into Hadoop. A client asks the NameNode for data, and the NameNode returns the list of blocks and where each block is located on the nodes.
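That lookup is visible through the same FileSystem API. A hedged sketch (the file name is again invented) asking which blocks make up a file and which DataNodes hold each one:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask the NameNode for the block layout of a file.
// Each BlockLocation lists the DataNodes holding a replica of that block.
public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input/bigfile.txt")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```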
Note: Hadoop must be a system that can handle partial failure while still keeping its data consistent.
Hadoop is mainly used for programs that are "embarrassingly parallel": the different pieces of work are independent of one another, so the job can be broken down into blocks. These blocks can then run simultaneously, which is where MapReduce comes in handy, but more on that in a second.
The Map portion of MapReduce receives a key-value pair as input. It then outputs a list of key-value pairs.
For example: Input: (2, "The cat sat on the mat.") Output: (The, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1)
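A hedged sketch of what that mapper looks like in Hadoop's Java API (the classic word-count mapper; the class name is our own):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: the input key is the byte offset of the line, the input value is the line itself.
// For each word it emits the intermediate pair (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // e.g. ("cat", 1)
            }
        }
    }
}
```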
The Reducer All values with a particular key are sent to the same reducer; this grouping is known as the shuffle and sort step. The reducer then receives a key together with the list of intermediate values for that key, and outputs a list of output values. This output is written to HDFS.
For example: Input: [the, {1, 1}] Output: [the, 2]
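And the matching reducer, again as a hedged sketch of the word-count example (class name is ours):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives a word plus all of its intermediate 1s
// (e.g. [the, {1, 1}]) and emits the word with its total count, e.g. [the, 2].
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // written to HDFS as part of the job output
    }
}
```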
The hope for MapReduce is that as a program scales up to larger inputs, its running time grows only linearly.
Four Major Components:
- The NameNode (master node) holds the metadata: which blocks make up each file and where they are stored.
- TaskTrackers (workers) are responsible for running and monitoring the map and reduce tasks.
- DataNodes store the HDFS data blocks.
- The JobTracker determines the execution plan.
When is MapReduce useful?
Once again, it is useful for "embarrassingly parallel" problems, or for problems that are complex or require a lot of data. Often you will use MapReduce for batch processing.
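As a rough sketch of how such a batch job is wired together and submitted, reusing the hypothetical WordCountMapper and WordCountReducer classes from above (the input and output paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the batch job, points it at HDFS input/output, and submits it to the cluster.
public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```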