Amazon Elastic MapReduce for Beginners


Before we begin, we should know a few terms: S3, EMR, and bucket.
Amazon S3 stands for Amazon Simple Storage Service.
Amazon EMR stands for Amazon Elastic MapReduce.
A bucket is a container for data stored in S3. We can place files, folders, etc. inside an S3 bucket.
For more detail about these terms, refer to the AWS website.
We will create a bucket to store the result. We will use MapReduce to count the number of times each word is repeated. For this example we will use the sample py file and data that are already available on the AWS website as the Word Count example.

We assume that you have the necessary credentials for logging in to the AWS Management Console. If not, sign up for AWS. [You are required to furnish your credit card details even if you are using a free account.] Once logged in, select Amazon Simple Storage Service.
The following steps show how to set up the bucket.

1. From the Services menu, click on S3.

AWS Management Console
2. It will open the S3 Management Console. Click on Create Bucket.
Amazon S3 Create Bucket

3. Provide a bucket name and select the region. Click on Create.

Create Bucket Wizard Amazon S3

4. In the All Buckets list, you can see your newly created bucket.

Displaying Bucket Amazon S3

This newly created bucket will be used to hold the data. We can also upload data directly into the bucket. We will see how to use MapReduce and store the results in the bucket. To do so, create a cluster using EMR.
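If you prefer a script to the console, steps 1 to 4 can be sketched in Python. This is only a sketch under assumptions: it uses the boto3 library (not part of this walkthrough), it expects AWS credentials to be configured already, and the bucket name and region shown are placeholders.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for boto3's S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region):
    """Create the bucket. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


if __name__ == "__main__":
    # "my-emr-wordcount-bucket" is a placeholder; bucket names must be globally unique.
    print(create_bucket_kwargs("my-emr-wordcount-bucket", "us-west-2"))
```

Running the file only prints the request parameters. Call create_bucket(...) yourself once you are happy with the name and region.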

5. Click again on Services and select EMR (Elastic MapReduce). Click on Create Cluster.

Cluster Amazon EMR

6. In Create Cluster, click on Go to advanced options.

Amazon EMR Cluster Advanced Options

7. Select Streaming Program in the drop-down and click on the Configure button.

8. In the Name field, enter a name. In the Mapper and Reducer fields, enter the mapper program and what the reducer should do.
We will use the sample files from Amazon:
Mapper : s3://elasticmapreduce/samples/wordcount/wordSplitter.py
For details about the py file, refer to How it Works.
Reducer : aggregate
The built-in aggregate reducer adds up the counts for each word.
Input S3 location : s3://elasticmapreduce/samples/wordcount/input
Output S3 location : s3://<bucket-name>/output
Click on Add.
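To get a feel for what the mapper and reducer do together, here is a small local simulation in Python. It is a sketch, not the contents of wordSplitter.py: the word pattern is an assumption, and the "LongValueSum:" prefix is the record format that Hadoop streaming's built-in aggregate reducer sums up, as far as we understand it.

```python
import re
from collections import defaultdict

# Assumed word pattern: a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
PREFIX = "LongValueSum:"


def map_line(line):
    """Emit one 'LongValueSum:<word><TAB>1' record per word in the line."""
    return [PREFIX + word.lower() + "\t1" for word in WORD.findall(line)]


def aggregate(records):
    """Mimic the aggregate reducer for LongValueSum records: sum values per key."""
    totals = defaultdict(int)
    for record in records:
        key, value = record.split("\t")
        totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


if __name__ == "__main__":
    lines = ["the quick brown fox", "The lazy dog and the fox"]
    records = [record for line in lines for record in map_line(line)]
    for word, count in sorted(aggregate(records).items()):
        print(word + "\t" + str(count))
```

On the cluster, Hadoop runs the mapper over many input splits in parallel and groups the records by key before the reducer sees them. The loop above does the same thing in one process.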

 

Amazon EMR Mapper Reducer
9. Select Auto-terminate cluster after the last step is completed. Click on Next.
AutoTerminate Cluster Amazon EMR
10. Click Next [no need to change the hardware settings].
11. Under General Options, for the S3 folder enter s3://<bucket-name>/logs. Click Next.
12. Click on Create Cluster.
Create Cluster Complete Screen
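The cluster setup in steps 5 to 12 can also be sketched with boto3. Treat everything not named in the steps as an assumption: the cluster name, release label, instance types and count, and the default EMR roles are illustrative choices, and the reducer is written as Hadoop streaming's built-in aggregate.

```python
def streaming_step(bucket_name):
    """Describe the word count streaming step as a boto3 Steps entry."""
    return {
        "Name": "Word count",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
                "-mapper", "wordSplitter.py",
                "-reducer", "aggregate",
                "-input", "s3://elasticmapreduce/samples/wordcount/input",
                "-output", "s3://" + bucket_name + "/output",
            ],
        },
    }


def launch_cluster(bucket_name, region):
    """Start an auto-terminating cluster. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so streaming_step works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(
        Name="wordcount-cluster",            # assumed name
        LogUri="s3://" + bucket_name + "/logs",
        ReleaseLabel="emr-5.36.0",           # assumed release
        Instances={
            "MasterInstanceType": "m5.xlarge",   # assumed hardware
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False corresponds to "auto-terminate after the last step".
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[streaming_step(bucket_name)],
        JobFlowRole="EMR_EC2_DefaultRole",   # default roles, assumed to exist
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```

launch_cluster returns the cluster id, which is what the console shows next to the cluster name.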

 

13. It will display the cluster details along with its state. Click on Cluster List at the top.
Provisioning Cluster Amazon EMR

14. The cluster list will display all clusters. Click on the small triangle button to the left of the name of the cluster that we created.

Viewing Cluster Status Amazon EMR

15. It will display the state of the cluster: Provisioning, Running, etc.

Viewing Cluster Status Amazon EMR
16. After 10 to 15 minutes it will display Terminating, All steps completed.
Viewing Cluster Status Terminating Amazon EMR
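Instead of refreshing the console, the cluster state can be polled from a script. The sketch below assumes boto3 and a cluster id obtained elsewhere. The state names are the ones the EMR API reports, which may be worded differently from the labels in the console.

```python
import time

# States after which the cluster will not change any more.
FINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def cluster_state(response):
    """Pull the state string out of a describe_cluster response."""
    return response["Cluster"]["Status"]["State"]


def is_finished(state):
    """Return True once the cluster has reached a final state."""
    return state in FINAL_STATES


def wait_for_cluster(cluster_id, region, poll_seconds=30):
    """Print the state until the cluster finishes. Needs boto3 and credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    emr = boto3.client("emr", region_name=region)
    while True:
        state = cluster_state(emr.describe_cluster(ClusterId=cluster_id))
        print(state)
        if is_finished(state):
            return state
        time.sleep(poll_seconds)
```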

 

17. To view the results, click on Services in the top menu and open S3. Select the bucket.
Viewing Bucket

18. Click on output.

Inside Bucket Amazon S3

19. The _SUCCESS file shows that the MapReduce job worked and results were produced. The results are in the files named part-00000 etc.

Viewing output inside Bucket Amazon S3

To see the result [ final step ]

20. Right-click on part-00000 and select Download. This will download the file.

Downloading Result from Bucket Amazon S3

21. Open the file in an editor [notepad]; it will display the words and their counts in columns.

Output from downloaded file Amazon S3 bucket
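Once a part file is downloaded, a few lines of Python can read it back. This sketch assumes each line holds a word and its count separated by a tab, which is how the columns appear in the editor. The sample text is made up for illustration.

```python
def parse_counts(text):
    """Turn 'word<TAB>count' lines into a dict of word -> count."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        counts[word] = int(count.strip())
    return counts


def top_words(counts, n):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    sample = "a\t3\nfox\t2\nthe\t5\n"  # made-up sample, not real output
    for word, count in top_words(parse_counts(sample), 2):
        print(word, count)
```

For a real file, read it with open("part-00000").read() and pass the text to parse_counts. With several part files, merge the dicts before calling top_words.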
 Done!
