You can use Amazon Elastic MapReduce (EMR) to run batch and streaming data processing jobs. To create an EMR cluster that meets your needs, consider the size, cost, and capabilities of the hardware you want to use. In this post, we discuss how to create an EMR cluster that is scaled with Elastigroup scaling policies and ready for production workloads.
What Is an EMR Cluster?
An EMR cluster is a collection of Amazon EC2 instances that run your workloads in an Elastic MapReduce (EMR) environment, which provides scalability and fault tolerance. In this guide, Elastigroup manages those instances.
An EMR cluster can be created on different hardware configurations, for example c3.8xlarge instances in the us-east-1 region with Spot blocks enabled, or c4.2xlarge instances in the ap-southeast-1 region without Spot blocks.
Step 1: Open The EMR Creation Wizard
- Open the AWS Management Console, and choose EMR.
- Choose Create cluster to launch the EMR Creation Wizard.
- Select EMR cluster as the cluster type, and then choose Next: Configure Master Parameters.
Step 2: Add Elastigroup Description
Add a description to your cluster. A description is optional, but it helps other users identify the cluster and make informed decisions about how to use it. For example, if you are running an EMR cluster in production, you might describe its purpose as follows:
- This is my production EMR cluster where I run clinical trials for our pharmaceutical company.
Step 3: Configure Strategy & Compute
Next, configure the EMR cluster settings. The available options include:
- Execution Type – Choose whether you want to use Spot or On-Demand instances.
- Instance Type – Choose a specific instance type based on your compute needs. For example, for CPU-bound jobs, choose a compute-optimized c5 instance type rather than a general-purpose m3 instance type. For heavier workloads, consider the c4 family or, for GPU workloads, the p2 family.
- Number Of Instances – Set how many instances your task needs; this could be 1 or more than 100.
- Bootstrap Action – Specify a script that Amazon EMR runs on each node as it is provisioned, before applications start, for example to install additional software. You can also set up automatic scaling policies that increase or decrease capacity as needed. See [Amazon EMR Service Limits](https://docs.aws.amazon.com/emr/latest/ug/service-limits-and-constraints.html) for the limits that apply to clusters.
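The options above map fairly directly onto a cluster creation request. As a rough sketch, the following Python builds such a request as a plain dictionary. The field names follow the arguments of boto3's `run_job_flow` call; the helper name `build_cluster_request`, the cluster name, the release label, the S3 path, and the use of the default role names are all illustrative assumptions, not values prescribed by this guide.

```python
import json


def build_cluster_request(name, instance_type, core_count, use_spot,
                          bootstrap_script=None):
    """Build a dict shaped like the arguments to boto3's EMR run_job_flow.

    This is a hypothetical helper for illustration; it only assembles data
    and does not call AWS.
    """
    # Execution Type: Spot or On-Demand for the core nodes.
    market = "SPOT" if use_spot else "ON_DEMAND"
    request = {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: choose the release you need
        "Instances": {
            "InstanceGroups": [
                {
                    # The primary node stays On-Demand in this sketch.
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "Market": market,
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }
    # Bootstrap Action: a script that runs on each node as it is provisioned.
    if bootstrap_script:
        request["BootstrapActions"] = [
            {"Name": "setup",
             "ScriptBootstrapAction": {"Path": bootstrap_script}}
        ]
    return request


request = build_cluster_request(
    "production-emr", "c5.2xlarge", core_count=4, use_spot=True,
    bootstrap_script="s3://my-bucket/bootstrap.sh",  # placeholder path
)
print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, a request like this could be passed as `boto3.client("emr").run_job_flow(**request)`. Review the values before doing so, because the call launches billable instances.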
Step 4: Scaling Policies
- Scaling policies help you scale your cluster by adding or removing instances as the workload changes. Managing scaling is one of the most important tasks in running a dynamic data processing service. You can configure scaling policies to scale your cluster up or down automatically as needed.
- You can define multiple types of scaling policies:
CPU-based: The number of instances is based on the average CPU utilization over a specified period of time.
Memory-based: The number of instances is based on the average memory utilization over a specified period of time.
Step 5: Review and Create
Now that you’ve configured your cluster, review the settings before creating it. You will see something similar to the following:
- Cluster name is set
- Data nodes are not set to use a shared IP address
- The management node is set as the default gateway for all data nodes (this can be useful for automatic failover)
Create a Wrapped EMR Cluster
To create a cluster using the EMR wizard:
- Log in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr/.
- On the dashboard, choose Create cluster.
- For Cluster Type, choose “EMR (Elastic MapReduce)”. This launches instances that can run Hadoop-ecosystem applications such as Spark SQL. You can choose this option even if you are not running any of those applications, because it does not change which instance types are launched or how they are configured once they have been set up with EMR metadata. When you choose between pricing options, keep in mind that Spot instances usually cost less than On-Demand instances but can be interrupted, so jobs that run for long periods and need a capacity guarantee may be better served by On-Demand capacity. Usage fees add up quickly when budgets are tight, especially when many users share a cluster, for example as part of a corporate training program.
Configure Elastigroup’s Scaling Policies
You can set scaling policies for the cluster in Elastigroup. Scaling up and down happens automatically based on these policies, which you can configure in the Elastigroup user interface:
- Scale up the cluster when it’s busy (the average CPU utilization is above a certain threshold).
- Scale down the cluster when it’s idle (the average CPU utilization is below a certain threshold).
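The threshold rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not Elastigroup's implementation: the function name `scaling_decision`, the 75% and 25% thresholds, and the step size of one instance are assumptions chosen for the example.

```python
def scaling_decision(cpu_samples, current_size, min_size, max_size,
                     high=75.0, low=25.0, step=1):
    """Return a target cluster size from a window of CPU utilization samples.

    cpu_samples holds utilization percentages collected over the evaluation
    period. The thresholds and step are illustrative defaults.
    """
    if not cpu_samples:
        return current_size  # no data: leave the cluster unchanged
    average = sum(cpu_samples) / len(cpu_samples)
    if average > high:
        # Busy: add capacity, but never exceed the configured maximum.
        return min(current_size + step, max_size)
    if average < low:
        # Idle: remove capacity, but never drop below the configured minimum.
        return max(current_size - step, min_size)
    return current_size  # within the normal band: no change


print(scaling_decision([80, 90, 85], current_size=4, min_size=2, max_size=10))  # 5
print(scaling_decision([10, 15, 5], current_size=4, min_size=2, max_size=10))   # 3
print(scaling_decision([50, 40, 60], current_size=4, min_size=2, max_size=10))  # 4
```

Keeping a gap between the high and low thresholds helps avoid the cluster repeatedly scaling up and down when utilization hovers around a single value.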
Run a Job in Your Cluster
To run a job in your cluster, you can package your code in a Docker container. You can use any language that is supported by Docker and is able to run on the node types available in the cluster (see the hardware configurations discussed below). To create a container, first install Docker on your local machine.
The following command pulls the image [image_name]/[version]:latest:
$ docker pull [image_name]/[version]:latest
You can create an EMR cluster on a wide range of hardware configurations.
You can configure your cluster to use different hardware configurations depending on your needs. For example, if you want to run advanced analytics jobs, you might consider using GPU-based nodes and SSDs in the cluster. If your workload consists mostly of long-running, CPU-intensive jobs, standard x86 servers with local storage will be sufficient.
We hope that this guide was helpful in getting you up and running with EMR clusters. If you have any questions, please reach out to us.