1. Choosing a scheduler
There are three schedulers to choose from in Yarn: the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler.
The FIFO Scheduler arranges applications into a queue in the order they are submitted: first in, first out. When allocating resources, it serves the application at the head of the queue first; only after that application's requirements have been met does it allocate resources to the next one, and so on.
The FIFO Scheduler is the simplest scheduler and the easiest to understand, and it requires no configuration, but it is not suitable for shared clusters: a large application may occupy all cluster resources and block every other application. On a shared cluster it is better to use the Capacity Scheduler or the Fair Scheduler, both of which allow large and small tasks to obtain a certain amount of system resources at the same time.
The following "Yarn Scheduler Comparison Chart" shows the differences between these schedulers. As can be seen from the chart, in the FIFO scheduler, small tasks will be overwhelmed by large tasks. Task blocking.
Under the Capacity scheduler, a dedicated queue is available for running small tasks, but setting up a queue specifically for small tasks pre-occupies a portion of the cluster's resources, so the execution time of large tasks lags behind what it would be under the FIFO scheduler.
Under the Fair scheduler, we do not need to reserve any system resources in advance; the Fair scheduler dynamically balances system resources across all running jobs. As shown in the figure below, when the first (large) job is submitted it is the only job running and it obtains all cluster resources; when the second (small) task is submitted, the Fair scheduler allocates half of the resources to it, so that the two tasks share the cluster resources fairly.
Note that under the Fair scheduler there is a delay between the submission of the second task and its acquisition of resources, because it must wait for the first task to release the Containers it occupies. After the small task completes, the resources it occupied are released and the large task again obtains all system resources. The net effect is that the Fair scheduler achieves high resource utilization while ensuring that small tasks complete in a timely manner.
[Figure: Yarn scheduler comparison chart]
2. Configuration of the Capacity Scheduler
< p>2.1 Introduction to container scheduling
The Capacity scheduler allows multiple organizations to share an entire cluster, with each organization obtaining part of the cluster's computing capacity. By allocating a dedicated queue to each organization and then allocating certain cluster resources to each queue, the cluster can serve multiple organizations through multiple queues. A queue can also be subdivided, so that multiple members within an organization share the queue's resources. Within a queue, resource scheduling follows a first-in, first-out (FIFO) strategy.
From the chart above, we already know that a single job may not use all of a queue's resources. However, if multiple jobs run in a queue and the queue's resources are sufficient, they are allocated to those jobs. What if the queue's resources are not enough? The Capacity scheduler may still allocate additional resources to the queue; this is the concept of an "elastic queue" (queue elasticity).
In normal operation, the Capacity scheduler does not forcibly release Containers. When a queue's resources are insufficient, the queue can only obtain Container resources released by other queues. Of course, we can set a maximum resource usage for a queue to prevent it from occupying too many idle resources and leaving other queues unable to use them; this is the trade-off to weigh with "elastic queues".
2.2 Capacity scheduler configuration
Suppose we have the following queue hierarchy:
p>
root
├── prod
└── dev
< /p>
├── eng
└── science
Below is a simple Capacity scheduler configuration file, named capacity-scheduler.xml. In it, two subqueues, prod and dev, are defined under the root queue, with capacities of 40% and 60% respectively. Note that a queue is configured through properties of the form yarn.scheduler.capacity.<queue-path>.<property>, where <queue-path> is the queue's path in the inheritance tree, such as root.prod, and <property> is most commonly capacity or maximum-capacity.
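A capacity-scheduler.xml consistent with this description might look as follows (a sketch using the standard Capacity Scheduler properties; the percentages follow the text):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Two subqueues under root: prod and dev -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <!-- dev is further divided into eng and science -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.queues</name>
    <value>eng,science</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>60</value>
  </property>
  <!-- Cap dev so prod always has at least 25% of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>75</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.eng.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.science.capacity</name>
    <value>50</value>
  </property>
</configuration>
```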
We can see that the dev queue is further divided into two sub-queues of equal capacity, eng and science. dev's maximum-capacity is set to 75%, so even if the prod queue is completely idle, dev cannot occupy all cluster resources; in other words, prod always has at least 25% of the cluster available for immediate use. Note that no maximum-capacity is set for the eng and science queues, so jobs in eng or science may use all of the dev queue's resources (up to 75% of the cluster). Similarly, since prod does not set maximum-capacity, it may occupy all the resources of the cluster.
In addition to configuring queues and their capacities, the Capacity scheduler can also control the maximum resources a single user or application can be allocated, how many applications can run simultaneously, queue ACLs, and so on.
2.3 Queue settings
How a queue is specified depends on the application. In MapReduce, for example, we can specify the queue to use through the mapreduce.job.queuename property. If the queue does not exist, we receive an error when submitting the job. If we do not specify a queue, all applications are placed in a queue named default.
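For example, to submit a MapReduce job to the prod queue (the same property can also be passed on the command line with -D mapreduce.job.queuename=prod):

```xml
<!-- In the MapReduce job configuration -->
<property>
  <name>mapreduce.job.queuename</name>
  <value>prod</value>
</property>
```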
Note: for the Capacity scheduler, the queue name must be the last part of the queue path; a full queue path is not recognized. For example, in the configuration above, using prod or eng as the queue name works, but root.dev.eng or dev.eng is invalid.
3. Fair Scheduler configuration
3.1 Fair scheduling
< /p>
The design goal of the Fair scheduler is to allocate resources fairly to all applications (the definition of fairness can be set through parameters). The "Yarn Scheduler Comparison Chart" above shows fair scheduling of two applications in one queue; fair scheduling also works across multiple queues. For example, suppose there are two users, A and B, each with their own queue.
When A starts a job and B has none, A obtains all cluster resources; when B then starts a job, A's job continues to run, and after a while the two jobs each obtain half of the cluster resources. If B now starts a second job while the other jobs are still running, it shares the resources of B's queue with B's first job, that is, B's two jobs each use a quarter of the cluster's resources, while A's job still uses half. The result is that resources end up shared equally between the two users. The process is shown in the figure below:
3.2 Enable Fair Scheduler
Which scheduler is used is configured through the yarn.resourcemanager.scheduler.class parameter in yarn-site.xml; the Capacity Scheduler is used by default. To use the Fair scheduler, set this parameter to the fully qualified name of the FairScheduler class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
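In yarn-site.xml this would be:

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```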
3.3 Queue configuration
The Fair scheduler's configuration file is fair-scheduler.xml on the classpath; this path can be changed through the yarn.scheduler.fair.allocation.file property. If there is no such configuration file, the Fair scheduler adopts the allocation strategy described in Section 3.1: the scheduler automatically creates a queue for each user when they submit their first application, named after the user, and all applications are assigned to their user's queue.
We can configure each queue in the configuration file and, as with the Capacity scheduler, configure queues hierarchically. For example, we can mirror the capacity-scheduler.xml above in a fair-scheduler configuration.
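A sketch of such an allocation file, consistent with the elements discussed below (the placement rules are explained in Section 3.4):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Default policy for queues that do not set one -->
  <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>

  <queue name="prod">
    <weight>40</weight>
    <schedulingPolicy>fifo</schedulingPolicy>
  </queue>

  <queue name="dev">
    <weight>60</weight>
    <queue name="eng" />
    <queue name="science" />
  </queue>

  <queuePlacementPolicy>
    <rule name="specified" create="false" />
    <rule name="primaryGroup" create="false" />
    <rule name="default" queue="dev.eng" />
  </queuePlacementPolicy>
</allocations>
```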
The queue hierarchy is built through nested <queue> elements. All queues are children of the root queue, even if they are not nested inside a root queue element. In this configuration, the dev queue is divided into two queues, eng and science.
Queues in the Fair scheduler have a weight attribute (this weight is the definition of fairness), which the scheduler uses as the basis for fair sharing. In this example, the scheduler considers it fair when cluster resources are allocated to prod and dev in a 40:60 ratio. The eng and science queues define no weight, so dev's resources are split evenly between them. Note that the weight is not a percentage: replacing the 40 and 60 above with 2 and 3 would have the same effect. Also note that queues created automatically per user when no configuration file exists still have a weight, with a value of 1.
Each queue can still have a different scheduling policy. The default scheduling policy for queues can be configured through the top-level <defaultQueueSchedulingPolicy> element; if it is not configured, fair scheduling is used.
Despite its name, the Fair scheduler still supports FIFO scheduling at the queue level. A queue's scheduling policy can be overridden by its <schedulingPolicy> element. In the example above, the prod queue is set to use FIFO scheduling, so tasks submitted to prod execute sequentially in FIFO order. Note that scheduling between prod and dev is still fair, as is scheduling between eng and science.
Although not shown in the configuration above, each queue can also be configured with maximum and minimum resources and a maximum number of running applications.
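For example (an illustrative sketch; minResources, maxResources, and maxRunningApps are standard Fair Scheduler queue elements, but the values here are made up):

```xml
<queue name="prod">
  <weight>40</weight>
  <schedulingPolicy>fifo</schedulingPolicy>
  <!-- illustrative limits, not part of the example above -->
  <minResources>10000 mb,10 vcores</minResources>
  <maxResources>90000 mb,100 vcores</maxResources>
  <maxRunningApps>50</maxRunningApps>
</queue>
```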
3.4 Queue settings
The Fair scheduler uses a rule-based system to determine which queue an application is placed in. In the example above, the <queuePlacementPolicy> element defines a list of rules, each of which is tried in turn until one matches. For example, the first rule, specified, places the application in the queue it requested; if the application does not specify a queue name or the queue does not exist, the rule does not match and the next rule is tried. The primaryGroup rule tries to place the application in a queue named after the user's Unix group; if no such queue exists, rather than creating it, the next rule is tried. If no earlier rule matches, the default rule is triggered and the application is placed in the dev.eng queue.
Of course, we do not have to configure a queuePlacementPolicy at all; without one, the scheduler uses a default policy equivalent to the following:
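```xml
<queuePlacementPolicy>
  <rule name="specified" />
  <rule name="user" />
</queuePlacementPolicy>
```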
The rules above can be summed up in one sentence: unless a queue is explicitly specified, the user name is used as the queue name, and the queue is created if necessary.
There is also a simple placement policy that puts every application into the same queue (default), so that resources are shared equally among applications rather than among users. It is defined as follows:
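```xml
<queuePlacementPolicy>
  <rule name="default" />
</queuePlacementPolicy>
```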
To achieve the same effect without a configuration file, we can set yarn.scheduler.fair.user-as-default-queue=false, so that applications are placed in the default queue rather than in per-user queues. We can additionally set yarn.scheduler.fair.allow-undeclared-pools=false so that users cannot create queues on the fly.
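In yarn-site.xml, these settings would be:

```xml
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.allow-undeclared-pools</name>
  <value>false</value>
</property>
```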
3.5 Preemption
When a job is submitted to an empty queue on a busy cluster, the job is not executed immediately; it blocks until the running jobs release system resources. To make the execution time of submitted jobs more predictable (a waiting timeout can be set), the Fair scheduler supports preemption.
Preemption allows the scheduler to kill Containers belonging to queues that occupy more than their share of resources, so that those resources can be allocated to the queues entitled to them. Note that preemption reduces the cluster's overall efficiency, because the terminated Containers must be re-executed.
Preemption is enabled globally by setting yarn.scheduler.fair.preemption=true. In addition, two timeout parameters control when preemption kicks in (neither is set by default, and at least one must be configured for Containers to be preempted):
- minimum share preemption timeout
- fair share preemption timeout
If a queue does not receive its guaranteed minimum share of resources within the time specified by the minimum share preemption timeout, the scheduler preempts Containers. A default timeout for all queues can be set through the top-level <defaultMinSharePreemptionTimeout> element in the configuration file, and a per-queue timeout through a <minSharePreemptionTimeout> element inside a <queue> element.
Similarly, if a queue does not receive half of its fair share of resources (this ratio is configurable) within the time specified by the fair share preemption timeout, the scheduler preempts Containers. This timeout can be set for all queues through the top-level <defaultFairSharePreemptionTimeout> element and per queue through a <fairSharePreemptionTimeout> element. The ratio mentioned above can be configured through <defaultFairSharePreemptionThreshold> (all queues) and <fairSharePreemptionThreshold> (a single queue); it defaults to 0.5.
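Putting it together, a sketch (timeout values are in seconds and chosen purely for illustration):

```xml
<!-- yarn-site.xml: enable preemption globally -->
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
```

```xml
<!-- fair-scheduler.xml: preemption timeouts -->
<allocations>
  <!-- defaults for all queues -->
  <defaultMinSharePreemptionTimeout>60</defaultMinSharePreemptionTimeout>
  <defaultFairSharePreemptionTimeout>300</defaultFairSharePreemptionTimeout>
  <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>

  <queue name="prod">
    <!-- per-queue overrides -->
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
  </queue>
</allocations>
```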