Load Balancing

Load balancing lets you get the most out of your Job Servers by evenly distributing the load across systems. By default, RunMyJobs uses the Job count of Job Servers to distribute the load. However, this is not very efficient when different Jobs use resources differently, because resource-intensive Jobs count as much as lightweight processes. Fortunately, RunMyJobs offers several other load balancing strategies that can be used on their own or in combination.

Generic Load Balancing Using Queues

Generic Queue-based load balancing can be used for any type of workload. This approach is typically used to keep a weak system from slowing down, or to avoid offloading work onto a server that has a more important role in another Queue.

When you create a Queue or Queue Provider, you can specify an Execution Size that determines the number of concurrent Jobs allowed in the Queue. If the Queue serves several Job Servers, RunMyJobs will send Jobs to the Job Server with the smallest number of Jobs in status Running. You can influence this by setting an Execution Size on the Queue Provider associated with a Job Server. RunMyJobs will ignore the Job Server while the number of its running Jobs equals the Queue Provider's Execution Size.

Note: You can include Jobs in status Waiting in your Execution Size, but these typically do not consume any resources on the remote system.
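
The selection rule described above can be illustrated with a minimal Python sketch. It is not RunMyJobs code: the JobServer class, its field names, and pick_job_server are hypothetical names introduced here, and the sketch assumes that a Job Server without a Queue Provider Execution Size is never skipped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobServer:
    # Hypothetical model of a Job Server as seen by one Queue.
    name: str
    running_jobs: int                     # Jobs currently in status Running
    execution_size: Optional[int] = None  # Queue Provider limit; None means no limit


def pick_job_server(servers: List[JobServer]) -> Optional[JobServer]:
    """Return the eligible Job Server with the fewest running Jobs, or None."""
    eligible = [
        s for s in servers
        if s.execution_size is None or s.running_jobs < s.execution_size
    ]
    if not eligible:
        return None  # every Job Server has reached its Execution Size
    return min(eligible, key=lambda s: s.running_jobs)


servers = [
    JobServer("A", running_jobs=4, execution_size=10),
    JobServer("B", running_jobs=2, execution_size=2),  # at its limit, so ignored
    JobServer("C", running_jobs=3),
]
chosen = pick_job_server(servers)
print(chosen.name if chosen else "no Job Server available")
```

In this sketch, Job Server B has the fewest running Jobs but is skipped because it has reached its Execution Size, so the Job goes to C.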

Using Load Factors

You can use load factors to specify custom metrics for evaluating the load of a Job Server.

  • Multiplier: The relative weight of a specific load factor
  • Threshold: The maximum value allowed for this load factor. When it is reached, the Job Server is set to Overloaded.
  • Monitor Value: The unit to use (CPU time, Page rate, Job Server check value, jmonitor value)
  • Load Threshold: The maximum allowed value of the sum of all load factors (multiplier * Monitor Value). When this is reached, a Job Server goes into status Overloaded.

Note: Load Threshold is Job Server-specific, not load factor-specific. There is only one Load Threshold per Job Server. On Job Servers with only one load factor, set Load Threshold to a value equal to or higher than Threshold, because it is only used when you want to take the combined effects of multiple load factors into account.

Note: The Threshold and Load Threshold values must take the Multiplier values into account.
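
The way these settings interact can be sketched in Python. This is a minimal illustration of the definitions above, not the product's implementation: the LoadFactor class and is_overloaded function are hypothetical, and the sketch assumes that each Threshold is compared with Multiplier * Monitor Value and that "reached" means greater than or equal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadFactor:
    # Hypothetical representation of one load factor on a Job Server.
    multiplier: float     # relative weight of this load factor
    threshold: float      # limit for multiplier * monitor_value
    monitor_value: float  # current reading, for example CPU busy percentage


def is_overloaded(factors: List[LoadFactor], load_threshold: float) -> bool:
    """True if any single weighted factor or their sum reaches its limit."""
    weighted = [f.multiplier * f.monitor_value for f in factors]
    any_factor_hit = any(w >= f.threshold for w, f in zip(weighted, factors))
    return any_factor_hit or sum(weighted) >= load_threshold


# 40 running Jobs weighted by 2, and 60% CPU weighted by 1: 80 + 60 = 140
print(is_overloaded([LoadFactor(2, 150, 40), LoadFactor(1, 90, 60)], 235))
# Same Job count, but CPU at 95% reaches its own Threshold of 90
print(is_overloaded([LoadFactor(2, 150, 40), LoadFactor(1, 90, 95)], 235))
```

The first call prints False and the second prints True, which shows why a single load factor can overload a Job Server well before the combined Load Threshold is reached.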

OS Metric Load Balancing with Monitor Values

There are several options for load-balancing OS processes across Platform Job Servers.

  • Job count load balancing: The default option if no load factors are defined for the Job Servers. The system sends each new Job to the Job Server with the smallest number of running Jobs and Workflows.
  • OS metric load balancing: Uses near real-time monitoring data from the Platform Agents to decide where to run the Job or Workflow.
  • jmonitor load balancing: Uses near real-time monitoring data generated by jmonitor.

Job Count

The default load balancing technique uses the number of concurrent Jobs as the metric for balancing the load.

With this technique, the Load and LoadThreshold monitor nodes of the Job Server have the following meanings:

  • /System/ProcessServer/${PSName}/Performance/Load: The number of Jobs the Job Server is currently processing.
  • /System/ProcessServer/${PSName}/Performance/LoadThreshold: By default, the maximum number of Jobs allowed to run simultaneously.

OS Metric

This type of load balancing requires a Platform Agent on each server and is typically used for Platform Agent workload. It uses load factors and a threshold for each Job Server. Two commonly used load factors are CPU usage (CPUBusy) and page rate (PageRate, the rate at which pages are sent to and retrieved from the swap area). You can also create Job Server checks to define your own criteria.

With this technique, the Load and LoadThreshold monitor nodes of a Job Server have the following meanings:

  • /System/ProcessServer/${process_server}/Performance/Load: Representation of the load factors as configured.
  • /System/ProcessServer/${process_server}/Performance/LoadThreshold: Maximum load specified on the Job Server's Load Factors tab.

Example 1

Assume that two Job Servers accept Jobs that require a specific resource. Job Server prd5.example.com (MSLN_UNIXS5) is more powerful than Job Server prd7.example.com (MSLN_UNIXS7), so you want to run 1.5 times more Jobs on prd5.example.com. Consequently, prd7.example.com should run a maximum of 50 concurrent Jobs, and prd5.example.com should run a maximum of 75 concurrent Jobs.

To accomplish this, specify the following load factors on the Job Servers:

Job Server        Multiplier  Threshold  Monitor Value                                       Load Threshold
prd5.example.com  2           150        /System/ProcessServer/MSLN_UNIXS5/Performance/Load  150
prd7.example.com  3           150        /System/ProcessServer/MSLN_UNIXS7/Performance/Load  150
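
The arithmetic behind these values can be checked with a short Python sketch. It assumes, as in the notes on load factors, that the Threshold is compared with Multiplier * Load; the function name max_concurrent_jobs is introduced here purely for illustration.

```python
def max_concurrent_jobs(multiplier: int, threshold: int) -> int:
    """Job count at which multiplier * Load reaches the Threshold."""
    return threshold // multiplier


prd5 = max_concurrent_jobs(multiplier=2, threshold=150)
prd7 = max_concurrent_jobs(multiplier=3, threshold=150)
print(prd5, prd7, prd5 / prd7)
```

With a shared Threshold of 150, a Multiplier of 2 yields 75 Jobs and a Multiplier of 3 yields 50 Jobs, which gives the intended 1.5:1 ratio.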

Example 2

Assume the same situation as in Example 1, except CPU utilization should not be allowed to go above 90%. Assume also that a Job uses a maximum of 5% CPU time.

Job Server        Multiplier  Threshold  Monitor Value                                          Load Threshold
prd5.example.com  2           150        /System/ProcessServer/MSLN_UNIXS5/Performance/Load     235
prd5.example.com  1           90         /System/ProcessServer/MSLN_UNIXS5/Performance/CPUBusy  235
prd7.example.com  3           150        /System/ProcessServer/MSLN_UNIXS7/Performance/Load     235
prd7.example.com  1           90         /System/ProcessServer/MSLN_UNIXS7/Performance/CPUBusy  235

Note: The Load Threshold is only reached when the sum of the weighted Load and CPUBusy values (Multiplier * Monitor Value) for one Job Server reaches 235.
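
To see which limit takes effect first in Example 2, the following Python sketch computes the Job count at which each limit is reached. It rests on simplifying assumptions that are not part of the product: every Job uses the full 5% CPU, nothing else consumes CPU on the server, and each limit is compared with Multiplier * Monitor Value. The names jobs_to_reach and limits_for are hypothetical.

```python
CPU_PER_JOB = 5  # percent; the worst case stated in Example 2


def jobs_to_reach(limit: int, units_per_job: int) -> int:
    """Smallest Job count whose weighted value reaches the limit (ceiling division)."""
    return (limit + units_per_job - 1) // units_per_job


def limits_for(load_multiplier: int) -> dict:
    """Job counts at which each configured limit of Example 2 is reached."""
    return {
        "load": jobs_to_reach(150, load_multiplier),            # Load factor Threshold
        "cpu": jobs_to_reach(90, 1 * CPU_PER_JOB),              # CPUBusy factor Threshold
        "combined": jobs_to_reach(235, load_multiplier + CPU_PER_JOB),  # Load Threshold
    }


for server, multiplier in (("prd5.example.com", 2), ("prd7.example.com", 3)):
    print(server, limits_for(multiplier))
```

Under these worst-case assumptions, the CPUBusy Threshold is the first limit reached on both Job Servers (at 18 Jobs). With lighter Jobs, the Load factor and the combined Load Threshold become the limiting values instead.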

Example 3

Assume you have two Job Servers, one of which is also used for work other than RunMyJobs Jobs. prd1.example.com is a powerful system that is used by multiple applications, whereas prd3.example.com has been added to the pool to relieve prd1.example.com. Assume you do not want to assign Jobs to Job Server prd1.example.com when its CPU usage reaches 80%, and you want a ratio of 1.5:1 between the two.

To accomplish this, configure the following load factors:

Job Server        Multiplier  Threshold  Monitor Value                                          Load Threshold
prd1.example.com  3           240        /System/ProcessServer/MSLN_UNIXS1/Performance/CPUBusy
prd3.example.com  2           200        /System/ProcessServer/MSLN_UNIXS3/Performance/CPUBusy

This means that 1% CPU usage is worth 3 units on prd1.example.com and only 2 units on prd3.example.com. In theory, with Jobs using the same amount of resources, prd3.example.com will peak sooner than prd1.example.com. As soon as the CPU usage on prd1.example.com reaches 80%, no new Jobs will be dispatched to it.
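
The unit conversion in Example 3 can be sketched as follows. This is an illustration only: the SERVERS mapping and the accepts_new_jobs function are hypothetical, and the sketch assumes that a Job Server stops receiving Jobs as soon as Multiplier * CPUBusy is equal to or greater than its Threshold.

```python
# (Multiplier, Threshold) per Job Server, taken from the Example 3 table
SERVERS = {
    "prd1.example.com": (3, 240),
    "prd3.example.com": (2, 200),
}


def accepts_new_jobs(server: str, cpu_percent: float) -> bool:
    """True while the weighted CPU usage stays below the server's Threshold."""
    multiplier, threshold = SERVERS[server]
    return cpu_percent * multiplier < threshold


for cpu in (79, 80):
    for server in SERVERS:
        print(server, cpu, accepts_new_jobs(server, cpu))
```

At 80% CPU usage, prd1.example.com has 3 * 80 = 240 units and stops accepting Jobs, while prd3.example.com has only 2 * 80 = 160 units and keeps accepting them until its CPU usage reaches 100%.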

jmonitor

You can use the jmonitor command line tool to store monitoring values and display them in the Monitor Nodes screen. Although you are free to use any path, it is highly recommended to store the values under /System/ProcessServer/${process_server}/Custom/. You can create child nodes there to group specific values.