Documents

Distributed processing by make

Tutorial slide (pdf)

Introduction

make is a popular, proven-to-work solution for describing and executing tasks with dependencies among them. Its original purpose was automating software builds, but many people know that it is good for any kind of workflows composed of existing programs. The description of make may be somewhat cryptic for beginners but goes straight to the point. It has a natural fault tolerance model in which when some tasks failed due to machine crashes or other temporary reasons, simply re-executing make will perform the rest of the process rather than all. It essentially uses intermediate files as checkpoints. It has a natural parallel execution model where tasks without dependencies may be run in parallel. Finally it has been used by many programmers so the learning barrier is quite low for many people. It is natural for us to take advantage of them.

There seem many similar tools and research projects out there (SGE's built-in dmake, pgmake, distmake, ...), but this one is different in a couple of important ways.

  • It uses GNU make with no modification. You need no compilation or whatsoever to use this system. All great GNU extensions to the original UNIX make come free, and it continues to be the case in future. GNU make supports -j (parallel execution). Our software is a thin layer built on top of this feature.
  • It uses GXP as the underlying execution engine, so it is easy to distribute jobs across clusters, using a variety of underlying resource access protocols (SSH, RSH, torque, SGE, etc.).

Getting Started

Ingredients

  • you must have GNU make installed. Check it with by typing 'make --version.' If you see a message like this, congratulations and you are done.
 make --version
 GNU Make 3.81
 Copyright (C) 2006  Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.
 There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
 PARTICULAR PURPOSE.
 
 This program built for x86_64-pc-linux-gnu

If the command is not found, 'apt-get install make' will do on debian systems. A package will be available on virtually all Unix platforms. The last resort is to build it from source.

Step 1: Prepare Makefile

  • Write a usual Makefile.

Here is a simple example.

 all : 1.result 2.result 3.result 4.result 5.result
 
 %.result : %.data
     your_program $< $@

Running make with this Makefile will result in executing:

 your_program 1.data 1.result
 your_program 2.data 2.result
 your_program 3.data 3.result
 your_program 4.data 4.result
 your_program 5.data 5.result

in sequence, all in the local host. If you want to run them distributed, follow the following two steps.

Step 2: Grab resources with GXP

I assume you are familiar with how to do this with gxpc use and gxpc explore.

Step 3: Run it

 gxpc make -j

instead of the usual make. This will launch the above five tasks in parallel (by -j option given to the command line), and dispatch them to free resources (by the job scheduler started by the above command). One host will execute a task at a time, so if there are only, say, two hosts, three tasks will wait in a queue and will be dispatched when one becomes free.

Features

  • It does not modify GNU make at all. The above 'gxpc make' command simply passes all command line arguments to GNU make without interpretation. Thus, all nice features and commands lines of GNU make will be available here. The most useful 'gxpc make -n' will print what will be done. GNU make's nice pattern matching syntax and advanced features such as $(shell ...) can also be used.
  • It has a modest monitoring and reporting interface via the web. By default, it periodically writes job and host status to state.html in the current directory. If you point to that file with your browser either locally or remotely, you will see a page like this
  • Internally, it uses GNU make's extension MAKEFILES environment variable with which you can load (include) the contents of another makefile before each makefile is read, and "SHELL=" variable with which we can specify the shell to execute individual commands. It is normally /bin/sh, and one rarely wants anything else. By substituting 'mksh' (bundled with GXP distribution) for the regular /bin/sh, we effectively intercept subprocess invocations of GNU make, and redirect them to our job scheduler. The job scheduler is automatically launched when you type 'gxpc make.'
  • For this kind of systems, the central issue is the scalability limitation in the task manager. The key is to minimize the resource consumption of outstanding jobs, jobs that have been spawned by GNU make but not finished. Due to the way GNU make detects terminations of subprocesses, it seems unavoidable to keep at least one process alive for each outstanding job (I assume GNU make detects termination of children by wait system call or friends, perhaps with SIGCHLD handling. I have not confirmed by source code inspection, but it will the case). Thus it is important to minimize the memory usage of such jobs. In this system, one such process consumes hundreds kilobytes of memory, so in modest systems with 1GB of memory will be able to handle hundreds of outstanding jobs comfortably (in general, there is no point in spawning more tasks than the resources you have, so if you have only 100 processors, it is generally a good idea to limit the number of jobs outstanding at a time to somewhere around 100 (or a little bit larger). This can be done by, say, 'gxpc make -j 120').

Extensions to Makefile syntax

You can just write the regular Makefile and just 'gxpc make -j N' will distribute commands to workers. There may be circumstances, however, where you may want to control the execution of jobs more precisely. For example, it may often be true you want to execute certain commands on your local host. There may be cases where certain types of commands should be subject to more strict concurrency control (e.g., the number of tasks downloading big data files from a central web server must be no less than five). To this end, Makefile syntax is slightly extended, but fortunately, the default GNU make needs no modification.

NOTE Extensions described here are still experimental. The exact syntax and semantics are likely to be changed/extended in future.

In front of a command line in a Makefile, you may write some equal signs, optionally followed by resource specifications. For example, let's say the original rule in a Makefile looks like this:

 a.result : a.src
   my_program a.src a.result

All of the following rules are also valid.

 a.result : a.src
   = my_program a.src a.result
 a.result : a.src
   == my_program a.src a.result
 a.result : a.src
   === my_program a.src a.result
 a.result : a.src
   =(webserver) my_program a.src a.result
 a.result : a.src
   =(webserver:3) my_program a.src a.result
 a.result : a.src
   =(webserver:3,fileserver:2) my_program a.src a.result

The first one ('=') has no effect. It is the same as writing nothing. The second one ('==') says this command must be executed locally. The thrid one has the same effect, but it is slightly faster and is NOT subject to any concurrency control.

The fourth and following ones specify the jobs should be distributed, with resource requirements to limit concurrency of certain types of jobs. 'webserver' and 'fileserver' in the above example can be arbitrary alphabetical strings and need not correspond at all to any actual resource. You may consider that each resource name such as 'webserver' and 'fileserver' is a semaphore, and a job consuming resource x will try to decrement the value of samephore x by one. If a resource name is followed by :n (like 'webserver:3' or 'fileserver:2' above), it specifies the count of the resources the job requires. It is like the job tries to decrement the semaphore by that value. The initial value of a semaphore can be given by the --sem command line option. For example, --sem x:n says there are initially n units of resources named x.

To summarize, when the scheduler dispatches jobs, it limits their concurrencies in such a way that for each resource x, the total counts required by all running jobs are not more than the initial count of x given by a --sem x:n option.

If a job requires a resource x and you omit its initial value, it is assumed to be infinite. The effect is simply to ignore the requirement on x altogether (this spec is likely to be reconsidered in future).

Concurrency Control

  • At any moment, there will at most one job running with '==' in front of a command line. This is to avoid accidental overload of the local host. You may set the number of simultaneously runnable jobs at the localhost by giving --sem local:n in the command line.
  • At any moment, each remote resource will have at most one job running on it.
  • At any moment, the concurrency of jobs is subject to the above resource requirement specifications.

Command Line Reference

Synopsis

 gxpc make [ '''GNU make options''' ] [-- '''job scheduler options'' ]

GNU make options are whatever options you like to pass to the underlying GNU make. You can specify -j N, -k, -n, among other things.

Following '--', the rest of the lines are not passed to GNU make and given to the job scheduler. Following the list of job scheduler options. You can see them by:

  gxpc make -- --help

Options

Among others, you may find

  gxpc make -n
  gxpc make -- -n

particularly useful. The former simply lets GNU make print what will be executed. Tha latter, on the other hand, lets GNU make execute commands, and our job scheduler pretends they all succeed instantly. As a side effect, you state.html will be generated. This way, you can see the list of commands that will be executed in the nice html table.

--emulate / -n
pretend all commands finished successfully. useful to generate and see state.html file.
--sem SEM_SPECS
specify concurrency constraints for various jobs. see SEM_SPECS below.
--state FILE
write job/worker status to FILE (default: "state.html")
--log FILE
write log to FILE (default: "xmake.log")
--update_interval_1 T1
specify the inteval at which status file is updated in seconds (default 10.0).
--update_interval_2 T2
specify the inteval at which status file is updated in seconds (default 20.0). more precisely, status file gets updated either when (1) T1 seconds has passed since the last update and there are at least 10 state changes not reflected in the file, or (2) T2 seconds has passed since the last update and there is at least 1 state change not reflected in the file.
--auto_update_interval T
specify the interval at which status file is automatically reloaded by the browser (<meta http-equiv="refresh" content=T> gets inserted in the status file).
--make GNU_MAKE_PATH
specify the path of GNU make (default : "make")
--gxpc GXPC_PATH
specify the path of gxpc (default : "gxpc")
--qlen N
specify the backlog of the socket the scheduler listens (default : 1000). you might want to increase this value when there are more than 1000 workers.

The syntax of SEM_SPECS is as follows

  • SEM_SPECS ::= SEM_SPEC[,SEM_SPEC]*
  • SEM_SPEC ::= id | id:num
  • id is an arbtrary name (without spaces). num is an integer. when num is omitted, 1 is assumed. id:num says there are num units of resources id. You specify which jobs use which resources and how much, by writing =(id,id,...,id) before the command line in the Makefile. See section Concurrency Control above.

Current Limitation and Performance

General cautions

  • When you plan to have high parallelism (300 or above), make sure you have enough memory on the master host. While this system tries to minimize memory consumption of outstanding jobs, it is always good idea to watch memory usage by 'vmstat 1' or alike. If the swap rates (si and so) start growing (on most healthy systems, they should be zero), kill the 'gxpc make' command (just type Ctrl-C and all remote jobs should get killed too).
  • It does no harm when you use powerful machines for the master host (on which you type 'gxpc make'). Experimental results below suggest that it is good to use multi-core machines as the master.
  • In the current implementation, keep the maximum parallelism not more than 1,000. This is because in default Linux kernels, a single process can have 1,024 open file descriptors, and the scheduler consumes a connection for each outstanding process. It is our future plan to remove this limitation.
  • While this system has a scalability limitation just described, in real applications, a severer limitation might exist outside this system. Most typically, NFS becomes a bottleneck for applications that write a lot of data to NFS servers. Avoid such bottlenecks where possible (e.g., write data to local disks).

I ran a simple test to assess performance of the system.

Dispatching null tasks to varying number of processors.

This test runs many trivial tasks that finish immediately, distributes them to varying numbers of processors, and records the total elapsed time of the make command (measured by 'time gxpc make -j'). Specifically, Makefile is this.

 all : $(shell for i in `seq 1 $(N)`; do echo $$i.result ; done)
 
 %.result :
       = hostname

(It defines N targets 1.result, 2.result, ..., N.result and execute a useless hostname command for each X.result). As a result, N instances of hostname command are distributed. We ran this test using three types of processors as the master (GNU make and our scheduler). There are always about 1200 dispatching targets (workers). Each represents a single CPU core. With various values of N, the elapsed time of the whole make command was like this.

n.png

What does this graph tell us?

It suggests that dispatching a single task takes 40ms to 180ms depending on the master processor. It was about 40ms on dual quad core Xeon E5410, 80ms on dual dual core Xeon 5140, and 180ms on single dual core Core2Duo 6400. We refer to thse numbers as '''the master overehad per task.''' Dispatching N tasks takes N * (the master over head per task) on the master, no matter how small tasks are and no matter how many worker processors you have. Note that all master overheads are paid by the master and thus serialized. The curve gives the lower limit to finish N tasks. These numbers also suggest that the master overhead is almost reciprocal of the number of cores in the master. It is understandable because the work of the master is distributed to many processes (spawned by make) and is thus highly concurrent.

Dispatching tasks of varying granularities to 1,000 processors.

While the master overhead per task is clearly a scalability limitation, the above number suggests that for processes taking sufficiently long time (compared to N * the master overhead per task), the relative overhead may not be too significant. For example, if a single process takes 10mins (600sec) to finish, dispatching 1,000 such processes to 1,000 processors will finish in 1000 * (the master overhead per task) + 600sec = 780sec, or with 600/780 = 77% efficiency on the Core2Duo 6400 master. The system is useful in regimes where N * (the master overhead per task) is small compared to typical execution time of a single process.

This test performs a measurement with this perspective. Makefile is this.

 ALL=$(shell for i in `seq 1 1000`; do echo $$i.result ; done)
 all : $(ALL)
 
 %.result :
       = ~/cpu.py $(T)

'~/cpu.py T' is a simple command that just consumes CPU T seconds and quits.

 #!/usr/bin/env python
 
 import sys,time
 
 def main():
     start_t = time.time()
     end_t = start_t + 10.0
     if len(sys.argv) > 1:
         end_t = start_t + int(sys.argv[1])
     while time.time() < end_t:
         pass
 
 main()

We run 'gxpc make -j T=XXX' with XXX from 10 to 120 and record its elapsed the time. There are always about 1200 dispatching targets, but the number of tasks is kept 1000 (due to the file descriptor limitations mentioned above). The result is this.

t.png

What does this graph tell us?

The elapsed time is clearly around 40/80/180sec + T (depending on the master). It confirms that the elapsed time really consists of a constant overhead and the actual job execution time (T).

When T=120sec, the elapsed time is 300sec on the Core2Duo master and the efficiency is a mere 40%. The graph, however, suggests that if T=600sec (10mins), the efficiency will be the predicted 77%.


添付ファイル: filen.png 1674件 [詳細] filet.png 1698件 [詳細]

トップ   編集 バックアップ 添付 複製 名前変更 リロード   新規 一覧 最終更新   最終更新のRSS
© 2007 Taura lab. All Rights Reserved.
Powered by Pukiwiki, styled by Kei_ / Last-modified: 2016-07-12 (火) 15:12:44 (317d)