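Pretty handy! Trying out easy parallel processing with ppss
I wanted a low-effort way to process things in parallel.
GNU Parallel is handy too, but ppss is just a shell script, so you only have to drop it in place and run it.
So I gave it a try.
Install
That's all there is to it.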
$ cd ~/bin
$ wget http://ppss.googlecode.com/files/ppss-2.85.tgz
$ tar zxvf ppss-2.85.tgz
$ rm ppss-2.85.tgz
Options
The options used in this post:
# -d  target directory
# -f  list file
# -c  command to execute
# -p  number of parallel processes
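If you are not sure what to pass to -p, here is a rough sketch for picking a number. `pick_workers` is a hypothetical helper of my own, not something PPSS ships, and "half of the online CPUs" is an arbitrary assumption rather than a recommendation from the PPSS docs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PPSS): suggest a value for -p.
# Assumption: use half of the online CPUs, with a floor of 1.
pick_workers() {
    local cpus n
    cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cpus=1
    case $cpus in
        ''|*[!0-9]*) cpus=1 ;;   # fall back if the value is not a number
    esac
    n=$(( cpus / 2 ))
    if (( n < 1 )); then
        n=1
    fi
    echo "$n"
}

echo "suggested: -p $(pick_workers)"
```

You could then write something like `-p "$(pick_workers)"` instead of a fixed number.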
Output of ppss -h (short help; --help prints the extended version)
kuwano@kuwano03:~/bin$ ./ppss -h

|P|P|S|S| Distributed Parallel Processing Shell Script 2.85

PPSS is a Bash shell script that executes commands in parallel on a set
of items, such as files in a directory, or lines in a file. The purpose
of PPSS is to make it simple to benefit from multiple CPUs or CPU cores.

This short summary only discusses options for stand-alone mode. For a
full listing of all options, run PPSS with the options --help

Usage ./ppss [ options ]

--command | -c      Command to execute. Syntax: '<command> ' including the single quotes.
                    Example: -c 'ls -alh '. It is also possible to specify where an item
                    must be inserted: 'cp "$ITEM" /somedir'.

--sourcedir | -d    Directory that contains files that must be processed. Individual files
                    are fed as an argument to the command that has been specified with -c.

--sourcefile | -f   Each single line of the supplied file will be fed as an item to the
                    command that has been specified with -c. Read input from stdin with -f -

--config | -C       If the mode is config, a config file with the specified name will be
                    generated based on all the options specified. In the other modes.
                    this option will result in PPSS reading the config file and start
                    processing items based on the settings of this file.

--disable-ht | -j   Disable hyper threading. Is enabled by default.

--log | -l          Sets the name of the log file. The default is ppss-log.txt.

--processes | -p    Start the specified number of processes. Ignore the number of
                    available CPUs.

--quiet | -q        Shows no output except for a progress indication using percents.

--delay | -D        Adds an initial random delay to the start of all parallel jobs to
                    spread the load. The delay (seconds) is only used at the start of
                    all 'threads'.

--daemon            Daemon mode. Do not exit after items are professed, but keep looking
                    for new items and process them. Read the manual how to use this!
                    See --help for important additional options regarding daemon mode.

--no-recursion|-r   By default, recursion of directories is enabled when the -d option is
                    used. If this is not prefered, this can be disabled with this option
                    Only files within the specified directory will be processed.

--email | -e        PPSS sends an e-mail if PPSS has finished. It is also used if
                    processing of an item has failed (configurable, see -h).

--help              Extended help, including options for distributed mode and Amazon EC2.

Example: encoding some wav files to mp3 using lame:

./ppss -d /path/to/wavfiles -c 'lame '

Extended usage: use --help
Usage examples
Gzip every file under a given directory, 2 at a time
- Preparation
$ mkdir -p ~/test; for i in `seq 1 100` ; do echo "test $i" > ~/test/test$i ;done
- Run
kuwano@kuwano03:~$ ~/bin/ppss -d ~/test/ -c "gzip " -p 2
12月 22 16:20:09:
12月 22 16:20:09: =========================================================
12月 22 16:20:09: |P|P|S|S|
12月 22 16:20:09: Distributed Parallel Processing Shell Script vers. 2.85
12月 22 16:20:09: =========================================================
12月 22 16:20:09: Hostname: kuwano03
12月 22 16:20:09: ---------------------------------------------------------
12月 22 16:20:10: CPU: Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
12月 22 16:20:10: Starting 2 parallel workers.
12月 22 16:20:10: ---------------------------------------------------------
12月 22 16:20:26: One job is remaining.
12月 22 16:20:26: Total processing time (hh:mm:ss): 00:00:17
12月 22 16:20:26: Finished. Consult ppss_dir/job_log for job output.
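To double-check that the run above really compressed everything, a quick count of leftover files is enough. This is a minimal sketch; `count_uncompressed` is a hypothetical helper of my own, not part of PPSS, and it simply assumes that a processed file ends in .gz.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PPSS): count regular files under a
# directory whose names do not end in .gz, i.e. files gzip apparently
# did not process.
count_uncompressed() {
    local dir=$1
    # tr strips the padding some wc implementations add around the number.
    find "$dir" -type f ! -name '*.gz' | wc -l | tr -d ' '
}

# Usage after the gzip run; 0 means nothing was left uncompressed.
# count_uncompressed ~/test
```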
Fetch the URLs in a list file with wget, 2 at a time, and save them to ~/test
- Preparation
$ cat <<'EOF' >urllist.txt
http://www.ameba.jp/
http://now.ameba.jp/
http://d.hatena.ne.jp/akuwano
EOF
- Run
kuwano@kuwano03:~$ ~/bin/ppss -f urllist.txt -c 'wget -q -P ~/test/ "$ITEM"' -p 2
12月 22 16:30:58:
12月 22 16:30:58: =========================================================
12月 22 16:30:58: |P|P|S|S|
12月 22 16:30:58: Distributed Parallel Processing Shell Script vers. 2.85
12月 22 16:30:58: =========================================================
12月 22 16:30:58: Hostname: kuwano03
12月 22 16:30:58: ---------------------------------------------------------
12月 22 16:30:58: CPU: Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz
12月 22 16:30:58: Starting 2 parallel workers.
12月 22 16:30:58: ---------------------------------------------------------
12月 22 16:30:59: One job is remaining.
12月 22 16:31:00: Total processing time (hh:mm:ss): 00:00:02
12月 22 16:31:00: Finished. Consult ppss_dir/job_log for job output.
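$ITEM receives each value in turn, either a file found under the -d directory or a line from the -f list file, so put it wherever the item belongs in the command you actually want to run. If the command does not mention $ITEM, as with "gzip " earlier, the item is passed as the argument at the end.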
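To make the placement rule concrete, here is a minimal sketch in plain bash. `run_with_item` is a hypothetical helper written for illustration, not PPSS's actual code; it only imitates the behaviour described in the help text (append the item when the command has no $ITEM, substitute it in place when it does).

```shell
#!/usr/bin/env bash
# Hypothetical helper: imitates the rule from the PPSS help text.
# This is NOT how PPSS is implemented, just an illustration.
run_with_item() {
    local template=$1 ITEM=$2
    if [[ $template == *'$ITEM'* ]]; then
        # The template says where the item goes; eval expands "$ITEM" there.
        eval "$template"
    else
        # No placeholder: pass the item as the last argument.
        eval "$template"' "$ITEM"'
    fi
}

run_with_item 'echo got ' 'a b.txt'                 # prints: got a b.txt
run_with_item 'echo cp "$ITEM" /somedir' 'a b.txt'  # prints: cp a b.txt /somedir
```

Quoting "$ITEM" inside the single-quoted command, as in the wget example, keeps items containing spaces in one piece.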
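Logs
By default a working directory, ./ppss_dir, is created in the current directory, and the output of each job is saved under ppss_dir/job_log (the path printed on the last line of each run), so check the results there.
The -l option changes the name of the log file (the default is ppss-log.txt).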
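For skimming the job output after a run, something like the following works. It is only a sketch: `show_job_logs` is a hypothetical helper of my own, and it assumes the job log directory holds plain-text files, which I have not checked beyond the path PPSS prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PPSS): print the last few lines of
# every file in a job log directory.
# Assumption: the directory contains plain-text files.
show_job_logs() {
    local dir=${1:-ppss_dir/job_log} f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # also skips an unmatched glob
        printf '== %s ==\n' "${f##*/}"   # file name without the directory
        tail -n 5 "$f"
    done
}

# Usage, from the directory where ppss was run:
# show_job_logs ppss_dir/job_log
```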
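Misc
Apparently it can also distribute work across servers, or run as a daemon.
I haven't tried either yet, but they might be handy. (No promises, lol)
It's pretty handy and easy to set up, so it's worth dropping in quickly whenever you have a small job to run.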