Manual Distributed Training Example

Manual distributed training uses the command line to spread CopyCat's training load across one or more networked machines, freeing up your local machine so you can keep working. Distributed training works between two or more macOS and Linux machines, or between two or more Windows machines running IPv6.

Check Network Connections Between Machines

You can use a simple network tool such as iperf to check connectivity between machines before setting up distributed training. See https://iperf.fr/iperf-download.php for details on iperf and how to install it on your OS. If you install iperf3, substitute iperf3 for iperf in the commands below.

To test whether two machines can share data, install iperf on both machines to set up a simple client-server connection, and then run the following commands from the terminal or command prompt:

On Machine 1 with the IP address <IP-address-1>, launch a server on port 30000:

iperf -s -i 10 -p 30000

On Machine 2 with the IP address <IP-address-2>, connect a client to the server on port 30000:

iperf -i 10 -c <IP-address-1> -p 30000

If the client connects and reports transfer rates, the machines can communicate and are ready to run distributed training.
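
If installing iperf isn't an option, a plain TCP check with netcat (nc) is a rough substitute for confirming that the port is reachable, although it doesn't measure bandwidth. This is a minimal sketch, assuming nc is installed on both machines; depending on your netcat variant, the listener may need nc -l -p 30000 instead.

On Machine 1, listen on port 30000:

nc -l 30000

On Machine 2, check that the port is reachable:

nc -zv <IP-address-1> 30000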

Distribute Training Between Machines

Distributed training in CopyCat is controlled through the following environment variables:

  • COPYCAT_MAIN_ADDR: the main address. Process 0 runs on this IP address and only this process saves the contact sheets, graphs, and training model checkpoints.
  • COPYCAT_MAIN_PORT: the main port for process 0.
  • COPYCAT_RANK: the current process rank relative to other processes. COPYCAT_RANK must be < COPYCAT_WORLD_SIZE.
  • COPYCAT_WORLD_SIZE: the total number of processes between which training is distributed.
    For example, if COPYCAT_WORLD_SIZE = 4, four processes are launched on the same Nuke script and CopyCat node with the same COPYCAT_MAIN_ADDR and COPYCAT_MAIN_PORT, and with COPYCAT_RANK being 0, 1, 2, and 3.
  • COPYCAT_SYNC_INTERVAL (optional): the interval, in steps, at which gradients are shared between processes. By default, synchronization happens every step. You can increase this value to reduce synchronization overhead on slow or high-latency networks.
  • COPYCAT_LOCAL_ADDR (optional, but recommended for speed): the IP address of the local machine on which you are running distributed training. Set COPYCAT_LOCAL_ADDR to make sure distributed training traffic runs over your fastest network interface.

Note:  If you omit one of the four required environment variables, CopyCat runs in non-distributed mode and Nuke is locked until training completes or is stopped.
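
For reference, this is what a single launch line looks like with all six variables set. This is only a sketch assembled from the examples below: COPYCAT_SYNC_INTERVAL=4 is an arbitrary illustration value, and the two optional variables can be omitted entirely.

COPYCAT_MAIN_ADDR=<IP-address-1> COPYCAT_MAIN_PORT=30000 COPYCAT_LOCAL_ADDR=<IP-address-1> COPYCAT_RANK=0 COPYCAT_WORLD_SIZE=4 COPYCAT_SYNC_INTERVAL=4 ./Nuke14.1 -F 1 -X CopyCat1 --gpu <your-nuke-script>.nk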

Run Distributed Training on Two Linux/Mac Machines

This example uses the placeholders <IP-address-1> and <IP-address-2> for the machines' IP addresses, and assumes that the CopyCat node in <your-nuke-script>.nk is named CopyCat1.

On Machine 1 (<IP-address-1>):

COPYCAT_MAIN_ADDR=<IP-address-1> COPYCAT_MAIN_PORT=30000 COPYCAT_LOCAL_ADDR=<IP-address-1> COPYCAT_RANK=0 COPYCAT_WORLD_SIZE=2 ./Nuke14.1 -F 1 -X CopyCat1 --gpu <your-nuke-script>.nk

On Machine 2 (<IP-address-2>):

COPYCAT_MAIN_ADDR=<IP-address-1> COPYCAT_MAIN_PORT=30000 COPYCAT_LOCAL_ADDR=<IP-address-2> COPYCAT_RANK=1 COPYCAT_WORLD_SIZE=2 ./Nuke14.1 -F 1 -X CopyCat1 --gpu <your-nuke-script>.nk
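
The two launch lines differ only in COPYCAT_LOCAL_ADDR and COPYCAT_RANK, so a small wrapper script can cut down on typing. This is a hypothetical sketch (launch_copycat.sh is not part of Nuke); copy it to each machine and pass that machine's rank and IP address:

#!/bin/bash
# launch_copycat.sh -- hypothetical helper, run once per machine.
# Usage: ./launch_copycat.sh <rank> <local-ip-address>
COPYCAT_MAIN_ADDR=<IP-address-1> \
COPYCAT_MAIN_PORT=30000 \
COPYCAT_LOCAL_ADDR="$2" \
COPYCAT_RANK="$1" \
COPYCAT_WORLD_SIZE=2 \
./Nuke14.1 -F 1 -X CopyCat1 --gpu <your-nuke-script>.nk

For example, run ./launch_copycat.sh 0 <IP-address-1> on Machine 1 and ./launch_copycat.sh 1 <IP-address-2> on Machine 2.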

Run Distributed Training on Two Windows Machines

Running distributed training on Windows is the same as on Linux/Mac, but you may need to turn off the Windows Firewall on both machines. You can use either IPv6 or IPv4 addresses for COPYCAT_MAIN_ADDR and COPYCAT_LOCAL_ADDR, but be aware that IPv4 addresses can display errors on the command line.
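
Note that the VAR=value prefix syntax used in the launch lines above is a Unix shell feature. In a Windows Command Prompt, set the variables first and then launch Nuke. This is a sketch for Machine 1, assuming the executable is named Nuke14.1.exe and that you run it from the Nuke install directory:

rem Sketch for Machine 1 (rank 0); the executable name Nuke14.1.exe is an assumption.
set COPYCAT_MAIN_ADDR=<IP-address-1>
set COPYCAT_MAIN_PORT=30000
set COPYCAT_LOCAL_ADDR=<IP-address-1>
set COPYCAT_RANK=0
set COPYCAT_WORLD_SIZE=2
Nuke14.1.exe -F 1 -X CopyCat1 --gpu <your-nuke-script>.nk

On Machine 2, repeat the same commands with COPYCAT_LOCAL_ADDR=<IP-address-2> and COPYCAT_RANK=1.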