Manual Distributed Training Example

Manual distributed training uses the command line to spread CopyCat's training load across one or more networked machines, freeing up your local machine so you can keep working. Distributed training works between two or more macOS and Linux machines, or between two or more Windows machines running IPv6.

Check Network Connections Between Machines

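You can use a simple network tool such as iperf3 to check connectivity between machines before setting up distributed training. See https://iperf.fr/iperf-download.php for details on iperf and how to install it on your operating system. The commands below use the iperf executable name; if your installation provides iperf3 instead, substitute that name.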

To test whether two machines can share data, install iperf on both machines to set up a dummy client-server system, and then run the following commands from the terminal or command prompt:

On Machine 1 with the IP address <IP-address-1>, launch a server on port 30000:

iperf -s -i 10 -p 30000

On Machine 2 with the IP address <IP-address-2>, connect a client to the server on port 30000:

iperf -i 10 -c <IP-address-1> -p 30000

If the client connects and both sides report transfer statistics, the machines can exchange data and can run distributed training.
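As a rough complement to the iperf test, the following bash sketch checks only whether a TCP port on another machine accepts connections. It relies on bash's built-in /dev/tcp redirection, so it assumes bash rather than a plain POSIX shell, and the function name check_port is hypothetical. It does not measure throughput, and something (for example the iperf server on Machine 1) must already be listening on the port.

```shell
#!/usr/bin/env bash

# check_port HOST PORT
# Returns 0 if a TCP connection to HOST:PORT succeeds, 1 if it fails,
# and 2 if the arguments are invalid.
check_port() {
  local host="$1" port="$2"

  # Treat an empty or non-numeric port as 0 so the range check rejects it.
  case "$port" in
    ''|*[!0-9]*) port=0 ;;
  esac
  if [ -z "$host" ] || [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
    echo "usage: check_port <host> <port 1-65535>" >&2
    return 2
  fi

  # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection.
  # The subshell closes the descriptor again as soon as it exits.
  if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
    echo "reachable: ${host}:${port}"
  else
    echo "NOT reachable: ${host}:${port}"
    return 1
  fi
}
```

For example, with the iperf server running on Machine 1, you could run check_port <IP-address-1> 30000 on Machine 2. If a firewall silently drops the traffic, the attempt may take some time before it reports a failure.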

Distribute Training Between Machines

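Distributed training in CopyCat is controlled through environment variables. The examples in this section set COPYCAT_MAIN_ADDR, COPYCAT_MAIN_PORT, COPYCAT_LOCAL_ADDR, COPYCAT_RANK, and COPYCAT_WORLD_SIZE.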

Note: If you omit any of the four required environment variables, CopyCat runs in non-distributed mode and Nuke is locked until training completes or is stopped.

Note: If you encounter network address errors on Windows, make sure there are no trailing spaces directly after the IP address in the environment variable.
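Both problems described in the notes above can be caught before launching Nuke. The following bash sketch reports variables that are unset, empty, or padded with whitespace. The function name check_copycat_env is hypothetical, and you pass in the variable names yourself; this section does not state which four variables CopyCat treats as required, so the usage example simply checks the five that appear in the launch commands. The sketch assumes bash (Linux, macOS, or a bash shell on Windows) and illustrates the idea rather than a Foundry-provided tool.

```shell
#!/usr/bin/env bash

# check_copycat_env VAR_NAME...
# Prints one line per problem and returns 1 if any variable is unset,
# empty, or has leading or trailing whitespace in its value.
check_copycat_env() {
  local name value status=0
  for name in "$@"; do
    # Indirect expansion: read the variable whose name is stored in $name.
    value="${!name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      status=1
      continue
    fi
    case "$value" in
      [[:space:]]*|*[[:space:]])
        echo "whitespace around value: $name"
        status=1
        ;;
    esac
  done
  return "$status"
}
```

A typical call, made in the same shell session that exports the variables, is check_copycat_env COPYCAT_MAIN_ADDR COPYCAT_MAIN_PORT COPYCAT_LOCAL_ADDR COPYCAT_RANK COPYCAT_WORLD_SIZE. No output and an exit status of 0 mean that every listed variable is set and free of surrounding whitespace.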

Run Distributed Training on Two Linux/Mac Machines

This example uses the placeholders <IP-address-1> and <IP-address-2> for the IP addresses of the two machines, and assumes that the CopyCat node in <your-nuke-script>.nk is named CopyCat1.

On Machine 1 (<IP-address-1>):

COPYCAT_MAIN_ADDR=<IP-address-1> COPYCAT_MAIN_PORT=30000 COPYCAT_LOCAL_ADDR=<IP-address-1> COPYCAT_RANK=0 COPYCAT_WORLD_SIZE=2 ./Nuke14.1 -F 1 -X CopyCat1 --gpu <your-nuke-script>.nk

On Machine 2 (<IP-address-2>):

COPYCAT_MAIN_ADDR=<IP-address-1> COPYCAT_MAIN_PORT=30000 COPYCAT_LOCAL_ADDR=<IP-address-2> COPYCAT_RANK=1 COPYCAT_WORLD_SIZE=2 ./Nuke14.1 -F 1 -X CopyCat1 --gpu <your-nuke-script>.nk
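The two commands above differ only in COPYCAT_LOCAL_ADDR and COPYCAT_RANK, so with more machines it can help to generate them rather than type them. The following bash sketch prints the launch command for one machine so that you can review it before running it. The function name copycat_cmd is hypothetical, the Nuke14.1 executable name and flags are copied from the example above, and the rule that ranks run from 0 to COPYCAT_WORLD_SIZE minus 1 is an assumption based on that two-machine example.

```shell
#!/usr/bin/env bash

# copycat_cmd RANK WORLD_SIZE MAIN_ADDR LOCAL_ADDR SCRIPT [NODE] [PORT]
# Prints the launch command for one machine. NODE defaults to CopyCat1
# and PORT to 30000, matching the example above.
copycat_cmd() {
  if [ "$#" -lt 5 ]; then
    echo "usage: copycat_cmd RANK WORLD_SIZE MAIN_ADDR LOCAL_ADDR SCRIPT [NODE] [PORT]" >&2
    return 2
  fi
  local rank="$1" world="$2" main_addr="$3" local_addr="$4" script="$5"
  local node="${6:-CopyCat1}" port="${7:-30000}"
  local n

  # RANK and WORLD_SIZE must be plain non-negative integers.
  for n in "$rank" "$world"; do
    case "$n" in
      ''|*[!0-9]*) echo "not a number: '$n'" >&2; return 2 ;;
    esac
  done

  # Assumption: valid ranks are 0 .. WORLD_SIZE - 1.
  if [ "$rank" -ge "$world" ]; then
    echo "rank $rank is out of range for world size $world" >&2
    return 1
  fi

  printf 'COPYCAT_MAIN_ADDR=%s COPYCAT_MAIN_PORT=%s COPYCAT_LOCAL_ADDR=%s COPYCAT_RANK=%s COPYCAT_WORLD_SIZE=%s ./Nuke14.1 -F 1 -X %s --gpu %s\n' \
    "$main_addr" "$port" "$local_addr" "$rank" "$world" "$node" "$script"
}
```

For Machine 2 in the example, you would call copycat_cmd 1 2 <IP-address-1> <IP-address-2> <your-nuke-script>.nk and paste the printed line into the terminal on that machine. The sketch does not quote its arguments, so a script path containing spaces needs quoting by hand.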

Run Distributed Training on Two Windows Machines

Running distributed training on Windows is the same as on Linux/Mac, but you may need to turn off the Windows firewall on both machines. You can use either IPv6 or IPv4 addresses for COPYCAT_MAIN_ADDR and COPYCAT_LOCAL_ADDR, but be aware that IPv4 displays errors on the command line.