Automated Distributed Training Example

Distributed training works with off-the-shelf render managers such as OpenCue. Once you’ve set up the render manager, a simple Python script allows you to open the render manager directly from the menu in Nuke, similar to frame-based rendering on render farms.

See https://www.opencue.io/docs/getting-started/ for more information on configuring OpenCue or the relevant documentation for your render manager.

There are a few differences when rendering frames as opposed to running CopyCat training, and this topic is an example guide to help set it up specifically for training. We’ve used OpenCue as an example, but other render managers work similarly. The only limitations are that your chosen render manager must:

In OpenCue, CopyCat training commands are entered into a shell command box, which is part of the job submission GUI. Services with tags can also be used to associate CopyCat training commands with worker machines. Once you have OpenCue set up, and with Nuke installed on the worker machines, install CueGUI on your machine and follow these steps to begin running distributed CopyCat training jobs.

Create a Service with a Tag

A Service is a collection of worker machine requirements including minimum threads, memory, GPU memory and tags. Services can be given a name and applied to each layer of a job in OpenCue.

Each layer (or sub-job) contains a CopyCat command for a particular worker machine. By creating a Service with a tag, applying the Service to the layer and the tag to the worker machine, we can make each layer’s CopyCat command run on the intended worker machine.

The aim is one CopyCat process per machine, assuming each machine has one GPU. If CopyCat processes were distributed arbitrarily, several could end up sharing a GPU, and each training step would take longer than it does with one process per GPU.

In CueCommander, go to Views/Plugins -> CueCommander -> Services.
Click New in the bottom-left of the dialog.
In the properties panel, enter Worker1 in the Name field and in the Custom Tags field.

Click Save to exit.
Repeat the above steps for each remaining worker machine in your setup, for example Worker2, Worker3, and Worker4.

Add Tags to the Worker Machines

Tags ensure that CopyCat commands run on the intended worker machine, avoiding random distribution.

In CueCommander, go to Views/Plugins -> CueCommander -> Monitor Hosts.
Right-click on your main CopyCat worker machine and select Add Tags.
Add the Worker1 tag.
Repeat the above steps for each remaining worker machine in your setup, adding the matching tag to each, for example Worker2, Worker3, and Worker4.

Create the CopyCat Training Job

The OpenCue Submit panel controls job information in CueCommander. The images show examples of what is required in each field.

In the Job Info section, enter a Job Name, User Name, Shot name and any other info you might use to identify a job.

In the Layer Info section, give the layer a name.
Set the Job Type to Shell and then in the Command To Run field, enter:

COPYCAT_MAIN_ADDR=<IP-address-1> COPYCAT_MAIN_PORT=30000 COPYCAT_RANK=0 COPYCAT_WORLD_SIZE=3 /opt/Nuke14.1v1/Nuke14.1 -F #IFRAME# -X CopyCat1 --gpu <your-nuke-script>.nk

• COPYCAT_MAIN_ADDR is the IP address of the worker machine you want the main CopyCat process to run on. In this example, that is the machine tagged Worker1.

Note: The machine tagged Worker1 in CueCommander must have the same IP address as COPYCAT_MAIN_ADDR in Command To Run.

• COPYCAT_MAIN_PORT is a port on the main machine, chosen from the range of ports exposed for external connections, in this case 30000.

Note: Your range of available ports may not include 30000.

• COPYCAT_WORLD_SIZE is the number of worker machines in your setup, in this case 3.

• /opt/Nuke14.1v1 is the install location of Nuke on your worker machine(s).

• #IFRAME# is a variable that automatically expands to the frame number from the CopyCat command.

• CopyCat1 is the name of the node you're processing in the specified script <your-nuke-script>.nk. The script must be stored in a location the worker machines can access, such as a shared directory.
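Because your available ports may not include 30000, it can help to check a candidate port before using it for COPYCAT_MAIN_PORT. The following is a minimal sketch, not part of Nuke or OpenCue: the function name port_is_free is hypothetical, and it assumes the connection uses TCP, which this guide does not state. Run on the main worker machine, it only reports whether the port can be bound locally at that moment. It cannot tell you whether a firewall exposes the port to the other workers.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical helper for illustration only. A successful bind means no
    other local process holds the port right now; it says nothing about
    firewall rules between machines.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permissions error.
            return False
    return True
```

For example, calling port_is_free(30000) on the machine tagged Worker1 returns False if another process is already bound to that port, in which case pick a different port from your exposed range.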

Set Services to Worker1.

Click the + button at the bottom of the dialog to add a new layer.

Give the layer a different Layer Name, such as copycat-training-layer1.
Set the Job Type to Shell and then enter the Command To Run. Increment COPYCAT_RANK by 1 and set COPYCAT_LOCAL_ADDR to the IP address of Worker2:

COPYCAT_MAIN_ADDR=<IP-address-1> COPYCAT_MAIN_PORT=30000 COPYCAT_RANK=1 COPYCAT_LOCAL_ADDR=<IP-address-2> COPYCAT_WORLD_SIZE=3 /opt/Nuke14.1v1/Nuke14.1 -F #IFRAME# -X CopyCat1 --gpu <your-nuke-script>.nk

Set Services to Worker2.
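Repeat the add new layer step for each remaining worker machine in your setup. COPYCAT_WORLD_SIZE must match the total number of worker machines in every layer's command, so a setup with four workers uses COPYCAT_WORLD_SIZE=4 and the following values: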

• Worker 1 - COPYCAT_RANK=0 COPYCAT_LOCAL_ADDR=<IP-address-1>

• Worker 2 - COPYCAT_RANK=1 COPYCAT_LOCAL_ADDR=<IP-address-2>

• Worker 3 - COPYCAT_RANK=2 COPYCAT_LOCAL_ADDR=<IP-address-3>

• Worker 4 - COPYCAT_RANK=3 COPYCAT_LOCAL_ADDR=<IP-address-4>
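Typing one command per layer by hand makes it easy to mistype a rank or address. The sketch below generates the Command To Run text for each layer from a list of worker IP addresses, following the pattern shown above. It is an illustration only: the function name copycat_layer_commands, the example IP addresses, and the script path are all hypothetical. It sets COPYCAT_LOCAL_ADDR on every layer, including rank 0, and treats the first address in the list as the main machine.

```python
def copycat_layer_commands(worker_addrs, nuke_exe, script,
                           node="CopyCat1", port=30000):
    """Build one shell command per worker, in rank order.

    worker_addrs[0] is used as COPYCAT_MAIN_ADDR, and the number of
    addresses is used as COPYCAT_WORLD_SIZE.
    """
    if not worker_addrs:
        raise ValueError("at least one worker address is required")

    world_size = len(worker_addrs)
    main_addr = worker_addrs[0]
    commands = []
    for rank, addr in enumerate(worker_addrs):
        parts = [
            f"COPYCAT_MAIN_ADDR={main_addr}",
            f"COPYCAT_MAIN_PORT={port}",
            f"COPYCAT_RANK={rank}",
            f"COPYCAT_LOCAL_ADDR={addr}",
            f"COPYCAT_WORLD_SIZE={world_size}",
            nuke_exe, "-F", "#IFRAME#", "-X", node, "--gpu", script,
        ]
        commands.append(" ".join(parts))
    return commands


# Hypothetical addresses and paths; replace them with your own.
example_commands = copycat_layer_commands(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "/opt/Nuke14.1v1/Nuke14.1",
    "/shared/training.nk",
)
for rank, command in enumerate(example_commands):
    # Layer for rank N pairs with the Service named Worker(N + 1).
    print(f"Worker{rank + 1}: {command}")
```

Each printed line can be pasted into the Command To Run field of the layer whose Services field is set to the matching Worker name.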

Submit the CopyCat Training Job

After you've created the services, tags, and jobs within OpenCue, you can submit training jobs to CueTopia, OpenCue's management and monitoring interface.

In the OpenCue Submit panel, click Submit to start the training job. The following image shows the output of a successful run in CueTopia.

Add OpenCue Functions to Nuke's Render Menu

Using Nuke's Python integration, you can add training commands linked to OpenCue directly to Nuke's UI. We've included an example Python script here: copyCatExample.py. You can copy the example into your .nuke directory and then add a few lines to your menu.py file to add the menu options to Nuke's Render menu.

Note: You may have to create the menu.py file if it doesn't already exist.

Add to menu.py to create Render menu items

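import nuke  # harmless if Nuke already provides this module to menu.py
import copycatExample  # must match the example script's file name in your .nuke directory

# Look up Nuke's main menu bar, then its Render menu.
mainMenu = nuke.menu("Nuke")
renderMenu = mainMenu.menu("Render")
# Add a menu item that calls the example script's openModalDialog().
renderMenu.addCommand("Train CopyCat on render farm", "copycatExample.openModalDialog()")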