Automated Distributed Training Example

Distributed training works with off-the-shelf render managers such as OpenCue. Once you’ve set up the render manager, a simple Python script allows you to open the render manager directly from the menu in Nuke, similar to frame-based rendering on render farms.

For more information on configuring OpenCue, see https://www.opencue.io/docs/getting-started/. For other render managers, see the relevant documentation.

Rendering frames and running CopyCat training differ in a few ways, so this topic works through an example setup aimed specifically at training. We’ve used OpenCue as an example, but other render managers work similarly. The only limitations are that your chosen render manager must:

In OpenCue, CopyCat training commands are entered into a shell command box, which is part of the job submission GUI. Services with tags can also be used to associate CopyCat training commands with worker machines. Once you have OpenCue set up, and with Nuke installed on the worker machines, install CueGUI on your machine and follow these steps to begin running distributed CopyCat training jobs.

Create a Service with a Tag

A Service is a collection of worker machine requirements including minimum threads, memory, GPU memory and tags. Services can be given a name and applied to each layer of a job in OpenCue.
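As a rough mental model only (this is not the OpenCue API), a Service can be pictured as a small record of requirements that a worker machine either meets or does not. The class names, field names, and units below are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical stand-in for a Service: named worker requirements plus one tag."""
    name: str
    min_threads: int
    min_memory_mb: int
    min_gpu_memory_mb: int
    tag: str


@dataclass(frozen=True)
class Worker:
    """Hypothetical description of a worker machine and the tags applied to it."""
    hostname: str
    threads: int
    memory_mb: int
    gpu_memory_mb: int
    tags: frozenset = field(default_factory=frozenset)


def worker_satisfies(worker: Worker, service: ServiceSpec) -> bool:
    """Return True if the worker meets every minimum and carries the Service's tag."""
    return (
        worker.threads >= service.min_threads
        and worker.memory_mb >= service.min_memory_mb
        and worker.gpu_memory_mb >= service.min_gpu_memory_mb
        and service.tag in worker.tags
    )
```

In this model, a worker without the matching tag is never eligible for the layer, however much memory or GPU memory it has.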

Each layer (or sub-job) contains a CopyCat command for a particular worker machine. By creating a Service with a tag, applying the Service to the layer and the tag to the worker machine, we can make each layer’s CopyCat command run on the intended worker machine.

The aim is one CopyCat process per machine, assuming each machine has one GPU. If CopyCat processes were distributed at random, several could end up sharing a GPU, and each training step would take longer than with one process per GPU.
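The one-layer-per-worker idea can be sketched as a small planning helper. This is an illustration rather than anything OpenCue provides: the function name, the dictionary keys, and the copycat_N tag naming scheme are all hypothetical, and command_template is a placeholder because the actual CopyCat command depends on your own setup.

```python
def plan_layers(worker_hostnames, command_template):
    """Build one layer description per worker, each with its own unique tag.

    command_template is a placeholder string with {rank} and {world_size}
    fields; substitute the real CopyCat command for your environment.
    """
    if len(set(worker_hostnames)) != len(worker_hostnames):
        raise ValueError("each worker must be listed once: one process per machine")

    world_size = len(worker_hostnames)
    layers = []
    for rank, hostname in enumerate(worker_hostnames):
        tag = f"copycat_{rank}"  # hypothetical naming scheme
        layers.append({
            "layer_name": f"train_{rank}",
            "service_tag": tag,   # tag to set on this layer's Service
            "worker": hostname,   # machine that should receive the same tag
            "command": command_template.format(rank=rank, world_size=world_size),
        })
    return layers


# Example with a stand-in command; prints one planned layer per worker.
for layer in plan_layers(["gpu-node-01", "gpu-node-02"], "echo rank {rank} of {world_size}"):
    print(layer)
```

Rejecting duplicate hostnames up front keeps the plan at one process per machine, which is the property the tags are meant to enforce.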

Add Tags to the Worker Machines

Tags ensure that CopyCat commands run on the intended worker machine, avoiding random distribution.
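Before applying tags, it can help to confirm that each training tag belongs to exactly one worker. The sketch below assumes you keep your own mapping of hostnames to training tags; it doesn't query OpenCue, and the function name is hypothetical.

```python
from collections import defaultdict


def find_shared_tags(worker_tags):
    """Return {tag: sorted hostnames} for every tag carried by more than one worker."""
    hosts_by_tag = defaultdict(set)
    for hostname, tags in worker_tags.items():
        for tag in tags:
            hosts_by_tag[tag].add(hostname)
    return {tag: sorted(hosts) for tag, hosts in hosts_by_tag.items() if len(hosts) > 1}
```

An empty result means every tag in the mapping identifies a single machine. Any tag that is reported could let a layer land on more than one worker.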

Create the CopyCat Training Job

The OpenCue Submit panel controls job information in CueCommander. The images show examples of what is required in each field.

Submit the CopyCat Training Job

After you've created the services, tags, and jobs within OpenCue, you can submit training jobs to CueTopia, OpenCue's management and monitoring interface.

In the OpenCue Submit panel, click Submit to start the training job. The following image shows the output of a successful run in CueTopia.

Add OpenCue Functions to Nuke's Render Menu

Using Nuke's Python integration, you can add training commands linked to OpenCue directly to Nuke's UI. We've included an example Python script here: copycatExample.py. Copy the example into your .nuke directory, then add a few lines to your menu.py file to create the new items in Nuke's Render menu.

Note: You may have to create the menu.py file if it doesn't already exist.

Add to menu.py to create Render menu items
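import nuke
import copycatExample  # must match the script's filename in your .nuke directory

# Look up Nuke's main menu bar, then its Render menu.
mainMenu = nuke.menu("Nuke")
renderMenu = mainMenu.menu("Render")
# Add an item that calls the example script's openModalDialog() when clicked.
renderMenu.addCommand("Train CopyCat on render farm", "copycatExample.openModalDialog()")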