Improving Op Performance

Consider using a scalable memory allocator

A memory allocator receives requests from the user application for memory (via malloc()) and responds with the address of a memory location that the application can use. When the application has finished with the memory, it can return it to the allocator (via free()). Given the central role memory plays in most applications, the performance of the memory allocator is critically important.

Most general-purpose allocators that ship with operating systems and C standard library implementations can suffer from a number of pitfalls when used within highly multithreaded applications:

  • More requests to the operating system for blocks of memory than required. Each request to the operating system results in a system call, which requires the CPU on which the request is made to switch to a higher privilege level and execute code in kernel mode. If an allocator makes more of these requests than necessary, the application’s performance may suffer.

  • Use of concurrency primitives for mutual exclusion. Many general-purpose allocators weren’t designed for use in multithreaded applications, so they protect their critical sections with mutexes or other concurrency primitives to enforce correctness. In highly multithreaded applications, contention on these mutexes (whether they are used directly or indirectly) can negatively impact the application’s performance.

The general-purpose allocator used on most GNU/Linux operating systems is ptmalloc2 or a variant of it. Whilst ptmalloc2 exhibits low memory overhead, it struggles to scale as requests are made from multiple threads. Therefore, if profiling indicates your scene is spending a significant portion of time in calls to malloc()/free(), you may consider replacing the general-purpose allocator with a scalable alternative.
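One rough way to judge whether allocator contention matters on a given machine is to time concurrent malloc()/free() traffic under the default allocator and again under an alternative. The following is a minimal sketch only: timeAllocations() is a hypothetical helper that is not part of Katana, and the batch size and block sizes are arbitrary assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper (not a Katana API): every thread repeatedly allocates
// and frees a small batch of blocks. Returns the elapsed wall time in seconds.
double timeAllocations(unsigned threadCount, std::size_t iterations)
{
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    for (unsigned t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([iterations]() {
            constexpr std::size_t kBatch = 64;  // arbitrary batch size
            void* blocks[kBatch];
            for (std::size_t i = 0; i < iterations; ++i)
            {
                for (std::size_t b = 0; b < kBatch; ++b)
                {
                    blocks[b] = std::malloc(32 + 8 * b);  // arbitrary small sizes
                    // Touch the block so the allocation is not optimized away.
                    if (blocks[b])
                        static_cast<volatile char*>(blocks[b])[0] = 1;
                }
                for (std::size_t b = 0; b < kBatch; ++b)
                    std::free(blocks[b]);
            }
        });
    }
    for (auto& worker : workers)
        worker.join();

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling timeAllocations() with increasing thread counts, once with the default allocator and once with an alternative substituted (on GNU/Linux this is commonly done by preloading the allocator's shared library), gives an indication of how each allocator scales. A synthetic loop like this is no substitute for profiling the real scene.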

Internally, Geolib3-MT uses a number of techniques to handle memory allocation, including the judicious use of jemalloc to satisfy some requests for memory in the critical path.

Summary

Multithreaded applications benefit from using a memory allocator that is thread-aware and thread-efficient. Geolib3-MT uses jemalloc internally; you are free to use an allocator of your choice, provided that memory allocated by your plugins is also deallocated by the same allocator.

Mark custom Ops as thread-safe where possible

Geolib3-MT is designed to scale across multiple cores by cooking locations in parallel. An Op may declare that it is not thread-safe by calling Foundry::Katana::GeolibSetupInterface::setThreading(), in which case a Global Execution Lock (GEL) must be acquired while cooking a location with the Op. This prevents other thread-unsafe Ops from cooking locations at the same time, so it is likely to cause pockets of inefficiency during scene traversal. When profiling your scenes, first identify and convert all thread-unsafe Ops.

Thread-unsafe Ops can easily be identified from the Render Log: if a thread-unsafe Op is detected in the Op tree, a warning is issued:

[WARN plugins.Geolib3-MT.OpTree]: Op (<opName>: <opType>) is marked ThreadModeGlobalUnsafe - this might degrade performance.

Summary

If an Op is not designed to run in a multithreaded context, Foundry::Katana::GeolibSetupInterface::setThreading() can be used to restrict processing to a single thread for that Op type. Begin scene profiling and optimization by identifying thread-unsafe Ops.

Mark custom Ops as collapsible where possible

Real-world scenes contain many instances of the following Op chain construct:

[Figure: AttributeSetChain.png, a chain of AttributeSet Ops]

This can arise for a number of reasons. Given the versatility of the AttributeSet node, it is frequently used to “hotfix” scenes to ensure a given enabling attribute is present at render time. Alternatively, the sequence of Ops can represent a set of logical steps taken during the creation of a given node graph. However, chains of similar Ops represent both a potential overhead and an optimization opportunity for Geolib3-MT (see Place chains of collapsible Nodes/Ops together).

Scenes may take advantage of Geolib3-MT’s ability to optimize the topology of the Op tree. Custom Ops may indicate that they can be collapsed by calling Foundry::Katana::GeolibSetupInterface::setOpsCollapsible(). This function takes an FnAttribute::StringAttribute as its parameter, whose value is the name of the attribute to which the collapsed Ops’ arguments will be passed.

For example, consider the chain of four AttributeSet Ops above. Geolib3-MT collapses this chain by gathering the Op args of each Op into a GroupAttribute called “batch”, whose children contain the Op args of each Op in the chain.

  • batch
    • Op1
    • Op2
    • Op3
    • Op4

Note: the Op args are passed to the collapsed Op in top-down order.

Custom user Ops can participate in the Op chain collapsing functionality by calling setOpsCollapsible() as described above. In the above example AttributeSet calls setOpsCollapsible("batch") and then tests whether it’s running in batch mode during its cook() call.
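The batch-mode test can be pictured with a small stand-alone model. This is a sketch under stated assumptions, not the AttributeSet implementation: MockSetArgs, MockOpArgs, Attributes, applyOne() and cook() are all hypothetical stand-ins built on the standard library, and the real Op args are attributes rather than structs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the arguments of one AttributeSet-like Op.
struct MockSetArgs
{
    std::string attrName;
    std::string value;
};

// Hypothetical stand-in for what a cook call receives: either the Op's own
// arguments, or a non-empty "batch" holding the arguments of every collapsed
// Op in top-down order.
struct MockOpArgs
{
    MockSetArgs own;
    std::vector<MockSetArgs> batch;
};

using Attributes = std::map<std::string, std::string>;

// The work of a single, uncollapsed Op.
void applyOne(const MockSetArgs& args, Attributes& location)
{
    location[args.attrName] = args.value;
}

// Sketch of a cook function that tests whether it is running in batch mode.
void cook(const MockOpArgs& opArgs, Attributes& location)
{
    if (opArgs.batch.empty())
    {
        applyOne(opArgs.own, location);  // ordinary, uncollapsed cook
        return;
    }
    // Batch mode: apply each collapsed Op's arguments in top-down order so
    // the result matches cooking the Ops one after another.
    for (const MockSetArgs& args : opArgs.batch)
        applyOne(args, location);
}
```

In this model, cooking two Ops in sequence and cooking one collapsed Op whose batch holds the same two sets of arguments leave the location with identical attributes, which is the property a collapsible Op has to preserve.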

An Op that participates in the Op chain collapsing system makes the firm guarantee that processing a collapsed chain of Ops produces a result identical to processing each Op in sequence. The implications mainly concern queries of upstream locations or attributes that could have been produced or modified by the chain that has been collapsed. As a concrete example, consider a two-Op chain in which the first Op sets the attribute “hello=world” and the second Op prints “Hi There!” if “hello” is equal to “world”. If the second Op uses Interface::getAttr() to query the value of “hello” in the collapsed chain, the result will be empty, because “hello” has been set on the location’s output but did not exist on the input. To remedy this, the Op can be refactored to call Interface::getOutputAttr() instead.
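The difference between the two queries can be illustrated with a toy model. Everything here is a hypothetical stand-in rather than the Katana API: MockInterface only mimics the roles of getAttr() and getOutputAttr(), and it assumes that an output query falls back to the input when the attribute has not been set during the current cook.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the cook interface; NOT the Katana API.
struct MockInterface
{
    std::map<std::string, std::string> input;   // attributes arriving from upstream
    std::map<std::string, std::string> output;  // attributes set during this cook

    void setAttr(const std::string& name, const std::string& value)
    {
        output[name] = value;
    }

    // Plays the role of getAttr(): sees only the incoming location.
    std::string getAttr(const std::string& name) const
    {
        auto it = input.find(name);
        return it != input.end() ? it->second : std::string();
    }

    // Plays the role of getOutputAttr(): sees attributes set so far in this
    // cook and (an assumption of this mock) falls back to the input otherwise.
    std::string getOutputAttr(const std::string& name) const
    {
        auto it = output.find(name);
        return it != output.end() ? it->second : getAttr(name);
    }
};

// First Op in the chain: sets hello=world.
void cookSetHello(MockInterface& iface)
{
    iface.setAttr("hello", "world");
}

// Second Op in the chain: returns true when it would print "Hi There!".
bool cookGreeter(const MockInterface& iface, bool queryOutput)
{
    const std::string hello =
        queryOutput ? iface.getOutputAttr("hello") : iface.getAttr("hello");
    return hello == "world";
}

// Collapsed chain: both Ops run inside one cook call on the same interface,
// and "hello" is absent from the input.
bool greetsWhenCollapsed(bool queryOutput)
{
    MockInterface iface;
    cookSetHello(iface);
    return cookGreeter(iface, queryOutput);
}
```

In this model greetsWhenCollapsed(false) reports that no greeting is produced, while greetsWhenCollapsed(true) reports that it is, mirroring the refactoring suggested above.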

Summary

Participate in Op chain collapsing so that Geolib3-MT can collapse runs of consecutive Ops of the same type. Use setOpsCollapsible() to indicate your participation, and refactor your Op’s code to handle the batch Op arguments.

Cache frequently accessed attributes

While accessing an FnAttribute’s data is inexpensive, retaining or releasing an FnAttribute object requires modifying a reference count. This is not typically a problem, but the cost may accumulate for Ops that create many temporary or short-lived FnAttribute objects, especially if many threads are executing the Op at once. In cases where an FnAttribute instance is used frequently, consider whether it can be cached to avoid this overhead.

As a concrete example, consider the following OpScript snippet:

local function generateChildren(count)
    for i=1,count do
        -- data.name is fetched again on every iteration
        local name = Interface.GetAttr("data.name"):getValue()..tostring(i)
        Interface.CreateChild(name)
    end
end

This function creates count children, with names derived from the input attribute data.name. If count is large, accessing this attribute inside the loop causes unnecessary work for two reasons:

  • Lua must allocate and deallocate an object for the data.name attribute each iteration, causing more work for the Garbage Collector.

  • This allocation and destruction causes an increment and decrement in the attribute’s atomic reference count, potentially introducing stalls between threads.

The second bottleneck will apply when OpScripts accessing the same set of attributes are executed on many threads. To avoid accessing the reference count so frequently, this snippet may be rewritten as follows:

local function generateChildren(count)
    -- Fetch data.name once, before the loop
    local stem = Interface.GetAttr("data.name"):getValue()
    for i=1,count do
        local name = stem .. tostring(i)
        Interface.CreateChild(name)
    end
end

Now data.name is accessed only once, outside of the loop, so the reference count is not touched while the loop executes.
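The same hoisting applies to C++ Ops. The sketch below models it with standard library types only: MockAttr, MockInterface and g_retains are hypothetical stand-ins (not FnAttribute or the cook interface) that count how often a reference-counted handle is copied.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Counts handle copies; each copy stands for one reference-count increment.
static std::atomic<long> g_retains{0};

// Hypothetical stand-in for a reference-counted attribute handle.
struct MockAttr
{
    std::shared_ptr<const std::string> data;

    explicit MockAttr(std::string value)
        : data(std::make_shared<std::string>(std::move(value))) {}
    MockAttr(const MockAttr& other) : data(other.data) { ++g_retains; }
    MockAttr& operator=(const MockAttr&) = delete;

    const std::string& getValue() const { return *data; }
};

// Hypothetical stand-in for the cook interface: every query hands back a new
// handle by value, which costs one retain.
struct MockInterface
{
    MockAttr stored{"child"};
    MockAttr getAttr(const std::string& /*name*/) const { return stored; }
};

// Queries the attribute on every iteration: one retain per child.
std::vector<std::string> namesInLoop(const MockInterface& iface, int count)
{
    std::vector<std::string> names;
    for (int i = 1; i <= count; ++i)
        names.push_back(iface.getAttr("data.name").getValue() + std::to_string(i));
    return names;
}

// Queries the attribute once and reuses the extracted value.
std::vector<std::string> namesHoisted(const MockInterface& iface, int count)
{
    const std::string stem = iface.getAttr("data.name").getValue();
    std::vector<std::string> names;
    for (int i = 1; i <= count; ++i)
        names.push_back(stem + std::to_string(i));
    return names;
}
```

Both functions build the same names, but in this model the hoisted version touches the reference count a small, fixed number of times regardless of count, whereas the loop version touches it at least once per child.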

Summary

Where possible, refactor OpScripts and C++ Ops to access attributes outside of tight loops. This prevents unnecessary modification of attribute reference counts, reducing the number of stalls when these Ops are executed on many threads.