A memory allocator receives requests from the user application for memory
(via malloc()) and responds with the address of a memory location that can
be used by the application. When the application has finished with the memory,
it can return this memory to the allocator (via
free()). Given the central
role memory plays in most applications, the performance of the memory allocator
is of critical importance.
Most general-purpose allocators that ship with operating systems and C Standard Library implementations can suffer from a number of pitfalls when used within highly multithreaded applications:
- More requests to the operating system for blocks of memory than required. Each request to the operating system results in a system call, which requires the CPU on which the request is made to switch to a higher privilege level and execute code in kernel mode. If an allocator makes more requests to the operating system than required, the application’s performance may suffer.
- Use of concurrency primitives for mutual exclusion. Many general-purpose allocators weren’t designed for use in multithreaded applications, so mutexes or other concurrency primitives are used to protect their critical sections and enforce correctness. In highly multithreaded applications, the use of mutexes (either directly or indirectly) can negatively impact the application’s performance.
The general-purpose allocator used on most GNU/Linux operating systems is
ptmalloc2 or a variant of it. Whilst ptmalloc2 exhibits low memory overhead, it
struggles to scale as requests are made from multiple threads. Therefore, if
profiling indicates your scene is spending a significant portion of time in
free(), you may consider replacing the general-purpose
allocator with a scalable alternative.
Internally, Geolib3-MT uses a number of techniques to handle memory allocation, including the judicious use of jemalloc to satisfy some requests for memory in the critical path.
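One rough way to judge whether the allocator is a point of contention is to time the same allocation-heavy loop on one thread and then on several. The sketch below is illustrative only and is not part of Geolib3-MT; the function name timeAllocations and the block sizes are assumptions chosen for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threadCount` threads performs `perThread`
// malloc()/free() pairs. Returns the elapsed wall-clock time in seconds.
double timeAllocations(unsigned threadCount, std::size_t perThread)
{
    auto worker = [perThread]() {
        for (std::size_t i = 0; i < perThread; ++i) {
            // Vary the size a little so requests are not all identical.
            void* p = std::malloc(16 + (i % 64) * 8);
            assert(p != nullptr);
            // Touch the block so the compiler cannot elide the pair.
            *static_cast<volatile char*>(p) = 1;
            std::free(p);
        }
    };

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Each thread does the same amount of work, so if the elapsed time rises sharply as threadCount grows, the allocator is a likely bottleneck. A profiler run on the real scene remains the more reliable measurement.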
Geolib3-MT is designed to scale across multiple cores by cooking locations in
parallel. An Op may declare that it is not thread-safe by calling
Foundry::Katana::GeolibSetupInterface::setThreading(), in which case a
Global Execution Lock (GEL) must be acquired while cooking a location with the
Op. This prevents other thread-unsafe Ops from cooking locations at the same
time, so it is likely to cause pockets of inefficiency during scene traversal.
When profiling your scenes, first identify and convert all thread-unsafe Ops.
Thread-unsafe Ops can easily be identified from the Render Log: if a thread-unsafe Op is detected in the Op tree, a warning is issued:
[WARN plugins.Geolib3-MT.OpTree]: Op (<opName>: <opType>) is marked
ThreadModeGlobalUnsafe - this might degrade performance.
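Once an Op has been audited and made thread-safe, it can declare this during setup. The sketch below uses a minimal stand-in class in a sketch namespace so that it compiles without the Katana headers. The enumerator ThreadModeConcurrent, the threading() accessor, the default mode, and the Op name MyConvertedOp are assumptions for illustration, so check the Katana API reference for the exact names.

```cpp
#include <cassert>

// Minimal stand-in for the Katana setup interface so the sketch compiles on
// its own. In a real plug-in this class comes from the Katana headers.
namespace sketch {
class GeolibSetupInterface {
public:
    // ThreadModeGlobalUnsafe appears in the Render Log warning;
    // ThreadModeConcurrent is ASSUMED to be the name of the thread-safe mode.
    enum ThreadMode { ThreadModeConcurrent, ThreadModeGlobalUnsafe };

    void setThreading(ThreadMode mode) { m_mode = mode; }

    // Hypothetical accessor, present only so the sketch can be checked.
    ThreadMode threading() const { return m_mode; }

private:
    // Default chosen arbitrarily for this sketch.
    ThreadMode m_mode = ThreadModeGlobalUnsafe;
};
}  // namespace sketch

// Hypothetical custom Op whose cook code no longer touches shared mutable state.
struct MyConvertedOp {
    static void setup(sketch::GeolibSetupInterface& setupInterface)
    {
        // Declare that locations may be cooked concurrently, so the Global
        // Execution Lock does not need to be acquired for this Op.
        setupInterface.setThreading(
            sketch::GeolibSetupInterface::ThreadModeConcurrent);
    }
};
```

Declaring an Op thread-safe is only valid after its cook code has actually been made safe for concurrent use; changing the flag alone does not remove data races.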
Real-world scenes contain many instances of the following Op chain construct: several AttributeSet Ops connected in series.
This can arise for a number of reasons. Given the versatility of the AttributeSet node, it is frequently used to “hotfix” scenes to ensure a given enabling attribute is present at render time. Alternatively, the sequence of Ops can represent a set of logical steps taken during the creation of a given node graph. However, chains of similar Ops represent both a potential overhead and an optimization opportunity for Geolib3-MT (see Place chains of collapsible Nodes/Ops together).
Scenes may take advantage of Geolib3-MT’s ability to optimize the topology of
the Op tree. Custom Ops may indicate they can be collapsed by calling
Foundry::Katana::GeolibSetupInterface::setOpsCollapsible(). This function
takes an FnAttribute::StringAttribute as a parameter, whose value indicates
the name of the attribute to which the collapsed Ops’ arguments will be passed.
For example, consider the chain of four AttributeSet ops above. Geolib3-MT
will collapse this chain of Ops by gathering the Op args for each Op into a
GroupAttribute called “batch”, whose children contain the Op args of each
Op in the chain.
Note: the Op args are passed to the collapsed Op in top down order.
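The gathering step can be pictured with plain standard-library containers. This is a simplified model, not Geolib3-MT’s implementation: OpArgs, Group, collapseChain, cookBatch, the child names op0, op1, and so on, and the arg names attributeName and value are all hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: Op args as name/value pairs, and a group as an
// ordered list of named children.
using OpArgs = std::map<std::string, std::string>;
using Group = std::vector<std::pair<std::string, OpArgs>>;

// Gather each Op's args, in top-down order, as children of a single group.
// The result models what the collapsed Op would receive as "batch".
Group collapseChain(const std::vector<OpArgs>& chainTopDown)
{
    Group batch;
    for (std::size_t i = 0; i < chainTopDown.size(); ++i)
        batch.emplace_back("op" + std::to_string(i), chainTopDown[i]);
    return batch;
}

// Process the children in order, as an AttributeSet-like Op might in batch
// mode. Later entries overwrite earlier ones, matching sequential cooking.
std::map<std::string, std::string> cookBatch(const Group& batch)
{
    std::map<std::string, std::string> attrs;
    for (const auto& child : batch)
        attrs[child.second.at("attributeName")] = child.second.at("value");
    return attrs;
}
```

Because the children are visited in top-down order, a later Op in the chain that sets the same attribute still wins, just as it would if each Op were cooked separately.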
Custom user Ops can participate in the Op chain collapsing functionality by
calling setOpsCollapsible() as described above. In the above example, the
AttributeSet Op calls setOpsCollapsible("batch") and then tests whether
it’s running in batch mode when it cooks a location.
If participating in the Op chain collapsing system, the Op makes the firm
guarantee that the result of processing a collapsed chain of Ops must be
identical to processing each one in sequence. The implications of this mainly
relate to the querying of upstream locations or attributes where those
locations or attributes could have been produced or modified by the chain that
has been collapsed. As a concrete example, consider a two-Op chain in which the
first Op sets an attribute “hello=world” and the second Op in the chain prints
“Hi There!” if “hello” is equal to “world”. If the second Op uses
Interface::getAttr() to query the value of “hello” in the collapsed chain
the result will be empty, as hello has been set on the location’s output but
did not exist on the input. To remedy this, the Op can be refactored to query
the location’s output attributes rather than its input.
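The “hello=world” case can be modelled with a small stand-in for the cook interface that keeps input and output attributes separately. SketchInterface and its member names, including getOutputAttr, are assumptions made for this sketch rather than the Katana API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a cook interface within one collapsed cook call.
struct SketchInterface {
    std::map<std::string, std::string> input;   // attributes from upstream
    std::map<std::string, std::string> output;  // attributes set so far

    // Query the location's input; returns an empty string if absent.
    std::string getAttr(const std::string& name) const
    {
        auto it = input.find(name);
        return it == input.end() ? std::string() : it->second;
    }

    // Query what has been written to the location's output so far.
    std::string getOutputAttr(const std::string& name) const
    {
        auto it = output.find(name);
        return it == output.end() ? std::string() : it->second;
    }

    void setAttr(const std::string& name, const std::string& value)
    {
        output[name] = value;
    }
};

// First Op in the chain: sets hello=world on the output.
void setHello(SketchInterface& i) { i.setAttr("hello", "world"); }

// Second Op written against the input: misses the value when collapsed.
bool greetsFromInput(const SketchInterface& i)
{
    return i.getAttr("hello") == "world";
}

// Second Op refactored to read the output: sees the first Op's result.
bool greetsFromOutput(const SketchInterface& i)
{
    return i.getOutputAttr("hello") == "world";
}
```

In the collapsed chain both Ops share one cook call, so the input-side query finds nothing while the output-side query sees the value set by the first Op.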
While accessing an FnAttribute’s data is inexpensive, retaining or releasing an FnAttribute object requires modifying a reference count. This is not typically a problem; however, the cost may accumulate for Ops that create many temporary or short-lived FnAttribute objects, especially if many threads are executing the Op at once. In cases where an FnAttribute instance is frequently used, consider whether it may be cached to avoid this overhead.
As a concrete example, consider the following OpScript snippet:
    local function generateChildren(count)
        for i=1,count do
            -- The attribute is looked up again on every iteration.
            local name = Interface.GetAttr("data.name"):getValue() .. tostring(i)
            Interface.CreateChild(name)
        end
    end
This function creates count children, with names derived from the input attribute data.name. If count is large, accessing this attribute inside the loop causes unnecessary work for two reasons:
- Lua must allocate and deallocate an object for the data.name attribute each iteration, causing more work for the Garbage Collector.
- This allocation and destruction causes an increment and decrement in the attribute’s atomic reference count, potentially introducing stalls between threads.
The second bottleneck will apply when OpScripts accessing the same set of attributes are executed on many threads. To avoid accessing the reference count so frequently, this snippet may be rewritten as follows:
    local function generateChildren(count)
        -- Look up the attribute once, before the loop.
        local stem = Interface.GetAttr("data.name"):getValue()
        for i=1,count do
            local name = stem .. tostring(i)
            Interface.CreateChild(name)
        end
    end
Now data.name is accessed only once, outside of the loop, so the reference count is not touched while the loop executes.
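The same idea carries over to Ops written in C++. The sketch below uses std::shared_ptr as a stand-in for a reference-counted attribute handle; AttrHandle, Location, and both function names are hypothetical and do not model the real FnAttribute classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in for a reference-counted attribute: copying the handle increments
// an atomic count, destroying the copy decrements it.
using AttrHandle = std::shared_ptr<const std::string>;

// Hypothetical location whose lookup returns the handle by value.
struct Location {
    AttrHandle name;
    AttrHandle getAttr() const { return name; }
};

// Looks the attribute up on every iteration: `count` retain/release pairs.
std::vector<std::string> childNamesUncached(const Location& loc,
                                            std::size_t count)
{
    std::vector<std::string> names;
    names.reserve(count);
    for (std::size_t i = 1; i <= count; ++i)
        names.push_back(*loc.getAttr() + std::to_string(i));
    return names;
}

// Looks the attribute up once: a single retain/release for the whole loop.
std::vector<std::string> childNamesCached(const Location& loc,
                                          std::size_t count)
{
    const AttrHandle stem = loc.getAttr();
    std::vector<std::string> names;
    names.reserve(count);
    for (std::size_t i = 1; i <= count; ++i)
        names.push_back(*stem + std::to_string(i));
    return names;
}
```

Both functions produce the same names; the cached version simply touches the shared reference count once instead of once per child.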