Samsung driver slow at multithreaded command buffer generation

Test Device: Galaxy S10e, Android 12, G76 GPU

In the Android Studio profiler, the driver looks much slower when using threads to generate command buffers than when using the main core alone. This seems to stem from every Vulkan call going through the sequence below. I don't have the validation layer enabled, and "Enable GPU debug layers" is disabled in Developer Options. On a 45 ms frame, this adds +13 ms when I try to use 3 cores: the main big core, another big core, and a medium core.

There is a long sequence of system calls, I assume to read a CPU timer, that likely doesn't need to happen under a futex. These also seem tied to validation calls, which also lock a futex.

It's unclear from the profile whether the futex is shared across all of my job threads, but it seems like it must be. Calls like BeginRenderPass contain 2 to 4 of these timing and/or validation calls. Each validation call causes a __futex_wait_ex syscall from MutexLockWithTimeout.

How do I disable these timer and validation calls, and get back to threadable Vulkan code without the slowdown?

VulkanCommandEncoder::BindResources
libVkLayer_khronos_validation.so
libVkLayer_khronos_validation.so
NonPI::MutexLockWithTimeout
__futex_wait_ex
syscall
[kernel.kallsyms] x8

The timing calls were libVkLayer_cpuTiming frames, followed by the validation and mutex locks as above. Now that "Enable GPU debug layers" is disabled, I no longer see those, but I do still see the validation calls/futexes above.
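In case it helps anyone else: the Developer Options toggle maps to global settings you can inspect over adb. If gpu_debug_app names your package, the system injects the layers listed in gpu_debug_layers regardless of what your app requests (setting names are from Android's GPU debug layer docs; check what your device actually returns):

```shell
# Is the system-wide "Enable GPU debug layers" toggle on? (1 = on, 0/null = off)
adb shell settings get global enable_gpu_debug_layers
# Which app the layers get injected into, and which layers
adb shell settings get global gpu_debug_app
adb shell settings get global gpu_debug_layers
# Turn layer injection off entirely
adb shell settings put global enable_gpu_debug_layers 0
```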

Turns out the validation layer was still enabled. It seems to force all calls through a single choke point, so disabling it improves things but doesn't totally fix multithreaded command generation. Command generation is still slow because of the 2x and 4x slower medium/little cores, so we'll need a restriction strategy that only uses the faster cores. Even the medium cores slow things down, since they are 2x slower on this 2/2/4 big/medium/little architecture.