[sles-beta] Tasks get stuck with RT priority

Thu May 8 23:48:56 MDT 2014

On Fri, 2014-05-09 at 04:27 +0000, Laxman, Mahendra wrote:
> Hi Libor,
> 
> Thanks for the suggestions. We have tested them and still see the
> kworker get stuck in the run queue. Afaik, keeping a kworker
> unscheduled for a long time is not good. Further, we found that gpg
> gets stuck whenever the kworker on any one core is stuck.

Yes.  Starving your own kernel indefinitely is a bad idea.  As things
stand, you simply cannot safely saturate any core.  No matter what you
do, workqueues may need to run even on an isolated core, so taking a
brief breath to let them run is a must.  That will change, but in the
here and now, 100.0% saturation is dangerous.
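
To illustrate "taking a brief breath", here is a minimal sketch of a
polling loop that blocks for a moment every so often.  The function
name, the parameters and their default values are hypothetical, chosen
only for illustration.  The point is that the hog must actually sleep:
under SCHED_FIFO, sched_yield() only yields to tasks of equal priority,
so a lower priority kworker would still not get to run.

```python
import time


def poll_with_breaths(work, iterations, breathe_every=1000,
                      breath_seconds=0.001, sleep=time.sleep):
    """Call work() `iterations` times, sleeping briefly at intervals.

    Hypothetical helper: the interval and sleep length are assumptions
    and would have to be tuned against real latency requirements.
    Returns the number of breaths taken.
    """
    if breathe_every <= 0:
        raise ValueError("breathe_every must be positive")
    breaths = 0
    for i in range(1, iterations + 1):
        work()
        if i % breathe_every == 0:
            # Blocking in sleep() takes the task off the run queue, so
            # lower priority tasks (e.g. a kworker) can use the CPU.
            sleep(breath_seconds)
            breaths += 1
    return breaths
```

The `sleep` argument is injectable only so the loop can be exercised
without actually sleeping.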

You have already met one aspect of that: tasks on non-isolated cores may
depend upon a workqueue running on the core you are monopolizing to make
progress, so your rt hog can block other tasks all over the box
indefinitely.

> By the way, we had already tested several of the settings you
> suggested, other than
> 'echo -1 > /proc/sys/kernel/sched_rt_runtime_us'.
> This looks promising: stress now monopolizes the given core.
> We see zero context switches of stress with this setting.

Yes, if you really want to saturate with rt tasks, turning the throttle
off is a must.  You also want it turned off because it is a global
throttle.  When the timer fires, the CPU running that timer will
traverse the entire box, perturbing all.  Should the timer happen to
run on your critical CPU, it will spend quite a few cycles doing that
instead of doing your latency critical work.
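
For checking the throttle state, a small sketch that reads the two
sysctl files and reports what share of each period rt tasks may
consume.  The helper names are hypothetical.  The files are
sched_rt_runtime_us and sched_rt_period_us; a negative runtime means
the throttle is off, and the usual defaults of 950000 and 1000000 give
rt tasks 95% of each second.

```python
from pathlib import Path

PROC_KERNEL = Path("/proc/sys/kernel")


def rt_throttle_share(runtime_us, period_us):
    """Return the fraction of each period rt tasks may run.

    Returns None when the throttle is disabled (runtime is negative,
    as after writing -1 to sched_rt_runtime_us).
    """
    if runtime_us < 0:
        return None
    return runtime_us / period_us


def read_rt_throttle(proc_dir=PROC_KERNEL):
    """Read the current settings from procfs (Linux only)."""
    runtime = int((proc_dir / "sched_rt_runtime_us").read_text())
    period = int((proc_dir / "sched_rt_period_us").read_text())
    return rt_throttle_share(runtime, period)
```

A result of None from read_rt_throttle() means rt tasks are not
throttled at all, which is the setting discussed above.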
> 
> The issue is when running gpg. It doesn't matter whether we isolated
> the core: as long as the kworker is stuck, gpg will be stuck. See the
> stack of gpg below when it is stuck. Afaik, flush_work is waiting on
> work that was submitted to the work queues (kworkers) of all cores on
> behalf of gpg; since that work is not served on the stressed core,
> gpg gets stuck.
> 
> # cat /proc/`pidof gpg`/stack 
> [<ffffffff8107c102>] flush_work+0x22/0x30
> [<ffffffff8107c223>] schedule_on_each_cpu+0xb3/0xf0
> [<ffffffff81123055>] sys_mlock+0x45/0x130
> [<ffffffff81466712>] system_call_fastpath+0x16/0x1b
> [<00007fbf0f286aa7>] 0x7fbf0f286aa7
> [<ffffffffffffffff>] 0xffffffffffffffff

That is the exact situation I mentioned above.  Until you allow the
kworker thread your rt hog is blocking to run, that synchronization will
never complete, and that gpg task is as good as dead.
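
If you want to spot other tasks caught the same way, the stack dump
can be checked mechanically.  A rough sketch, assuming the
/proc/<pid>/stack format shown in the trace above; the function names
are hypothetical, and the regular expression only handles lines of the
form "[<addr>] symbol+0xOFF/0xSIZE".

```python
import re

# Matches e.g. "[<ffffffff8107c102>] flush_work+0x22/0x30".
_FRAME = re.compile(
    r"\[<[0-9a-fA-F]+>\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+"
)


def stack_symbols(stack_text):
    """Return the kernel symbol names in a /proc/<pid>/stack dump.

    Frames without a symbol (raw addresses) are skipped.
    """
    return _FRAME.findall(stack_text)


def waiting_on_flush_work(stack_text):
    """True if the innermost symbolized frame is flush_work."""
    symbols = stack_symbols(stack_text)
    return bool(symbols) and symbols[0] == "flush_work"
```

Feeding it the contents of /proc/<pid>/stack for each task of interest
would flag those parked in flush_work, as gpg is in the trace above.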

What you want, but currently cannot have, is bare metal.  A feature to
get as close to that as possible is underway, but not ready for
prime time.  When that effort is complete, you will no longer need to
run your polling task with rt policy/priority: the improved isolation
will remove the competition you are trying in vain to keep completely
away from your precious core.
> 
-Mike