[OpenWrt-Devel] [PATCH][RFC] libubus: added max retry mechanism
Alexandru Ardelean
ardeleanalex at gmail.com
Tue Aug 5 06:04:47 EDT 2014
Right.
So, this email will start with an apology for all the noise and the false
alarm.
Having said that, I do owe some explanation.
[ I also found it strange that leaking that socket helped. ]
That socket is valid though, since it was returned by usock() in
ubus_reconnect().
What really happened [and threw me off] was that I was calling
ubus_complete_deferred_request() on an un-deferred request.
This happened after adding libcares to our module. ares_query() has the
annoying habit of calling our callback from within its own call if it
detects any errors right away.
This is annoying because it upsets the whole async-ness of things: we would
not expect our callback to be called before ares_query() returns.
So, we would do:
1) we get a ubus request
2) start async call setup (link FDs to uloop, etc.)
3a) if all went well up until now, defer the received ubus request;
3b) if we got an error during setup, do not defer the ubus request and reply
over ubus
4) if 3a), our async callback eventually finishes and completes the deferred
request
Normally we would not expect step 4) to happen before step 3), but that is
exactly what was happening.
The ubus_context was fine, but during ubus_complete_deferred_request() we
would send an invalid request (sketched below).
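A simplified sketch of that flow (the names, and start_async_operation() in
particular, are made-up stand-ins for our real libcares/uloop setup, not our
actual code):

#include <stdlib.h>
#include <libubus.h>

struct async_call {
	struct ubus_context *ctx;
	struct ubus_request_data req;	/* filled in by ubus_defer_request() */
};

/* hypothetical stand-in for step 2): hooks fds into uloop, starts the
 * async operation and arranges for async_done() to be called later */
static int start_async_operation(struct async_call *call);

/* step 4): runs when the async operation finishes */
static void async_done(struct async_call *call, int ret)
{
	/*
	 * If this runs before the handler below reached step 3a) - e.g.
	 * because the "async" library invoked us synchronously on an early
	 * error - then call->req was never filled in by ubus_defer_request()
	 * and we complete an invalid request.
	 */
	ubus_complete_deferred_request(call->ctx, &call->req, ret);
	free(call);
}

/* steps 1) to 3): the ubus method handler */
static int my_method(struct ubus_context *ctx, struct ubus_object *obj,
		     struct ubus_request_data *req, const char *method,
		     struct blob_attr *msg)
{
	struct async_call *call = calloc(1, sizeof(*call));

	if (!call)
		return UBUS_STATUS_UNKNOWN_ERROR;

	call->ctx = ctx;

	/* step 2): link fds to uloop, start the async operation, ... */
	if (start_async_operation(call)) {
		/* step 3b): setup failed, reply over ubus right away */
		free(call);
		return UBUS_STATUS_UNKNOWN_ERROR;
	}

	/* step 3a): setup succeeded, defer the received request */
	ubus_defer_request(ctx, req, &call->req);

	return 0;
}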
One idea would be to allow ubus_complete_deferred_request() to execute only
if the deferred flag is set [since it only gets set by the
ubus_defer_request() call].
Something like this:
https://github.com/commodo/ubus/commit/9e4a1d76859918ced3bd7f8d5d6b4e8c5aaa7482
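For illustration, the guard amounts to roughly this (sketch only; the
flag/field name is an assumption on my part, the linked commit is what I
actually tried):

void ubus_complete_deferred_request(struct ubus_context *ctx,
				    struct ubus_request_data *req, int ret)
{
	/* bail out unless ubus_defer_request() actually marked this
	 * request as deferred */
	if (!req->deferred)
		return;

	/* ... rest of the original completion code unchanged ... */
}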
But that only hides/avoids the problem.
Our fix was to add a uloop_timeout_set(0) in the callback we give to
ares_query(); the timer handler then calls our real callback, so the order of
async things is restored for all cases (including premature errors).
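Simplified, the workaround looks something like this (a sketch with assumed
names; error handling and copying of the DNS answer are omitted):

#include <libubox/uloop.h>
#include <libubox/utils.h>
#include <ares.h>

struct dns_lookup {
	struct uloop_timeout complete;	/* zero-delay timer, fired from uloop */
	int status;			/* ares status, saved for the timer cb */
};

/* the real completion; now guaranteed to run from the uloop main loop,
 * only after ares_query() has returned */
static void dns_lookup_complete(struct uloop_timeout *t)
{
	struct dns_lookup *lookup = container_of(t, struct dns_lookup, complete);

	/* ... complete the deferred ubus request based on lookup->status ... */
}

/* callback handed to ares_query(): only records the result and re-schedules;
 * abuf is only valid during this call, so a real implementation would copy
 * whatever it needs from it here */
static void dns_lookup_cb(void *arg, int status, int timeouts,
			  unsigned char *abuf, int alen)
{
	struct dns_lookup *lookup = arg;

	lookup->status = status;
	lookup->complete.cb = dns_lookup_complete;
	uloop_timeout_set(&lookup->complete, 0);
}

So even if ares_query(channel, name, ns_c_in, ns_t_a, dns_lookup_cb, lookup)
detects an error and invokes dns_lookup_cb() before it returns, the deferred
ubus request is only completed on the next uloop iteration.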
That's all from me.
Apologies again.
Thanks
Alex
On Sun, Aug 3, 2014 at 3:15 PM, Felix Fietkau <nbd at openwrt.org> wrote:
> On 2014-08-01 18:01, Alexandru Ardelean wrote:
> > Back with some findings.
> >
> > Please check this:
> >
> https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899
> That looks very wrong to me. If that triggers, you're leaking the fd.
>
> > The current problem [high CPU usage] seems to have appeared after some
> > modifications we made to the code which were related to a recent
> > proposal patch I sent [ that got rejected ].
> > Specifically, this one: [PATCH] libubus: handle NULL ubus_context in
> > all sync requests by creating a local instance
> >
> > [Some context]
> > We modified our code in some parts to allocate a ubus_context object for
> > certain sync calls, and free it right after we're done with them [the
> > calls].
> > This was done in order to avoid them getting mixed with async calls on
> > the same ubus_context, and avoid races [ in our logic ].
> >
> > We then got this high CPU usage issue.
> >
> > strace showed something like this [ this is just one cycle; strace was
> > spewing this continuously ]:
> >
> ============================================================================
> > clock_gettime(CLOCK_MONOTONIC, {46921, 889670571}) = 0
> > clock_gettime(CLOCK_MONOTONIC, {46921, 889680749}) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=31068896, u64=31068896}}}, 10, -1) = 1
> > recvmsg(0, 0x7fff5e7ce638, 0) = -1 EAGAIN (Resource
> > temporarily unavailable)
> That looks odd, recvmsg is polling fd 0. Did you close stdin?
>
> >
> ============================================================================
> >
> > Maybe we were also a bit too aggressive with our ubus_context alloc +
> > free approach. We were doing them for each call.
> > When we made some modifications to be a bit less aggressive, the high CPU
> > usage was no longer reproducible [ for the case we used to reproduce it ].
> >
> > Also, for all our main/static ubus_context objects we add a
> > connection_lost callback of our own, which essentially just does a
> > ubus_reconnect() call and some setup again.
> > We could try to use the ubus_auto_connect() call in the future; we
> > implemented our own connection_lost callback before that call was
> > available.
> >
> > Seems that after the ubus_free() call the socket would get closed, which
> > would also call our connection_lost callback, which would reconnect.
> > This seemed to work fine the first few times [ for our case ].
> > It was some time later [ in our reproducing case, 1-2 minutes ] that this
> > connection_lost cb + ubus_reconnect() call would cause strace to start
> > spewing continuously.
> >
> > If we had left the ubus_default_connection_lost() callback in place, it
> > would have called uloop_end(), then procd would have restarted our
> > process, and all would be well again for a while.
> > But we did not want that.
> Are you sure that this is a bug in libubus? If this is indeed the call
> from within libubus, then it appears that something may have corrupted
> the ubus_context. I think you need to take another look at the code.
>
> > [End Some context]
> >
> > After trying out a few things, and tracing the code, it seemed that not
> > closing the socket in ubus_reconnect() helped.
> > I will say directly that I do not understand the libubus code 100%
> > yet, and I am trying not to learn libubox yet, but I think that not
> > closing the socket if ctx->sock.eof is true prevents a race condition
> > where we would close some other context's socket.
> > Not sure about this, which is why I am asking for some opinions.
> >
> > Maybe Felix or someone can point us in a better direction with the
> > investigation.
> > Still, it makes sense to not close an eof-ed socket, and for our current
> > case, I do not see the process taking up all the CPU anymore.
> I completely disagree. Why would it make sense to leak the fd if eof is
> set? The fact that it somehow prevents your 100% CPU loop could be
> something unrelated.
>
> - Felix
>