[nuttx] SO_LINGER timeout not working

Discussion:

Pelle Windestam Pelle.Windestam@tagmaster.com [nuttx]

2018-02-14 12:15:13 UTC

Hi,

I have an application which is having some problems when it closes a TCP socket when the remote end has "disappeared" (i.e. network cable unplugged on the remote end). The socket has the SO_LINGER option enabled and set to 1 second. The issue is that the close() call does not return after 1 second, but after 120 seconds when the connection times out in tcp_timer().
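For reference, here is a minimal sketch of the socket setup described above, using the standard BSD socket API. The helper names set_linger() and get_linger() are made up for illustration and are not taken from the application in question.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: enable SO_LINGER with a timeout in seconds.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */

int set_linger(int sockfd, int seconds)
{
  struct linger ling;

  ling.l_onoff  = 1;        /* Linger enabled */
  ling.l_linger = seconds;  /* Longest time close() should block */

  return setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &ling, sizeof(ling));
}

/* Hypothetical helper: read the linger timeout back.
 * Returns the timeout in seconds, or -1 on error or if linger is disabled.
 */

int get_linger(int sockfd)
{
  struct linger ling;
  socklen_t len = sizeof(ling);

  if (getsockopt(sockfd, SOL_SOCKET, SO_LINGER, &ling, &len) < 0)
    {
      return -1;
    }

  return ling.l_onoff ? ling.l_linger : -1;
}
```

With set_linger(fd, 1), the expectation is that close(fd) blocks for at most about one second while unsent data is pending.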

tcp_close_disconnect() is called by close(). It sets up a callback to tcp_close_eventhandler(), triggers the Ethernet device to poll the TCP/IP stack(?) and waits for a semaphore. The TCP/IP stack calls the tcp_close_eventhandler() callback function, which sets the TCP_CLOSE flag and checks whether the SO_LINGER timeout has elapsed, which it hasn't, since this is done right after the call to close(). The problem seems to be that the tcp_close_eventhandler() callback is not called again until the timeout in tcp_timer(), but it needs to be called when the SO_LINGER timeout runs out, to release the semaphore that tcp_close_disconnect() is waiting for.

I'm still trying to wrap my head around how all these bits fit together, but if anyone can shed some light on the issue and/or possible solutions I am happy to hear it.

//Pelle

spudarnia@yahoo.com [nuttx]

2018-02-14 15:17:05 UTC

Other people are using SO_LINGER so I think it is basically functional, but there is something different about what you are seeing in your environment.

Specifically, I am thinking of this condition: "..the remote end has "disappeared" (i.e. network cable unplugged on the remote end)." Normally, losing the network has no effect on the Ethernet driver since it is completely unaware of what the PHY is doing.

Unless you have network monitoring logic enabled. That logic will receive an interrupt when the link is lost and take the network down.

Could that be happening? When the network is in the DOWN state, the timer heartbeat poll stops and everything is frozen in place. That seems to be what you are describing.

When the network goes down, the network device will generate the NETDEV_DOWN event. The SO_LINGER logic should register with the device interface logic (not with the TCP logic) to receive the NETDEV_DOWN event. When the device is down, the linger wait should terminate.

It looks to me like the network close logic handles these events: (TCP_NEWDATA | TCP_POLL | TCP_DISCONN_EVENTS), where TCP_DISCONN_EVENTS does include NETDEV_DOWN. It registers for the events like this:

state.cl_cb = tcp_callback_alloc(conn);

Which is an alias for:

devif_callback_alloc((conn)->dev, &(conn)->list)

That all seems okay, but somewhere in there is probably a condition where the downed network is causing the hang. At least that is my suspicion.

Greg

Pelle Windestam Pelle.Windestam@tagmaster.com [nuttx]

2018-02-16 13:00:43 UTC

Unless you have network monitoring logic enabled. That logic will receive an interrupt when the link is lost and take
the network down.
Could that be happening? When the network is in the DOWN state, the timer heartbeat poll stops and everything is frozen
in place. That seems to be what you are describing.
When the network goes down, the network device will generate the NETDEV_DOWN event. The SO_LINGER logic should register
with the device interface logic (not with the TCP logic) to receive the NETDEV_DOWN event. When the device is down,
the linger wait should terminate.

This is not the case, our PHY is actually a network switch with three ports, so the link is always "up" no matter what the cable status is on the outside.

It looks to me like the network close logic handles these events: (TCP_NEWDATA | TCP_POLL | TCP_DISCONN_EVENTS)

state.cl_cb = tcp_callback_alloc(conn);
devif_callback_alloc((conn)->dev, &(conn)->list)
That all seems okay, but somewhere in there is probably a condition where the downed network is causing the hang. At least
that is my suspicion.

The tcp_timer() function is periodically called, but since the conn->unacked counter becomes > 0 when the cable is disconnected, tcp_callback() is never called until the TCP connection times out.

Perhaps I am missing something, but there seems to be nothing calling tcp_callback() when the SO_LINGER timeout is reached, until the TCP connection times out after all retransmit attempts have failed. It seems like tcp_close_eventhandler() would have to be called periodically until it determines that the SO_LINGER timeout has been reached.

//Pelle

spudarnia@yahoo.com [nuttx]

2018-02-16 14:12:45 UTC

Now this is sounding like the same topic as https://groups.yahoo.com/neo/groups/nuttx/conversations/messages/17074

also earlier: https://groups.yahoo.com/neo/groups/nuttx/conversations/messages/3236

Is that the same issue?

Sebastien Lorquet sebastien@lorquet.fr [nuttx]

2018-02-16 14:17:17 UTC

No idea, but it looks similar, at least the conditions for triggering the issues
are the same, logical link disconnection without a physical link disconnection.

Sebastien

Pelle Windestam Pelle.Windestam@tagmaster.com [nuttx]

2018-02-19 08:40:14 UTC

No idea, but it looks similar, at least the conditions for triggering the issues are the same, logical link disconnection
without a physical link disconnection.
Sebastien

It does sound very similar. I managed to get the proper behavior by simply disabling SO_LINGER; the socket is then closed immediately even if there is pending data. I solved the case of being stuck in write() by setting up a timer that delivers a signal after a timeout, which cancels the write() if it does not return in time.

//Pelle