Freddie Chopin freddie_chopin@op.pl [nuttx]
2016-01-21 14:23:53 UTC
Hello Greg!
I would like to report a problem in nuttx and also ask you about
possible solutions. The solution we're hoping for is a complete fix in
nuttx, which would probably require some API changes. Otherwise we would
have to introduce and maintain our own fixes in nuttx, implement
numerous workarounds or completely replace the RTOS in this project.
The problem I would like to report is extremely serious and caused the
project we are working on to silently lose its main function
after some time. This is far worse than a crash, because from the outside
everything seemed to work perfectly fine, but the main functionality of
the project was not executed.
The problem was discovered in a rather old version of nuttx - we
were using something between 7.2 and 7.3 - but that doesn't matter:
current master still has the same issue.
I'll start with the description of the original issue we had - please
note that this applies to nuttx version 7.2 - in current master the
call path is a bit different.
Our project uses standard 100Hz tick frequency.
We used mq_timedsend() and mq_timedreceive() with a timeout value of
"10ms from last timeout", and the timeout condition was essential in the
application. This worked pretty well, but after about 49 days and 17
hours of continuous uptime the timeout was not detected at all. This is
a short description, but believe me that finding a problem like that is
extremely hard - it took over half a year from the day of the first
report (the first time someone noticed something was wrong) to the day
the root cause was found. Fortunately, in that time no
expensive machinery was fried to ashes.
It turns out that mq_timedsend() and mq_timedreceive() both
called clock_abstime2ticks() to convert the absolute timeout (timespec)
to a number of ticks since system start. This function
called clock_gettime(), which just returned the current time. The real
problem was in line 177
(https://bitbucket.org/patacongo/nuttx/src/681a5bb37a25659eaee38be73b9b5bce1d0a2e5b/sched/clock_gettime.c?at=nuttx-7.2&fileviewer=file-view-default#clock_gettime.c-177).
The conversion of "ticks since system start" to milliseconds overflows
just after (2^32)/10 system ticks, which is about 49 days and 17 hours
(with 100Hz tick frequency). When you request a timeout around this
moment, you get some ridiculous value - in our case, instead of "10ms
from last timeout" we got something like "40 days from last timeout".
Because of this, no timeout was reported and the application stopped
behaving as it should, but without any other symptom of the issue -
mq_timedreceive()/mq_timedsend() just didn't return with ETIMEDOUT when
they should have.
In current master the call path is different, but I believe it has the
same problem, now probably here -
https://bitbucket.org/patacongo/nuttx/src/eb44ef2bcde9692d0e7a170490f0fbb1b0d8a59e/sched/clock/clock_systimespec.c?at=master&fileviewer=file-view-default#clock_systimespec.c-149 .
Even though the variable holding milliseconds (confusingly named
"usecs") is 64 bits wide, the calculations are done in 32 bits anyway
(I checked on ARM Cortex-M3).
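That behavior follows from C's usual arithmetic conversions: a 64-bit destination does not widen a 32-bit multiplication. A small sketch (hypothetical helper names, not the actual nuttx macro):

```c
#include <stdint.h>

#define MSEC_PER_TICK 10u  /* 100Hz tick frequency */

/* The product is computed in 32 bits and wraps BEFORE being widened,
   so storing the result in a uint64_t does not help: */
static uint64_t tick2msec_narrow(uint32_t ticks)
{
  return ticks * MSEC_PER_TICK;
}

/* Casting one operand first forces the multiplication itself to be
   done in 64 bits, which is the usual fix: */
static uint64_t tick2msec_wide(uint32_t ticks)
{
  return (uint64_t)ticks * MSEC_PER_TICK;
}
```

For a tick count just past the boundary, e.g. 429496730, the narrow version yields 4 while the wide one yields 4294967300.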
Despite the information in nuttx/sched/Kconfig, the tick timer (with
the default 100Hz tick frequency) overflows after just 497 days of
uptime, not 13.6 years.
config SYSTEM_TIME64
	bool "64-bit system clock"
	default n
	---help---
		The system timer is incremented at the rate determined by
		USEC_PER_TICK, typically at 100Hz. The count at any given time is
		then the "uptime" in units of system timer ticks.  By default, the
		system time is 32-bits wide.  Those defaults provide a range of about
		13.6 years which is probably a sufficient range for "uptime".

		However, if the system timer rate is significantly higher than 100Hz
		and/or if a very long "uptime" is required, then this option can be
		selected to support a 64-bit wide timer.
The bigger issue here is that enabling this option is pointless... Yes,
the ticks will be accumulated in a 64-bit variable, but all of the code
base uses the clock_systimer() function, which does this:

return (uint32_t)(g_system_timer & 0x00000000ffffffff);

There is clock_systimer64(), which does the right thing, but this is
used in exactly one place in nuttx - in fs/procfs/fs_procfsuptime.c.
So you actually cannot change the system timer to be 64 bits wide -
everywhere this value is just truncated to 32 bits. But even if that
worked fine, it wouldn't change much - almost everywhere the value
returned by the clock_systimer() function is saved in a 32-bit variable.
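For comparison, the usual way a free-running 32-bit tick counter is kept safe in other systems is to compare tick values only through unsigned subtraction of relative intervals, never as absolute deadlines. A sketch of that idiom (my own illustration, not something nuttx currently does):

```c
#include <stdint.h>

/* Wrap-safe timeout check for a free-running 32-bit tick counter.
   (now - start) is well-defined modulo 2^32, so the comparison stays
   correct across counter wrap as long as the interval being measured
   is shorter than 2^31 ticks. */
static int interval_elapsed(uint32_t start, uint32_t now, uint32_t interval)
{
  return (uint32_t)(now - start) >= interval;
}
```

For example, with start = 0xFFFFFFF0 and now = 0x00000005 the difference is 21 ticks, so a 21-tick interval is correctly reported as elapsed even though the counter wrapped in between.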
The mq_timed*() functions are just some of the problematic places;
there are others:
- all synchronization functions with timeout (for semaphores, mutexes,
condition variables, ...),
- POSIX timers,
- delay functions (sleep(), nanosleep(), ...),
- FreeMODBUS timeout functions,
- ...
In an industrial environment - like the one our project is in - uptime
of 49.7 or even 497 days is not a problem, but after that time nuttx
just fails to work reliably.
What are your suggestions?
Regards,
FCh
BTW - in clock_systimespec.c there's probably a bug around line 149.
      usecs = TICK2MSEC(clock_systimer());
      secs  = usecs / USEC_PER_SEC;

As usecs holds milliseconds, the secs value will be completely wrong.
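The unit mismatch is easy to demonstrate in isolation (illustrative constants and helper names mirroring the ones above, not nuttx code):

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000u
#define MSEC_PER_SEC 1000u

/* What the code at line 149 effectively does: the value is in
   milliseconds but is divided as if it were microseconds, so the
   result is 1000x too small (usually just 0): */
static uint32_t secs_as_written(uint32_t msecs)
{
  return msecs / USEC_PER_SEC;
}

/* What it presumably should do with a millisecond input: */
static uint32_t secs_corrected(uint32_t msecs)
{
  return msecs / MSEC_PER_SEC;
}
```

For example, 5 seconds of uptime (5000 ms) comes out as 0 seconds as written, and 5 with the corrected divisor.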