Discussion:
[nuttx] Very serious problem with tick counter overflows
Freddie Chopin freddie_chopin@op.pl [nuttx]
2016-01-21 14:23:53 UTC
Permalink
Hello Greg!

I would like to report a problem in nuttx and also ask you about
possible solutions. The solution we're hoping on is a complete fix in
nuttx, which would probably cause some API changes. Otherwise we would
have to introduce and maintain our own fixes in nuttx, implement
numerous workarounds or completely replace RTOS in this project.

The problem I would like to report is extremely serious and caused the
project we are working on to just silently loose it's main function
after some time. This is far worse than a crash, because on the outside
it all seemed to work perfectly fine, but the main functionality of the
project was not executed.

The problem was discovered in a rather old version of nuttx - we
were using something in between 7.2 and 7.3, but that doesn't matter -
current master still has the same issue.

I'll start with the description of the original issue we had - please
note that this applies to nuttx version 7.2 - in current master the
call path is a bit different.

Our project uses standard 100Hz tick frequency.

We used mq_timedwait() and mq_timedreceive() with a timeout value "10ms
from last timeout" and the timeout condition was essential in the
application. This worked pretty nice, but after about 49 days and 17
hours of continuous uptime the timeout was not detected at all. This is
a short description, but believe me that finding a problem like that is
extremely hard - it was over half a year since the day of the first
report (the first time someone noticed something is wrong) till the day
of finding the root cause of the problem. Fortunately in that time no
expensive machinery was fried to ashes.

It turns out that mq_timedwait() and mq_timedreceive() both
called clock_abstime2ticks() to convert the absolute timeout (timespec)
to number of ticks since system start. This function
called clock_gettime(), which just returned current time. The real
problem was there in line 177 ( https://bitbucket.org/patacongo/nuttx/s
rc/681a5bb37a25659eaee38be73b9b5bce1d0a2e5b/sched/clock_gettime.c?at=nu
ttx-7.2&fileviewer=file-view-default#clock_gettime.c-177 ). The
conversion of "ticks since system start" to milliseconds overflows just
after (2^32)/10 system ticks, which is about 49 days and 17 hours (with
100Hz tick frequency). When you request a timeout around this moment,
you will get some ridiculous value - in our case instead of "10ms from
last timeout" we got something like "40 days from last timeout".
Because of this, there was no timeout reported and the application
stopped behaving like it should, but without any other symptom of the
issue - mq_timedreceive()/mq_timedsend() just didn't return with
ETIMEOUT when they should.

In current master the call path is different, but I believe that it has
the same problem, but now probably here - https://bitbucket.org/patacon
go/nuttx/src/eb44ef2bcde9692d0e7a170490f0fbb1b0d8a59e/sched/clock/clock
_systimespec.c?at=master&fileviewer=file-view-
default#clock_systimespec.c-149 . Even if the variable holding
milliseconds (confusingly named "usecs") is 64-bit long, the
calculations would be 32-bit anyway (I checked on ARM Cortex-M3).

Despite the information in nuttx/sched/Kconfig, the tick timer (with
default 100Hz tick frequency) overflows after just 497 days of uptime,
not 13.5 years.
config SYSTEM_TIME64
bool "64-bit system clock"
default n
---help---
The system timer is incremented at the rate determined
by
USEC_PER_TICK, typically at 100Hz. The count at any
given time is
then the "uptime" in units of system timer ticks.  By
default, the
system time is 32-bits wide.  Those defaults provide a
range of about
13.6 years which is probably a sufficient range for
"uptime".
However, if the system timer rate is significantly
higher than 100Hz
and/or if a very long "uptime" is required, then this
option can be
selected to support a 64-bit wide timer.
The bigger issue here is that enabling this option is pointless... Yes,
the ticks will be accumulated in a 64-bit variable, but all code base
return (uint32_t)(g_system_timer & 0x00000000ffffffff);
There is clock_systimer64(), which does the right thing, but this is
used in exactly one place in nuttx - in fs/procfs/fs_procfsuptime.c.

So you actually cannot change system timer to be 64-bit long -
everywhere this value is just truncated to 32-bits. But even if that
worked fine, then it wouldn't change much - almost everywhere the value
returned by clock_systimer() function is saved in a 32-bit variable. 

mq_timed*() functions are just some problematic places, there are
other:
- all synchronization functions with timeout (for semaphores, mutexes,
condition variables, ...),
- POSIX timers,
- delay functions (sleep(), nanosleep(), ...),
- FreeMODBUS timeout functions,
- ...

In an industrial environment - like the one our project is in - uptime
of 49.7 or even 497 days is not a problem, but after that time nuttx
just fails to work reliably.

What are your suggestions?

Regards,
FCh

BTW - in clock_systimespec.c there's probably a bug around line 149.
      usecs = TICK2MSEC(clock_systimer());
      secs  = usecs / USEC_PER_SEC;
As usecs has milliseconds, secs value will be completely wrong.
Gregory Nutt spudarnia@yahoo.com [nuttx]
2016-01-21 16:12:38 UTC
Permalink
I will look at the use of clock_systimer and the abstime calculations to
see if I can improve the range.

Greg
'David S. Alessio' david.s.alessio@gmail.com [nuttx]
2016-01-21 17:13:35 UTC
Permalink
I believe the fix for this is easy. It’s OK to use unsigned ints for timer values as long as the difference (any subtraction) between two values is calculated using signed ints. i.e.

177: msecs = MSEC_PER_TICK * (g_system_timer - g_tickbias);

should be:

msecs = MSEC_PER_TICK * (unsigned) ((signed)g_system_timer - (signed)g_tickbias);


Using signed subtraction will always produce the correct result, even across an overflow.


Cheers,
-david
Post by Gregory Nutt ***@yahoo.com [nuttx]
I will look at the use of clock_systimer and the abstime calculations to
see if I can improve the range.
Greg
Sebastien Lorquet sebastien@lorquet.fr [nuttx]
2016-01-21 17:15:16 UTC
Permalink
Hello,

yes, this is a mind blowing feature of subtraction.

I could not believe it the first time I was shown this!

sebastien
Post by 'David S. Alessio' ***@gmail.com [nuttx]
I believe the fix for this is easy. It’s OK to use unsigned ints for timer
values as long as the difference (any subtraction) between two values is
calculated using signed ints. i.e.
177: msecs = MSEC_PER_TICK * (g_system_timer - g_tickbias);
msecs = MSEC_PER_TICK * (unsigned) ((signed)g_system_timer - (signed)g_tickbias);
Using signed subtraction will always produce the correct result, even across an overflow.
Cheers,
-david
Post by Gregory Nutt ***@yahoo.com [nuttx]
I will look at the use of clock_systimer and the abstime calculations to
see if I can improve the range.
Greg
spudarnia@yahoo.com [nuttx]
2016-01-21 18:22:49 UTC
Permalink
I don't think that the signed is required either. Unsigned integers also always produce the correct difference.


For example if the older time were 0xfffffffe, then 4 clock ticks elapse. The current time then rolls over to 0x00000002 and the difference of 0x00000002 - 0xfffffffe is 0x00000004.


Two's complement arithmetic. To negate a number, you take the complement and add one:
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=0ahUKEwijyPCywrvKAhXG7B4KHavCAd8QFggkMAI&url=http%3A%2F%2Fwww.cs.fsu.edu%2F~hawkes%2Fcda3101lects%2Fchap4%2Fnegation.html&usg=AFQjCNEuw5e0SRsZZvxuJtJ8PNMiSN8faQ&sig2=-m2baIOTYxcaVy3rOohOtA


So -0xfffffffe is equal to the one's complement + 1 or 0x00000001 + 1 = 0x00000002


That is true whether the value is signed or not. This is just the meaning of negation in two's complement system.


So


0x00000002 - 0xfffffffe = 0x00000002 + 0x00000002 = 0x00000004


I have been relying on this fact for decades.


I don't see any problem with the existing code. Or perhaps I am missing something and you need to explain a little more. Can you give any example where the unsigned arithmetic fails due to an overflow?


Greg
Freddie Chopin freddie_chopin@op.pl [nuttx]
2016-01-21 19:25:49 UTC
Permalink
I don't see any problem with the existing code.  Or perhaps I am
missing something and you need to explain a little more.  Can you
give any example where the unsigned arithmetic fails due to an
overflow?
The problem with both versions of the code was the the multiplication
was done using uint32_t type, even if you assigned the result to
uint64_t variable. This way (in case of typical 100Hz tick frequency)
multiplying anything above 429496729 ((2^32)/10, about 49 days and 17
hours) by the value of milliseconds-per-tick (10) would produce an
invalid result.

429496729 * 10 = 4294967290 = 0xfffffffa <- correct429496730 * 10 = 4294967300 = 0x100000004 -> 0x4 when clipped to
uint32_t <- invalid
After this overflow invalid "current time" value is returned. Instead
of "49 days, 17 hours and a little more" the value is "0 days, 0 hours
and a little more". Now when the clock_abstime2ticks() calculates
relative difference between the absolute timeout deadline and current
time it would give me "49 days, 17 hours and a little more" instead of
the 10 milliseconds that I requested.

I actually think that 64-bit tick timer should be a default if not the
only option - 32-bit counter has too little range to be reliable.

Anyway - now that it is possible to use 64-bit timer, you should also
introduce critical zone around reading the value of this 64-bit
variable - the access to 8-byte long value is not atomic on any
architecture supported by Nuttx. Some call sequences that lead
to clock_systimer() do have interrupts disabled, but not all of them.
At least this was the case before your changes from last hours.

Here's one typo:
https://bitbucket.org/patacongo/nuttx/src/5a9a1b53ef6bebf5d5d062e57307e
6ef1185874b/sched/clock/clock_systimer.c?at=master&fileviewer=file-
view-default#clock_systimer.c-81
Regards,
FCh
Freddie Chopin freddie_chopin@op.pl [nuttx]
2016-01-22 17:35:07 UTC
Permalink
Post by Freddie Chopin ***@op.pl [nuttx]
Anyway - now that it is possible to use 64-bit timer, you should also
introduce critical zone around reading the value of this 64-bit
variable - the access to 8-byte long value is not atomic on any
architecture supported by Nuttx. Some call sequences that lead
to clock_systimer() do have interrupts disabled, but not all of them.
At least this was the case before your changes from last hours.
Ping? This is a very important problem with 64-bit counter. For example
the most basic time() function will cause the 64-bit variable to be
read with interrupts enabled, which could lead to erroneous values.

Regards,
FCh
spudarnia@yahoo.com [nuttx]
2016-01-22 18:26:08 UTC
Permalink
Thank. This should fix that:

https://bitbucket.org/patacongo/nuttx/commits/5c4c572d551eeb511f6d97fb3d3d931f5b257810

Greg
Freddie Chopin freddie_chopin@op.pl [nuttx]
2016-01-22 18:46:56 UTC
Permalink
Post by ***@yahoo.com [nuttx]
https://bitbucket.org/patacongo/nuttx/commits/5c4c572d551eeb511f6d97f
b3d3d931f5b257810
Thanks!

Regards,
FCh

spudarnia@yahoo.com [nuttx]
2016-01-21 18:37:59 UTC
Permalink
I have just committed a change that changes the definitions in include/nuttx/clock.h. Now there is only one available version of clock_systimer(). If the 64-bit timer is selected, it always returns 64-bits; the 32-bit timer is selected, it always returns 32-bits. There is a new type, systime_t that is used instead of uint32_t to catch the system time. It is typedef'ed to uint64_t if the 64-bit timer is selected. Otherwise, it is typdef'ed to uint32_t.


The introduction of the new type did cause a lot of changes to files and, most likely, a few typos.


I will look more at the time conversions now.


Greg
spudarnia@yahoo.com [nuttx]
2016-01-21 19:27:27 UTC
Permalink
.. The real
problem was there in line 177 ( https://bitbucket.org/patacongo/nuttx/s
rc/681a5bb37a25659eaee38be73b9b5bce1d0a2e5b/sched/clock_gettime.c?at=nu
ttx-7.2&fileviewer=file-view-default#clock_gettime.c-177 ). The
conversion of "ticks since system start" to milliseconds overflows just
after (2^32)/10 system ticks, which is about 49 days and 17 hours (with
100Hz tick frequency).
I don't think this is caused by any tricky arithmetic problems.

uint32_t msecs;
msecs = MSEC_PER_TICK * (g_system_timer - g_tickbias);

MSEC_PER_TICK is 10 in your case, so as soon as the difference exceeds 2^32 / 10, the calculation will overflow simply because the value is too large to represent in a uin32_t. The difference is correct; the scaling causes the overflow.

2^32 / 10 / 100 Hz = 49.7 days.

That could probably be fixed just by making msecs a uint64_t. But that is all moot since that logic that you are referring to is no longer in the system

The line corresponding to your overflow in the current code is (now):

uint64_t usecs;
usecs = (uint64_t)TICK2USEC(clock_systimer());

or

#ifdef CONFIG_HAVE_LONG_LONG
uint64_t msecs;
#else
systime_t msecs;
#endif
msecs = TICK2MSEC(clock_systimer());

The latter code is the identical logic that was overflowing in your case. The uint64_t should eliminate the overflow.

I also found un unrelated error in the time conversion in the current code:

sched/clock/clock_systimespec.c
- usecs = TICK2MSEC(clock_systimer());
+ usecs = (uint64_t)TICK2USEC(clock_systimer());

Should be TICK2USEC to convert to microseconds.
The bigger issue here is that enabling this option is pointless... Yes,
the ticks will be accumulated in a 64-bit variable, but all code base
in nuttx (with exactly one insignificant exception) uses clock_systimer()
function/macro to get the number of ticks. ...
As I mentioned, I changed the definitions in nuttx/clock.h. There is no real clock_systimer() anymore. All implementations were renamed clock_systimer32().

I added this definition in clock.h:

/* Select the access to the system timer using its natural with */

#ifdef CONFIG_SYSTEM_TIME64
# define clock_systimer() clock_systimer64()
#else
# define clock_systimer() clock_systimer32()
#endif

So now any user of clock_systimer() will get the currently selected clock width.

Also

/* This type is the natural with of the system timer */

#ifdef CONFIG_SYSTEM_TIME64
typedef uint64_t systime_t;
#else
typedef uint32_t systime_t;
#endif

And I changed the uint32_t that held values of clock_systimer() to systime_t. So the 64-bit timer will be used everywhere (making the RAM usage a little more).

Greg
Loading...