Discussion:
Problem with Ethernet drivers
averyanovin@yahoo.com [nuttx]
2015-09-09 13:31:54 UTC
Permalink
I get an assertion in the Ethernet driver if CONFIG_NET_NOINTS is not selected.
The assertion fires at lpc43_ethernet.c line 1255.
It happens when I call sendto() on a UDP socket, if I ping the board before sending the UDP packet.


If I do not ping the board first, the assertion fires while the code is sending/receiving ARP.
That one happens at lpc43_ethernet.c line 1520.


If CONFIG_NET_NOINTS is defined, everything works perfectly.


Can someone test this on STM32?



P.S. If I build with optimization, the assertion does not happen immediately. If optimization disable i
spudarnia@yahoo.com [nuttx]
2015-09-09 13:38:20 UTC
Permalink
You should always have CONFIG_NET_NOINTS defined if your driver supports the option. The original, old uIP style drivers worked at the interrupt level. But those are deprecated. Someday, when all drivers are converted, I will remove CONFIG_NET_NOINTS and that will be required for all network drivers.


See
https://bitbucket.org/patacongo/nuttx/src/master/TODO#TODO-913


Greg
averyanovin@yahoo.com [nuttx]
2015-09-15 17:46:05 UTC
Permalink
It seems that the problem persists even with CONFIG_NET_NOINTS enabled,
but it is much less common.
Can somebody check how the network behaves under high load (sending a lot of data)?
I am most interested in STM32 or LPC43.

With CONFIG_NET_NOINTS enabled, sending can hang, or assertions fire in the semaphore code. It looks like memory corruption.
spudarnia@yahoo.com [nuttx]
2015-09-15 21:53:26 UTC
Permalink
Are other people seeing problems like this? Or is this just with the configuration of this person?


I assume the latter.


I suspect that you need to check your stack sizes and/or lower the optimization level used by your toolchain.


Greg
averyanovin@yahoo.com [nuttx]
2015-09-15 22:23:12 UTC
Permalink
Tomorrow I will do a few more tests. For now I can say that the problem is not the stack; it is large enough. If optimization is turned off, things get worse.
I will run more tests with different optimization levels, and tomorrow I will post my configuration.
averyanovin@yahoo.com [nuttx]
2015-09-16 08:35:53 UTC
Permalink
New details.
If I disable CONFIG_ARMV7M_USEBASEPRI, everything seems to work fine (40 minutes of operation without a hang or an assertion).
If I disable optimization (set Suppress Optimization) and enable CONFIG_ARMV7M_USEBASEPRI, I get assertions in many different sections of code:
up_assert: Assertion failed at file:armv7-m/up_hardfault.c line: 183 task: hpwork
up_assert: Assertion failed at file:sched/sched_removeblocked.c line: 97 task: init
up_assert: Assertion failed at file:mm_heap/mm_free.c line: 137 task: hpwork
up_assert: Assertion failed at file:armv7-m/up_unblocktask.c line: 78 task: init
With this setup I get an assertion almost immediately.
averyanovin@yahoo.com [nuttx]
2015-09-16 12:42:05 UTC
Permalink
New details.
(lpc43/stm32)_dopoll may be called again before the previous call has completed.
spudarnia@yahoo.com [nuttx]
2015-09-16 13:21:48 UTC
Permalink
That suggests that you might be trying to prioritize interrupts. You don't want to do that! That will cause crashes unless you are configured in a certain way. You must always have:

# CONFIG_ARCH_IRQPRIO is not set

in your .config file or random crashes will occur. This is a normal consequence of the misuse of that setting. See https://groups.yahoo.com/neo/groups/nuttx/conversations/topics/5647

Greg
averyanovin@yahoo.com [nuttx]
2015-09-16 13:46:27 UTC
Permalink
ARCH_IRQPRIO [=n]
Ilya Averyanov averyanovin@yahoo.com [nuttx]
2015-09-16 14:13:04 UTC
Permalink
My config.
Post by ***@yahoo.com [nuttx]
ARCH_IRQPRIO [=n]
averyanovin@yahoo.com [nuttx]
2015-09-18 11:35:33 UTC
Permalink
It seems that the problem occurs when we receive a packet while we are sending one. The problem seems to be related to work_queue. If I disable the processing of received messages, everything works.
spudarnia@yahoo.com [nuttx]
2015-09-18 13:35:10 UTC
Permalink
Sounds like you have a problem with the Ethernet DMA configuration. Interrupt processing must be disabled when work is scheduled on the work queue. So there will be no receive processing by the CPU, but the receive DMA will continue even with interrupt disabled. Memory corruption due to a bad DMA would be a good guess.


Greg
averyanovin@yahoo.com [nuttx]
2015-09-18 15:22:39 UTC
Permalink
I mixed up my messages a little between CONFIG_NET_NOINTS enabled and disabled.
Here is a more correct description.

( ) CONFIG_NET_NOINTS - was disabled. No work queue.
I will describe the bad scenario.
We call sendto.
After some calls we call netdev_txnotify_dev() and, as a result, lpc43_dopoll. In this function we call
devif_poll, which calls udp_poll -> udp_callback -> devif_conn_event -> net_lock -> _net_takesem, and we get rescheduled. Some other thread executes. Then we get an Ethernet interrupt, go to lpc43_dopoll, and hit the assertion because dopoll has been called twice.

(*) CONFIG_NET_NOINTS - was enabled. Have work_queue.
My investigation is incomplete, but there is a scenario that does not look good.

We have only one instance for work_queue:
struct work_s work; /* For deferring work to the work queue */

We add something to the work queue from a thread (calling sendto, which goes to lpc43_txavail and adds to the queue). Now we get an interrupt, and we add a new function to the queue. But the pointer to the function is stored in struct work_s.
priv->work is shared by all of these queue entries.

When we call work_queue(HPWORK, &priv->work, lpc43_interrupt_work, priv, 0);
we replace the function from the previous call.
But there is no real replacement: we lose the newly added worker, because in work_process we
zero the worker (work->worker = NULL;).
Therefore work_s can hold only one element.
averyanovin@yahoo.com [nuttx]
2015-09-18 15:37:45 UTC
Permalink
The main problem is that dopoll is called both from the "interrupt" level (or interrupt + work queue) and from a user thread (user thread + work queue).
spudarnia@yahoo.com [nuttx]
2015-09-18 15:58:14 UTC
Permalink
Post by ***@yahoo.com [nuttx]
The main problem is call dopoll, from "interrupt" level(or interrupt + work queue) and from user thread(user thread + work queue)
Yes, you cannot do that. That will cause crashes. I did not implement the lpc43 Ethernet driver, but if it behaves that way, it is incorrect. Look at drivers/net/skeleton.c line 990 for the correct way. Everything must be done on the work queue -- including all polls.


You cannot mix the logic. CONFIG_NET_NOINTS must be selected and everything must use the work queue. Do not disable CONFIG_NET_NOINTS. And if you do, do not use the work queue. Do not mix.


You will need to fix that Ethernet driver so that it works like the skeleton.c example code.
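
A minimal host-side sketch of the rule above, not NuttX code: both the interrupt handler and the TX-available hook only queue work, and the poll itself runs solely from the worker, so it cannot be re-entered. Every name here (model_work_queue, model_driver, model_dopoll and so on) is hypothetical, and using two separate work items for interrupt work and poll work is an assumption of this sketch, not something taken from skeleton.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for an OS work queue; this is NOT the NuttX API. */

typedef void (*worker_fn)(void *arg);

struct work_item
{
  worker_fn worker;            /* NULL while the item is not queued */
  void *arg;
};

#define MAX_PENDING 4

static struct work_item *g_pending[MAX_PENDING];
static size_t g_npending;

/* Queue an item unless it is already pending or the queue is full. */

static bool model_work_queue(struct work_item *item, worker_fn fn, void *arg)
{
  if (item->worker != NULL || g_npending >= MAX_PENDING)
    {
      return false;
    }

  item->worker = fn;
  item->arg = arg;
  g_pending[g_npending++] = item;
  return true;
}

/* Run all pending items one after another, as a single worker thread would. */

static void model_work_run(void)
{
  for (size_t i = 0; i < g_npending; i++)
    {
      struct work_item *item = g_pending[i];
      worker_fn fn = item->worker;

      item->worker = NULL;     /* The item may be queued again from now on */
      fn(item->arg);
    }

  g_npending = 0;
}

struct model_driver
{
  struct work_item irqwork;    /* Deferred interrupt processing */
  struct work_item pollwork;   /* Deferred TX poll */
  bool in_poll;                /* True while model_dopoll() is running */
  int polls;                   /* Number of completed polls */
};

/* The poll: only ever reached from a worker function. */

static void model_dopoll(struct model_driver *priv)
{
  assert(!priv->in_poll);      /* Would fire if the poll were re-entered */
  priv->in_poll = true;
  priv->polls++;
  priv->in_poll = false;
}

static void model_interrupt_work(void *arg)
{
  model_dopoll((struct model_driver *)arg);
}

static void model_txavail_work(void *arg)
{
  model_dopoll((struct model_driver *)arg);
}

/* Interrupt handler: defers, never polls directly. */

static void model_interrupt(struct model_driver *priv)
{
  model_work_queue(&priv->irqwork, model_interrupt_work, priv);
}

/* Called on behalf of a sending thread: defers, never polls directly. */

static void model_txavail(struct model_driver *priv)
{
  model_work_queue(&priv->pollwork, model_txavail_work, priv);
}
```

Calling model_txavail() and model_interrupt() in any order only marks work as pending; the polls happen later, strictly one at a time, when model_work_run() executes.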



Greg
averyanovin@yahoo.com [nuttx]
2015-09-18 16:03:42 UTC
Permalink
The lpc43 driver is the same as the stm32 one: if you replace every lpc43 name in the lpc43 driver with stm32, you get the same driver.
(Well, some registers differ by one letter, but these drivers are twin brothers.) Everything I say about the lpc43xx Ethernet driver applies equally to the stm32 Ethernet driver.
And before stm32_eth was rewritten to use work_queue, it worked perfectly.
spudarnia@yahoo.com [nuttx]
2015-09-18 19:46:10 UTC
Permalink
I am not aware of any problems with the STM32 Ethernet driver.
alexander.vasiljev@yahoo.com [nuttx]
2015-09-22 11:25:33 UTC
Permalink
Hi,

One of the problems with the LPC43 Ethernet driver is that the memory pointers for DMA receive are rewritten in the recvframe function. Here is a patch to show the problem:

arm/src/lpc43xx/lpc43_ethernet.c
@@ -1514,34 +1529,37 @@ static int lpc43_recvframe(FAR struct lpc43_ethmac_s *priv)

/* Take the buffer from the RX descriptor of the first free
* segment, put it into the uIP device structure, then replace
* the buffer in the RX descriptor with the newly allocated
* buffer.
*/

DEBUGASSERT(dev->d_buf == NULL);
- dev->d_buf = (uint8_t*)rxcurr->rdes2;
- rxcurr->rdes2 = (uint32_t)buffer;
+ dev->d_buf = buffer;
+ memcpy(buffer, (uint8_t*)rxcurr->rdes2, dev->d_len);

The issue is that the memory for DMA receive is initialised in the init function, so we shouldn't rewrite the pointers. And we must not assign memory from lpc43_allocbuffer to rdes2, because this memory will be freed and used for other purposes, so it must not be handed to the DMA.
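
The copy-out approach that the patch proposes can be sketched on a host machine as follows. This is only an illustration under assumptions: model_rxdesc is a hypothetical, simplified descriptor, not the LPC43 descriptor layout, and model_recv_copy is not a driver function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified RX descriptor: the buffer pointer is set once
 * at initialisation and is never changed afterwards.
 */

struct model_rxdesc
{
  uint8_t *buf;                /* Buffer owned by the DMA engine */
  size_t len;                  /* Length of the received frame */
};

/* Copy the received frame into a caller-owned buffer instead of swapping
 * buffer pointers.  Returns the number of bytes copied.
 */

static size_t model_recv_copy(struct model_rxdesc *desc,
                              uint8_t *dst, size_t dstlen)
{
  size_t n = desc->len < dstlen ? desc->len : dstlen;

  memcpy(dst, desc->buf, n);
  return n;
}
```

The trade-off is one extra memcpy per frame in exchange for never giving the DMA a buffer that the rest of the system may later free or reuse.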

---In ***@yahoogroups.com, <***@...> wrote :

I am not aware of any problems with the STM32 Ethernet driver.
alexander.vasiljev@yahoo.com [nuttx]
2015-09-22 11:37:22 UTC
Permalink
Hi,

While debugging I ran into another issue in the LPC43 Ethernet driver. I regularly find the init thread and the Ethernet interrupt call stack inside irqsave simultaneously.

The interrupt call stack is in sem_waitirq. It fails at DEBUGASSERT(sem != NULL && sem->semcount < 0);

The init thread is in net_timedwait. It calls sem_post(&g_netlock);

How can this happen? Both are inside irqsave. One is an interrupt. The other has sched_lock on.
spudarnia@yahoo.com [nuttx]
2015-09-23 13:36:17 UTC
Permalink
Interrupts are automatically re-enabled when a thread goes to sleep. It would, of course, halt the system completely if that were otherwise.


Greg
alexander.vasiljev@yahoo.com [nuttx]
2015-09-23 15:22:17 UTC
Permalink
Unfortunately there is no sleep, not even a semaphore, between irqsave and the assert. I am still getting BASEPRI set to zero inside irqsave.
alexander.vasiljev@yahoo.com [nuttx]
2015-09-24 16:39:00 UTC
Permalink
It seems that killing HPWORK too frequently kills HPWORK. Several sources can queue work to HPWORK (Ethernet interrupt, timer interrupt, init thread). But the kill function is not guaranteed to be consistent under concurrent calls. The up_schedule_sigaction function refuses to handle nested signal actions.
1) The init thread calls work_queue. work_queue calls kill.
2) An interrupt halts the init thread. The interrupt calls work_queue. work_queue calls kill.
3) The init thread resumes execution somewhere in sig_tcbdispatch, but the HPWORK thread already has another state.

Does it look reasonable?
spudarnia@yahoo.com [nuttx]
2015-09-24 19:21:30 UTC
Permalink
The only purpose of the kill() is to wake up the delay in the worker thread so that it will reassess the delay based on the current work queue. It does not kill anything and should be harmless in any event. It does not need to be synchronized with the work queue in any way and should have no effect other than waking up the thread if it is waiting.


What do you suspect is wrong? I only see this normal behavior.


Greg
alexander.vasiljev@yahoo.com [nuttx]
2015-09-25 09:30:19 UTC
Permalink
I am saying that kill() is not reentrant. If several threads call work_queue(), kill() crashes the OS.

Here is my modification to work_queue(). With it I no longer see OS crashes (but I keep on testing).

static int g_workq_kill = 0;

int work_queue(int qid, FAR struct work_s *work, worker_t worker,
               FAR void *arg, uint32_t delay)
{
#ifdef CONFIG_SCHED_HPWORK
  if (qid == HPWORK)
    {
      /* Cancel high priority work */

      int ret = OK;
      work_qqueue((FAR struct kwork_wqueue_s *)&g_hpwork, work, worker, arg, delay);
      g_workq_kill++;
      if (g_workq_kill == 1)
        {
          ret = work_signal(HPWORK);
        }

      g_workq_kill--;
      return ret;
    }
spudarnia@yahoo.com [nuttx]
2015-09-25 13:15:33 UTC
Permalink
I am not aware of any re-entrancy issues with work_signal(), kill() or any of the functions called by kill(). You would have to point me to the root cause of what you believe is the re-entrancy issue for me to have an opinion.


greg
alexander.vasiljev@yahoo.com [nuttx]
2015-09-25 16:48:40 UTC
Permalink
Let's look at sig_tcbdispatch().
up_schedule_sigaction() and up_unblock_task() are not guarded.

Some thread can call up_schedule_sigaction().
Then some interrupt can call up_schedule_sigaction() again, but it doesn't really do anything because up_schedule_sigaction() doesn't allow it.
Then the interrupt calls up_unblock_task().
Then the thread calls up_unblock_task().
Then the thread goes to sched_unlock().
sched_unlock() calls irqsave().
Then it calls up_release_pending().
up_release_pending() makes a context switch. (The irq state is still saved in sched_unlock().)
What I get is ASSERT(rtcb->xcp.sigdeliver != NULL) in up_sigdeliver().

What about switching context inside irqsave? Is it safe?
spudarnia@yahoo.com [nuttx]
2015-09-25 17:07:01 UTC
Permalink
Interrupts must always be disabled when up_unblock_task() is called. Is there some place where up_unblock_task() is called with interrupts enabled? There should not be.


Greg
alexander.vasiljev@yahoo.com [nuttx]
2015-09-25 17:18:25 UTC
Permalink
up_unblock_task() is not called with interrupts enabled anywhere.
spudarnia@yahoo.com [nuttx]
2015-09-25 18:01:10 UTC
Permalink
Post by ***@yahoo.com [nuttx]
up_unblock_task() is not called with interrupts enabled anywhere.
...
Then the interrupt call up_unblock_task().
Then the thread call up_unblock_task().
...
What i got is ASSERT(rtcb->xcp.sigdeliver != NULL) in up_sigdeliver().
What signal handler would this be? The work queue does not use a signal handler. As I said earlier, the only purpose of signaling the worker thread is to wake up the delay. No signal handling is involved.


My suspicion is still as it was at the beginning of this long thread. You probably have something wrong in your Ethernet driver (or some other code) that is trashing memory ... a bad buffer access, a bad DMA, or something like that. The resulting memory corruption trashes operating system data structures and causes crazy results. This can be very difficult to debug because following the symptoms of the failure never leads to any interpretable result. The only way to solve such problems is to find the root cause of the memory corruption.


In fact, I have seen that assertion fire in up_sigdeliver under just such conditions of memory corruption. It is a common failure when something else is wrong and I have never seen it have anything to do with signaling or any other error in the OS.


Greg
averyanovin@yahoo.com [nuttx]
2015-09-28 15:39:56 UTC
Permalink
The STM32 driver has a problem too,
but it looks different.

If we send an average of 100 pps from the PC, the STM32 stops sending. (The thread calling sendto waits on a semaphore.)
averyanovin@yahoo.com [nuttx]
2015-09-28 16:41:00 UTC
Permalink
Post by ***@yahoo.com [nuttx]
You probably have something wrong in your Ethernet driver (or some other code) that is trashing memory ... a bad buffer access, and bad DMA, or something like that.
There is no problem with DMA or any other memory trashing. It looks like a race condition.
If this were memory corruption, we would get a completely trashed tcb struct, but we only have an incorrect state.
Sometimes we have tcb.flink = tcb.
spudarnia@yahoo.com [nuttx]
2015-09-28 17:42:44 UTC
Permalink
Certainly it is always possible for race condition bugs to exist. The problems you have described to me thus far are all clearly due to memory corruption, however. I have seen no evidence yet for a race condition.
spudarnia@yahoo.com [nuttx]
2015-09-28 17:49:48 UTC
Permalink
TCP, right? This is not due to any kind of race condition. This is a buffering and data overrun problem. You probably have both TCP read-ahead and write buffering enabled. If you bombard the target with packets at a higher rate than it can process them, then all of the buffers will get committed to holding read-ahead data.


Then, when you try to send with write buffering, there will not be sufficient buffers available to perform the send. You would have to read some of the data in order to free up buffering.


The solution is to adjust some of the buffering configuration. In such a case, there is no reason to enable read ahead buffering at all. It can't do you any good.


There is also a setting called CONFIG_IOB_THROTTLE that you can tinker with. That is a value that will "throttle" requests for IO buffers for read-ahead, reserving a certain number so that you can continue to send. You might try to increase this value.
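
The throttling idea can be illustrated with a small model. This is a sketch of the concept only, not the NuttX IOB implementation: the pool, its size, the reserve value, and the function names are all assumptions made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NBUFFERS 8       /* Assumed total number of I/O buffers */
#define MODEL_THROTTLE 3       /* Assumed number held back from read-ahead */

struct model_pool
{
  int nfree;                   /* Buffers currently available */
};

static void model_pool_init(struct model_pool *pool)
{
  pool->nfree = MODEL_NBUFFERS;
}

/* Allocate one buffer.  A throttled request (read-ahead) is refused once
 * only the reserved buffers remain; an unthrottled request (send) may use
 * the reserve as well.
 */

static bool model_alloc(struct model_pool *pool, bool throttled)
{
  int floor = throttled ? MODEL_THROTTLE : 0;

  if (pool->nfree <= floor)
    {
      return false;
    }

  pool->nfree--;
  return true;
}

/* Return one buffer to the pool. */

static void model_free(struct model_pool *pool)
{
  if (pool->nfree < MODEL_NBUFFERS)
    {
      pool->nfree++;
    }
}
```

With these numbers, a flood of incoming data can consume at most five buffers, which leaves three for outgoing traffic. Raising the reserve shifts that balance further toward sending.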


Greg
spudarnia@yahoo.com [nuttx]
2015-09-28 17:59:56 UTC
Permalink
Another person reported a data overrun problem to me. In a situation similar to what you describe, the host attempts to send data to the target at a much faster rate than the target can receive. In that case, this person was seeing missing data in the TCP stream received by the target.


So I do suspect some issues with buffering and perhaps also with the handling of ACKs and retransmit requests from the target in read-ahead data overrun conditions. The buffering issues are a normal consequence of the configuration and I don't have any hard evidence other than this one report to suggest there is an ACK/re-transmit problem or not.


ACK/re-transmit is a terrible solution to a data pacing problem in any case.


Greg
averyanovin@yahoo.com [nuttx]
2015-09-28 18:06:15 UTC
Permalink
No, we use UDP on both sides (PC and microcontroller).

On the microcontroller we have this code:
int test_main(int argc, char *argv[])
{
  int sockfd;
  struct sockaddr_in server;
  socklen_t addrlen;
  uint8_t data[] = "long long string leth about one eth frame(1400)";

  /* Board network setup (helper functions not shown in this post) */

  setMacAddr("\x0\x60\x37\x0\x0\x1");
  setMask("255.255.255.0");
  setAddr("192.168.0.99");

  /* Destination: the PC */

  server.sin_family = AF_INET;
  server.sin_port = htons(5000);
  server.sin_addr.s_addr = inet_addr("192.168.0.1");
  addrlen = sizeof(struct sockaddr_in);

  sockfd = socket(PF_INET, SOCK_DGRAM, 0);

  /* Send as fast as possible, forever */

  while (1)
    {
      sendto(sockfd, data, sizeof(data), 0,
             (struct sockaddr *)&server, addrlen);
    }
}
On the PC the code is the same, with rate control added.
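
The PC-side code was not posted. A minimal sketch of what a rate-controlled UDP sender could look like with POSIX sockets is shown here; the helper names interval_ns and send_paced, the 1400-byte payload, and whatever address and port the caller passes are assumptions for illustration, not the poster's actual test program.

```c
#define _POSIX_C_SOURCE 200809L

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Gap between packets in nanoseconds for a given packets-per-second rate.
 * A rate of 0 means "no pacing" and yields a gap of 0.
 */

static long interval_ns(unsigned int pps)
{
  return pps == 0 ? 0 : 1000000000L / (long)pps;
}

/* Send 'count' datagrams to ip:port at roughly 'pps' packets per second.
 * Returns the number of datagrams sent, or -1 on a setup error.
 */

static int send_paced(const char *ip, uint16_t port,
                      unsigned int pps, int count)
{
  uint8_t payload[1400];       /* About one Ethernet frame of data */
  struct sockaddr_in dest;
  struct timespec gap;
  long ns = interval_ns(pps);
  int sent = 0;
  int fd;

  fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0)
    {
      return -1;
    }

  memset(payload, 'x', sizeof(payload));
  memset(&dest, 0, sizeof(dest));
  dest.sin_family = AF_INET;
  dest.sin_port = htons(port);
  if (inet_pton(AF_INET, ip, &dest.sin_addr) != 1)
    {
      close(fd);
      return -1;
    }

  gap.tv_sec = ns / 1000000000L;
  gap.tv_nsec = ns % 1000000000L;

  for (int i = 0; i < count; i++)
    {
      ssize_t n = sendto(fd, payload, sizeof(payload), 0,
                         (struct sockaddr *)&dest, sizeof(dest));
      if (n == (ssize_t)sizeof(payload))
        {
          sent++;
        }

      nanosleep(&gap, NULL);   /* Simple pacing; ignores time spent sending */
    }

  close(fd);
  return sent;
}
```

For example, send_paced("192.168.0.99", 5000, 100, 1000) would aim 1000 datagrams at the board at about 100 pps; the port number in that call is a guess, since the thread does not say which port the board listens on.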
Brennan Ashton bashton@brennanashton.com [nuttx]
2015-09-28 18:46:06 UTC
Permalink
I have also seen some issues with the STM32 Ethernet driver with bursty
traffic, but it comes up with the failure of DEBUGASSERT(priv->dev.d_buf !=
NULL); but I am fairly sure that it was a buffer overflow issue somewhere
higher up in the network stack, and not a driver level issue.

--Brennan
averyanovin@yahoo.com [nuttx]
2015-09-28 19:26:04 UTC
Permalink
DEBUGASSERT(priv->dev.d_buf != NULL); can also fire if CONFIG_NET_NOINTS is not selected.
averyanovin@yahoo.com [nuttx]
2015-09-29 13:02:14 UTC
Permalink
Greg, you said:
Post by ***@yahoo.com [nuttx]
You should always have CONFIG_NET_NOINTS defined if your driver supports the option. The original, old uIP style drivers worked at the interrupt level. But those are deprecated.
But CONFIG_NET_NOINTS has terrible performance.
On STM32F4 with CONFIG_NET_NOINTS disabled I get around 100 Mb (without optimisation). With CONFIG_NET_NOINTS enabled the maximum speed is around 52 Mb with full optimisation and 30 Mb without optimisation.
averyanovin@yahoo.com [nuttx]
2015-09-29 13:08:42 UTC
Permalink
The STM32 driver gets an assertion with CONFIG_NET_NOINTS enabled:
up_assert: Assertion failed at file:armv7-m/up_sigdeliver.c line: 100 task: hpwork
up_dumpstate: sp: 100016d0
up_dumpstate: stack base: 10001868
up_dumpstate: stack size: 00000fec
up_stackdump: 100016c0: 00000001 00000010 100017b0 0800370f 100017f8 080036d7 100003e0 100017b8
up_stackdump: 100016e0: 00000010 100003e0 00000000 00000010 100017f8 100017f0 10001820 00000002
up_stackdump: 10001700: 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 10001720: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 10001740: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 10001760: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 10001780: 00000000 00000000 00000002 10000454 10001924 00000001 00000002 10000454
up_stackdump: 100017a0: 10001924 00000001 100017b0 100003e0 00000000 0800700d 100003e0 00000000
up_stackdump: 100017c0: 00000000 00000000 20005488 00000000 00000010 10001820 00000000 20005704
up_stackdump: 100017e0: 0000353f 00000000 00000002 08006c63 00000000 00000000 000f4240 000003e8
up_stackdump: 10001800: 00000000 000000f0 20005540 00000002 20005704 0000353f 00000000 080026a9
up_stackdump: 10001820: 00000000 01312d00 e000e104 08002257 00000000 00000000 00000000 00000000
up_stackdump: 10001840: 00000000 00000000 00000000 00000000 00000000 08001b3d 08001b2d 080016ef
up_stackdump: 10001860: 00000000 00000000 ea58240a 10001874 00000000 6f777068 80006b72 af6fa0f4
Gregory Nutt spudarnia@yahoo.com [nuttx]
2015-09-29 13:39:01 UTC
Permalink
Post by ***@yahoo.com [nuttx]
Greg you say
Post by ***@yahoo.com [nuttx]
You should always have CONFIG_NET_NOINTS defined if your driver
supports the option. The original, old uIP style drivers worked at
the interrupt level. But those are deprecated.
But CONFIG_NET_NOINTS terrible performance.
On STM32F4 with disable CONFIG_NET_NOINTS I get around 100 Mb(without
optimisation). With enable CONFIG_NET_NOINTS max speed around 52 Mb
with full optimisation and 30 Mb without optimisation.
The network may be faster with CONFIG_NET_NOINTS disabled, but it is
still wrong. That option is being eliminated. And will no longer be
supported when the last driver is converted (see the TODO file). It is a
deprecated feature. It is not supported and will eventually be removed
altogether.

Greg
averyanovin@yahoo.com [nuttx]
2015-09-30 09:46:00 UTC
Permalink
But how can we use work_queue if it is not working? Can you test it yourself on stm32 (or on SAM*)?
What hardware do you have?

The test is simple. You send many UDP packets from the PC, and on the microcontroller, in main, you send UDP as fast as possible.

After a short time (depending on many things: optimisation level, BASEPRI/PRIMASK, and other options) you get a freeze on sending or an assertion.
spudarnia@yahoo.com [nuttx]
2015-09-30 17:25:43 UTC
Permalink
I added a test case that duplicates your test scenario:


https://bitbucket.org/nuttx/apps/commits/2e0d0ede6d11ee4c391cad1394081c6fd2d22bba


And I was able to duplicate the assertion that you describe. It was pretty easy to fix... there was a non-atomic test-and-modify operation in up_schedule_sigaction. After this change I was able to run for hours with no issues:


https://bitbucket.org/nuttx/arch/commits/22e48d4266c8c6a8d31aefabe375dcb78173975b
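
For readers unfamiliar with this class of bug, here is a host-side sketch of a non-atomic test-and-modify and its fix. It is not the code from the commit: model_tcb, model_irqsave, the simulated interrupt hook, and both schedule functions are hypothetical stand-ins that only illustrate why the test has to sit inside the same critical section as the modification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*sig_deliver_fn)(void);

struct model_tcb
{
  sig_deliver_fn sigdeliver;   /* Pending signal delivery, or NULL */
};

static struct model_tcb g_tcb;
static int g_irq_depth;            /* > 0 models "interrupts disabled" */
static void (*g_irq_hook)(void);   /* Simulated interrupt, fires once */
static int g_count_a;
static int g_count_b;

static int model_irqsave(void) { return g_irq_depth++; }
static void model_irqrestore(int flags) { g_irq_depth = flags; }

/* A point where an interrupt could arrive if interrupts are enabled. */

static void model_irq_window(void)
{
  if (g_irq_depth == 0 && g_irq_hook != NULL)
    {
      void (*hook)(void) = g_irq_hook;

      g_irq_hook = NULL;
      hook();
    }
}

static void handler_a(void) { g_count_a++; }
static void handler_b(void) { g_count_b++; }

/* Simulated interrupt that schedules handler_b if nothing is pending. */

static void irq_sets_b(void)
{
  if (g_tcb.sigdeliver == NULL)
    {
      g_tcb.sigdeliver = handler_b;
    }
}

/* Racy: the test runs with interrupts enabled, the modify runs later. */

static bool schedule_racy(struct model_tcb *tcb, sig_deliver_fn fn)
{
  if (tcb->sigdeliver == NULL)
    {
      model_irq_window();          /* An interrupt can slip in here */

      int flags = model_irqsave();
      tcb->sigdeliver = fn;        /* May overwrite the interrupt's value */
      model_irqrestore(flags);
      return true;
    }

  return false;
}

/* Fixed: test and modify share one critical section. */

static bool schedule_atomic(struct model_tcb *tcb, sig_deliver_fn fn)
{
  bool scheduled = false;
  int flags = model_irqsave();

  if (tcb->sigdeliver == NULL)
    {
      model_irq_window();          /* Cannot fire: interrupts are disabled */
      tcb->sigdeliver = fn;
      scheduled = true;
    }

  model_irqrestore(flags);
  return scheduled;
}
```

In the racy version the simulated interrupt installs handler_b between the test and the store, and the thread then silently overwrites it with handler_a. In the fixed version the interrupt cannot run until the critical section ends.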



I also saw an occasional assertion failing in the work queue. That was caused by a similar non-atomic operation and was fixed by this:


https://bitbucket.org/patacongo/nuttx/commits/6cde2be910c7710ef6a8e490b35d851ccd115b74



I don't think that one was fatal however.


Greg

averyanovin@yahoo.com [nuttx]
2015-09-18 16:10:32 UTC
Permalink
Oh, sorry: my description of the "(*) CONFIG_NET_NOINTS enabled, with work_queue" case is not correct.
averyanovin@yahoo.com [nuttx]
2015-09-16 14:15:49 UTC
Permalink
This is my .config.