Author Topic: LWIP / 32F417 - how to optimise network stack memory allocation? (Read 3182 times)

peter-h · « **on:** July 07, 2022, 09:32:59 pm »

I have spent hours googling on this. As usual one digs up a vast number of forum and usenet posts, mostly with no clear conclusions. Some people are trying to optimise two session performance. I realise there is no optimal config, but there can be really bad ones and I want to make sure I don't have one of those.

This project was set up by someone else. It uses the ST port of LWIP and FreeRTOS.

What I am after is the case of optimising a single connection. Multiple sessions should still work but perhaps slowly; that's fine. The constraints are

- minimising memory usage (currently ETH is a total of ~30k and I would like to shave 10k off that).

- the ETH RX is polled from an RTOS task, at say 100Hz, not interrupt-driven (done for simplicity, and protection from an accidental or deliberate DOS situation), so maybe we need more RX buffers than usual, although 1 MTU at 100Hz is pretty respectable?
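To put a number on the polled-RX constraint above, here is a rough sketch of the ceiling it implies. It assumes every poll drains all the driver RX buffers and that frames arriving when no buffer is free are simply dropped; the function name is made up for illustration.

```c
#include <stdint.h>

/* Upper bound on sustained RX throughput when the driver is polled:
 * at most rx_buffers frames can accumulate between two polls.
 * ASSUMPTION: each poll empties every buffer, and overflow frames are lost. */
static uint32_t polled_rx_max_bytes_per_sec(uint32_t rx_buffers,
                                            uint32_t frame_bytes,
                                            uint32_t poll_hz)
{
    return rx_buffers * frame_bytes * poll_hz;
}
```

So one MTU-sized buffer polled at 100Hz tops out at 150 kbytes/sec, and four buffers at 600 kbytes/sec, before any stack overhead is counted.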

LWIP seems to use memory from two pools, defined in lwipopts.h:

- MEM_SIZE (a static block in which LWIP runs some sort of private heap; this was 10k but by inspection of how much of it gets actually used, is now 5k)
- other stuff like TCP_MSS, TCP_SND_BUF, etc and these end up elsewhere

There is also pbuf_pool_size (8x512) and others which take up 7.5k. This may be where the "other stuff" above gets allocated from.

Finally ethernetif.c has packet buffers which take up 12520 bytes but that seems separate from LWIP; there is a module called ETHIF which connects the ST ETH hardware to LWIP.

There is a lot of nontrivial stuff in this. For example I would have expected all packet buffers to be MTU size (1500) and smaller ones will be just wasted, but the code joins together smaller buffers so it still works (3 x 500 byte = 1500) and the 500 byte ones give better performance for smaller packets because you have more buffers.
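The small-versus-large buffer tradeoff above comes down to a ceiling division. A minimal sketch, with hypothetical helper names, ignoring any header space the stack reserves inside each buffer:

```c
#include <stdint.h>

/* Number of pool buffers chained together to hold one frame (ceiling division). */
static uint32_t pbufs_per_frame(uint32_t frame_len, uint32_t pool_bufsize)
{
    return (frame_len + pool_bufsize - 1u) / pool_bufsize;
}

/* How many frames of a given length fit in the pool at the same time. */
static uint32_t frames_in_pool(uint32_t pool_size, uint32_t pool_bufsize,
                               uint32_t frame_len)
{
    uint32_t per_frame = pbufs_per_frame(frame_len, pool_bufsize);
    if (per_frame == 0u) {
        return pool_size;      /* zero-length frame: avoid dividing by zero */
    }
    return pool_size / per_frame;
}
```

With 8 x 512 the pool holds two full-size 1514-byte frames (three buffers each) or eight 64-byte frames, whereas 3 full-size buffers hold three frames of any length.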

I also wonder how much data copying goes on. I know the low level ETH stuff uses a dedicated DMA controller, but it looks like there is data copying (using memcpy) higher up, which ought to be done with DMA.

At the risk of creating a post which nobody responds to, I am posting my existing lwipopts.h file below.

Code: [Select]

/**
  ******************************************************************************
  * @file    LwIP/LwIP_HTTP_Server_Netconn_RTOS/Inc/lwipopts.h
  * @author  MCD Application Team
  * @brief   lwIP Options Configuration.
  ******************************************************************************
*
*
*	7/7/22	PH	MEM_SIZE set to 5k (was 10k). Only ~1.5k is used. ram_heap is 0x20002930.
*
*
*
*
*
*
  */
#ifndef __LWIPOPTS_H__
#define __LWIPOPTS_H__

/**
 * NO_SYS==1: Provides VERY minimal functionality. Otherwise,
 * use lwIP facilities.
 */
#define NO_SYS                  0

/* STM32CubeMX Specific Parameters (not defined in opt.h) ---------------------*/
/* Parameters set in STM32CubeMX LwIP Configuration GUI -*/
/*----- WITH_RTOS enabled (Since FREERTOS is set) -----*/
#define WITH_RTOS 1
/*----- WITH_MBEDTLS enabled (Since MBEDTLS and FREERTOS are set) -----*/
#define WITH_MBEDTLS 1


//NC: Need for sending PING messages by keepalive
#define LWIP_RAW 1
#define DEFAULT_RAW_RECVMBOX_SIZE 4

/*-----------------------------------------------------------------------------*/

/* LwIP Stack Parameters (modified compared to initialization value in opt.h) -*/
/* Parameters set in STM32CubeMX LwIP Configuration GUI -*/

/*----- Value in opt.h for LWIP_DNS: 0 -----*/
#define LWIP_DNS 1

/* ---------- Memory options ---------- */
/* MEM_ALIGNMENT: should be set to the alignment of the CPU for which
   lwIP is compiled. 4 byte alignment -> define MEM_ALIGNMENT to 4, 2
   byte alignment -> define MEM_ALIGNMENT to 2. */
#define MEM_ALIGNMENT           4

/* MEM_SIZE: the size of the heap memory. If the application will send
a lot of data that needs to be copied, this should be set high. */
// This is used mainly for RAM PBUFs.
#define MEM_SIZE                (5*1024)

/* MEMP_NUM_PBUF: the number of memp struct pbufs. If the application
   sends a lot of data out of ROM (or other static memory), this
   should be set high. */
#define MEMP_NUM_PBUF           5 //10
/* MEMP_NUM_UDP_PCB: the number of UDP protocol control blocks. One
   per active UDP "connection". */
#define MEMP_NUM_UDP_PCB        10 //6
/* MEMP_NUM_TCP_PCB: the number of simultaneously active TCP
   connections. */
// This controls how much of the mem_size area gets filled up with http etc packets.
#define MEMP_NUM_TCP_PCB        5 //10
/* MEMP_NUM_TCP_PCB_LISTEN: the number of listening TCP
   connections. */
#define MEMP_NUM_TCP_PCB_LISTEN 5
/* MEMP_NUM_TCP_SEG: the number of simultaneously queued TCP
   segments. */
#define MEMP_NUM_TCP_SEG        8
/* MEMP_NUM_SYS_TIMEOUT: the number of simultaneously active
   timeouts. */
#define MEMP_NUM_SYS_TIMEOUT    10


/* ---------- Pbuf options ---------- */
/* PBUF_POOL_SIZE: the number of buffers in the pbuf pool. */
#define PBUF_POOL_SIZE          8

/* PBUF_POOL_BUFSIZE: the size of each pbuf in the pbuf pool. */
#define PBUF_POOL_BUFSIZE       512


/* ---------- TCP options ---------- */
#define LWIP_TCP                1
#define TCP_TTL                 255

/* Controls if TCP should queue segments that arrive out of
   order. Define to 0 if your device is low on memory. */
#define TCP_QUEUE_OOSEQ         0

/* TCP Maximum segment size. */
#define TCP_MSS                 (1500 - 40)	  /* TCP_MSS = (Ethernet MTU - IP header size - TCP header size) */

/* TCP sender buffer space (bytes). */
#define TCP_SND_BUF             (4*TCP_MSS)

/*  TCP_SND_QUEUELEN: TCP sender buffer space (pbufs). This must be at least
  as much as (2 * TCP_SND_BUF/TCP_MSS) for things to work. */

#define TCP_SND_QUEUELEN        (2* TCP_SND_BUF/TCP_MSS)

/* TCP receive window. */
#define TCP_WND                 (2*TCP_MSS)


/* ---------- ICMP options ---------- */
#define LWIP_ICMP                       1


/* ---------- DHCP options ---------- */
#define LWIP_DHCP               1


/* ---------- UDP options ---------- */
#define LWIP_UDP                1
#define UDP_TTL                 255


/* ---------- Statistics options ---------- */
#define LWIP_STATS 0

/* ---------- link callback options ---------- */
/* LWIP_NETIF_LINK_CALLBACK==1: Support a callback function from an interface
 * whenever the link changes (i.e., link down)
 */
#define LWIP_NETIF_LINK_CALLBACK        1

#define LWIP_TCPIP_CORE_LOCKING   0
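The TCP and pbuf values in the file above depend on each other, so one way to avoid a really bad combination is to state the relationships as compile-time checks. This is only a sketch: the numbers are copied from the file above, the 54 header bytes per buffer are my assumption (plain IPv4 over Ethernet, no padding), and the conditions follow the comments in the file plus my understanding of the sanity checks lwIP itself applies, so treat them as assumptions rather than the library's exact rules.

```c
#include <assert.h>   /* static_assert (C11) */

/* Values copied from the lwipopts.h above */
#define TCP_MSS            (1500 - 40)
#define TCP_SND_BUF        (4 * TCP_MSS)
#define TCP_SND_QUEUELEN   (2 * TCP_SND_BUF / TCP_MSS)
#define TCP_WND            (2 * TCP_MSS)
#define MEMP_NUM_TCP_SEG   8
#define PBUF_POOL_SIZE     8
#define PBUF_POOL_BUFSIZE  512

/* ASSUMPTION: link + IP + TCP header space counted against each pool buffer
 * (14 + 20 + 20 bytes), which is the conservative case. */
#define ASSUMED_HDR_BYTES  (14 + 20 + 20)

/* Payload bytes the whole pbuf pool can hold at once */
#define POOL_PAYLOAD_BYTES \
    (PBUF_POOL_SIZE * (PBUF_POOL_BUFSIZE - ASSUMED_HDR_BYTES))

static_assert(TCP_SND_BUF >= 2 * TCP_MSS,
              "send buffer should hold at least two segments");
static_assert(TCP_SND_QUEUELEN >= 2 * (TCP_SND_BUF / TCP_MSS),
              "queue length too small for the send buffer");
static_assert(MEMP_NUM_TCP_SEG >= TCP_SND_QUEUELEN,
              "not enough TCP segment structs for the send queue");
static_assert(TCP_WND <= POOL_PAYLOAD_BYTES,
              "receive window larger than what the pbuf pool can hold");

/* Helper so the derived number can be inspected at run time */
static int pool_payload_bytes(void)
{
    return POOL_PAYLOAD_BYTES;
}
```

With the values above everything passes: the pool can hold 3664 payload bytes against a 2920-byte receive window, and MEMP_NUM_TCP_SEG is exactly equal to TCP_SND_QUEUELEN, so there is no slack there.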

Many thanks in advance for any tips.

peter-h · « **Reply #1 on:** July 10, 2022, 06:12:00 am »

I guess nobody wants to get into this can of worms.

Reading up on the way these two work

Code: [Select]

#define PBUF_POOL_SIZE             3
#define PBUF_POOL_BUFSIZE       (1500 + PBUF_LINK_ENCAPSULATION_HLEN + PBUF_LINK_HLEN + PBUF_IP_HLEN + PBUF_TRANSPORT_HLEN)

I found that while the previous values of

Code: [Select]

#define PBUF_POOL_SIZE             8
#define PBUF_POOL_BUFSIZE       512

do work, they cause the system to crash every few hours. This is supposed to work i.e. LWIP is supposed to concatenate PBUFs to handle packets longer than (in this case) 512, but maybe there is a bug there. Quite a lot of online examples do use the full-MTU PBUF size, even though it is less optimal for smaller packets.
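A side note on macros such as PBUF_POOL_BUFSIZE when they are defined as a sum: they are safest wrapped in outer parentheses, because any multiplication applied to the macro otherwise binds to the first term only. The sketch below uses made-up macro names and an assumed 54-byte header total. lwIP may well parenthesise the value wherever it uses it, so this may be harmless in practice, and I am not suggesting it explains the crashes.

```c
/* ASSUMPTION: 54 bytes of headers, purely for illustration */
#define HDR_LEN         54
#define BUFSIZE_UNSAFE  1500 + HDR_LEN      /* no outer parentheses */
#define BUFSIZE_SAFE    (1500 + HDR_LEN)

/* Expands to n * 1500 + 54, which is not what was intended */
static unsigned pool_bytes_unsafe(unsigned n)
{
    return n * BUFSIZE_UNSAFE;
}

/* Expands to n * (1500 + 54) */
static unsigned pool_bytes_safe(unsigned n)
{
    return n * BUFSIZE_SAFE;
}
```

For three buffers the unsafe form gives 4554 bytes instead of the intended 4662.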

The thing is, there are other bottlenecks in LWIP. For example, while the low level ETH uses a dedicated DMA (it has to, otherwise 100 Mbps ETH would never work), higher up (in ETHIF) it uses memcpy to move the data around. Only the NXP version of LWIP gets the ETH DMA to move the data directly to LWIP:
https://community.nxp.com/t5/LPCware-Archive-Content/LWIP-buffer-management/ta-p/1110512

By doing some mods I've increased the RAM available for the MbedTLS heap by 6k, which is a lot. Now I have 64k.

Now I am looking at ETH_RX_BUF_SIZE and ETH_TX_BUF_SIZE. The standard count for these is 4 of each, which is 12k of RAM. Are 8 MTU-sized buffers really necessary? A lot of googling suggests a few people have questioned it but got no responses. I wonder if this is required by the 32F4 ETH hardware? Clearly not, because 3 and 2 (suggested somewhere) work just fine.

It seems to me that this whole debate hinges on how fast the RTOS gets around to running the ETH task, versus how fast the ETH interface is running. But few if any 32F4 based embedded systems will be running data at a sustained 100 Mbps. I would bet the vast majority are far slower, and no high speed is needed. In most remote control / data acquisition applications very little data is actually moving.

tellurium · « **Reply #2 on:** July 10, 2022, 04:34:11 pm »

Note: 4 MTU-sized buffers are part of the driver, not TCP/IP stack code.
They form a FIFO to handle packet bursts.
If your firmware is faster reading the arrived buffer and marking it free again, you should be fine with only 2.
The buffer read latency should be less than time between any two arriving packets.
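For a feel of what "time between any two arriving packets" means on the wire, the worst case can be worked out from the Ethernet framing overhead (8 bytes of preamble and start delimiter plus a 12-byte inter-frame gap). A small sketch with a hypothetical function name:

```c
#include <stdint.h>

/* Time one Ethernet frame occupies the wire, in nanoseconds.
 * frame_bytes includes the MAC header and FCS (64..1518 for untagged frames). */
static uint64_t frame_time_ns(uint32_t frame_bytes, uint32_t mbps)
{
    const uint32_t overhead = 8u + 12u;   /* preamble + SFD, inter-frame gap */
    uint64_t bits = (uint64_t)(frame_bytes + overhead) * 8u;

    if (mbps == 0u) {
        return 0u;                        /* guard against division by zero */
    }
    return bits * 1000u / mbps;           /* bits / (mbps * 1e6) s, in ns */
}
```

At 100 Mbps, back-to-back minimum-size frames arrive every 6.72 us and back-to-back full-size frames every 123.04 us, which is the latency the read path would have to beat to survive a line-rate burst with only 2 buffers.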

The HAL driver does not know how fast your firmware is, and defaults to a maximum, which is 4, as far as I remember.

peter-h · « **Reply #3 on:** July 10, 2022, 05:25:16 pm »

Quote

4 MTU-sized buffers are part of the driver, not TCP/IP stack code.

Understood.

However, would reducing this make the system unreliable?

AIUI, if the ETH end runs out of buffers, it just refuses incoming packets. So this should be a straight memory v. performance tradeoff.

In general comms, one buffer is bad because there will always be a time gap. With two buffers it is always a lot better because you have the whole packet period to extract the packet. With more than two, it is only to allow say an RTOS task to get around to extracting the whole RX queue (or most of it). But if you are doing RX under interrupt, then the case for > 2 is quite dubious... unless you allow the ISR to hang in there and loop around extracting as many packets as it can find, but that is dodgy because fast RX data will then hang the whole system.

I would also think that to achieve anywhere near 100 Mbps sustained you would need a highly optimised system, not just FreeRTOS with 20 tasks hanging off it, LEDs flashing, debug printfs, serial IO, etc, and it switching at a leisurely 1kHz.

Just the memcpy() moving packets between these buffers and the LWIP buffers will create a significant bottleneck.

Here
https://community.st.com/s/question/0D53W00001Gi9BoSAJ/ethernet-hal-driver-reworked-by-st-and-available-in-22q1-preview-now-available-on-github
they claim 94 Mbps but I bet that is a very simple demo system.

Quote

The HAL driver does not know how fast your firmware is, and defaults to a maximum, which is 4, as far as I remember.

It defaults to 4 for TX and 4 for RX, but that uses up 12k+ of RAM.

tellurium · « **Reply #4 on:** July 10, 2022, 09:37:27 pm »

Quote from: peter-h on July 10, 2022, 05:25:16 pm

However, would reducing this make the system unreliable?

Of course, reducing FIFO size makes system more sensitive.

As I have mentioned, everything depends on the read latency and timing between packets. If your read latency is large - and I think it is large because of polling approach, then even 4 buffers might not be enough, again that depends on the timing between packets.

For example, if an average interval between reads is 100ms, and 5 packets arrive during 50ms, then you'll lose one even with 4 buffers.
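The example above can be checked with a toy model of the driver's buffer ring. This is a sketch under simple assumptions: frames arrive a fixed gap apart starting at t = 0, the task empties the whole ring at every poll, the first poll is one period after start, and a frame arriving when the ring is full is dropped. The function name is invented for illustration.

```c
#include <stdint.h>

/* Count frames dropped by a ring of ring_size buffers when n_frames arrive
 * gap_ms apart and the reader drains the ring every poll_ms. */
static uint32_t dropped_frames(uint32_t ring_size, uint32_t n_frames,
                               uint32_t gap_ms, uint32_t poll_ms)
{
    uint32_t in_ring = 0, dropped = 0, next_poll = poll_ms;

    if (poll_ms == 0u) {
        return 0u;                    /* continuous draining: nothing is lost */
    }
    for (uint32_t i = 0; i < n_frames; i++) {
        uint32_t t = i * gap_ms;      /* arrival time of frame i */

        while (t >= next_poll) {      /* polls that ran before this arrival */
            in_ring = 0;
            next_poll += poll_ms;
        }
        if (in_ring < ring_size) {
            in_ring++;                /* a buffer was free */
        } else {
            dropped++;                /* ring full: frame lost */
        }
    }
    return dropped;
}
```

Five frames 10ms apart with a 100ms read interval lose one frame with 4 buffers and three frames with 2, while reading every 10ms loses none even with 2.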

IMO, a more reliable approach is to enable the ETH IRQ and copy data from the buffers & release them back to DMA right in the IRQ handler. This way you can get away with 2 buffers only. If the destination (LWIP pbufs) is drained fast enough, you won't lose packets on a 100Base-T LAN.

peter-h · « **Reply #5 on:** July 11, 2022, 07:48:49 am »

Does one lose packets? I am sure ETH doesn't work like that. If there aren't spare RX buffers, the low level system should refuse them, and the sender backs off and retries.

Quote

reducing FIFO size makes system more sensitive

I must be misunderstanding, because that would be true only if the low level ETH and the TCP/IP stack were crap.

MbedTLS does a fair few weird things e.g. the mere opening of a UDP socket (without even using it) prevents HTTP server running while TLS is active. This was "solved" by a flag in the HTTP server to close that socket (it gets re-opened on demand later) if its client is active.

peter-h · « **Reply #6 on:** July 12, 2022, 10:31:03 am »

It amazes me how widely LWIP and FreeRTOS are used but how little help there is. The internet is full of posts about the same topics, over and over, and no answers. These are often ST ports but the ST forum is pretty much a joke, with so many desperate people posting questions that almost nobody can read it let alone answer anything. I've found an LWIP mailing list...

wek · « **Reply #7 on:** July 12, 2022, 11:47:54 am »

These are complex issues so

- "help" is expensive

- there's an infinite number of combinations and use cases, so solutions are not necessarily (read: rarely) transferrable

JW

peter-h · « **Reply #8 on:** July 12, 2022, 03:06:57 pm »

That I understand.

However, a lot of stuff is fairly basic. For example I have spent quite a lot of hours so far digging around the issue of LWIP_TCPIP_CORE_LOCKING=1 and whether it is needed for all of the socket API or just some parts, to achieve RTOS thread safety. Searching the sources for the above, it looks like one is supposed to implement some functions which invoke mutexes, but the whole topic is a mess, and I am far from the first one to be digging around it.

One finds e.g.

Code: [Select]

#if LWIP_TCPIP_CORE_LOCKING
  /* core-locking can just call the -impl function */
  LOCK_TCPIP_CORE();
  err = lwip_getsockopt_impl(s, level, optname, optval, optlen);
  UNLOCK_TCPIP_CORE();

and then searching for that leads to

Code: [Select]

#define LOCK_TCPIP_CORE() sys_mutex_lock(&lock_tcpip_core)

and searching for that leads to

Code: [Select]

/* Lock a mutex*/
void sys_mutex_lock(sys_mutex_t *mutex)
{
#if (osCMSIS < 0x20000U)
  osMutexWait(*mutex, osWaitForever);
#else
  osMutexAcquire(*mutex, osWaitForever);
#endif
}

which leads to

Code: [Select]

#define osMutexWait osMutexAcquire

which leads to (in cmsis_os2.c - whatever that is)

Code: [Select]

osStatus_t osMutexAcquire (osMutexId_t mutex_id, uint32_t timeout) {
  SemaphoreHandle_t hMutex;
  osStatus_t stat;
  uint32_t rmtx;

  hMutex = (SemaphoreHandle_t)((uint32_t)mutex_id & ~1U);

  rmtx = (uint32_t)mutex_id & 1U;

  stat = osOK;

  if (IS_IRQ()) {
    stat = osErrorISR;
  }
  else if (hMutex == NULL) {
    stat = osErrorParameter;
  }
  else {
    if (rmtx != 0U) {
      if (xSemaphoreTakeRecursive (hMutex, timeout) != pdPASS) {
        if (timeout != 0U) {
          stat = osErrorTimeout;
        } else {
          stat = osErrorResource;
        }
      }
    }
    else {
      if (xSemaphoreTake (hMutex, timeout) != pdPASS) {
        if (timeout != 0U) {
          stat = osErrorTimeout;
        } else {
          stat = osErrorResource;
        }
      }
    }
  }

  return (stat);
}

which looks real enough, and leads to

Code: [Select]

#define xSemaphoreTake( xSemaphore, xBlockTime ) xQueueSemaphoreTake( ( xSemaphore ), ( xBlockTime ) )

which leads to some real code in FreeRTOS, which I happen to be using too for mutexes. So somebody has somehow connected LWIP with FreeRTOS! But who on earth writes this kind of stuff??
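One detail in the osMutexAcquire listing above is the masking of mutex_id with 1U: the wrapper appears to keep a "recursive" flag in the lowest bit of the handle, which works because the underlying object is at least 2-byte aligned and that bit is otherwise always zero. A self-contained sketch of the trick, with a stand-in type and hypothetical helper names (none of this is the CMSIS API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the RTOS mutex object; int alignment keeps bit 0 free */
typedef struct { int dummy; } mutex_obj_t;

/* Build a handle with the "recursive" flag stored in bit 0 */
static void *tag_recursive(mutex_obj_t *m)
{
    return (void *)((uintptr_t)m | 1u);
}

/* Recover the real pointer by clearing bit 0 */
static mutex_obj_t *handle_ptr(void *id)
{
    return (mutex_obj_t *)((uintptr_t)id & ~(uintptr_t)1u);
}

/* Read the flag back */
static bool handle_is_recursive(void *id)
{
    return ((uintptr_t)id & 1u) != 0u;
}
```

That is why the listing splits mutex_id into hMutex and rmtx before choosing between xSemaphoreTakeRecursive and xSemaphoreTake.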

Unfortunately setting LWIP_TCPIP_CORE_LOCKING=1 breaks something very deep down. Initially DHCP stops working but the code is too convoluted for me to follow it.

If I had written LWIP (or whatever) I would do some appnotes on these basics. If I released something as open source I would still do appnotes. Instead, put the above into google and you just get loads of websites with the same stuff, copying each other.

I've offered money to get help and still nobody is interested. If I could find a real LWIP+MbedTLS expert who uses ST Cube IDE and wants to have a go at this, I would send him a running board. Currently I have two people (mainland Europe) with boards; one has web server skills, the other various embedded skills. It works well.

ST document UM1713 contains this table

[table image not reproduced]

and half the internet is asking how you calculate MEM_SIZE, which (if you define MEMP_MEM_MALLOC=1) is supposed to contain (almost?) everything listed in that table. This buffer is definitely statically allocated and contains the mem_ private heap used by LWIP. But there are no answers on how to calculate it, even approximately. One can prefill the block and see how much gets modified...
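The prefill approach mentioned above can be sketched as follows. The fill byte and function names are arbitrary choices of mine. On the real target the region would be the ram_heap block, prefilled before LWIP is initialised, and as far as I know LWIP places a small end marker at the very top of its heap, so the scan should stop short of that or it will always report the heap as full.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE 0xA5u   /* arbitrary pattern unlikely to be written by chance */

/* Fill the region with the pattern before the allocator is initialised */
static void region_prefill(uint8_t *buf, size_t len)
{
    memset(buf, FILL_BYTE, len);
}

/* Offset just past the highest byte that no longer holds the pattern,
 * i.e. a high-water mark measured from the start of the region. */
static size_t region_high_water(const uint8_t *buf, size_t len)
{
    while (len > 0 && buf[len - 1] == FILL_BYTE) {
        len--;
    }
    return len;
}
```

This only gives a lower bound on the real peak, since data that happens to equal the fill byte at the top of the used area goes unnoticed, and it needs to be sampled after the heaviest traffic the system will see. LWIP's own statistics (LWIP_STATS with the memory stats enabled) are the other way to get a peak figure, at the cost of some code and RAM.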

peter-h · « **Reply #9 on:** July 14, 2022, 07:45:00 am »

This
https://www.eevblog.com/forum/microcontrollers/any-stm-32f4-eth-lwip-freertos-mbedtls-experts-here-(not-free-advice)/msg4298761/#msg4298761
may be relevant.

Otherwise, it's very hard to get input on this. I am now using 8 x 512 buffering and mem_size of 6k.

peter-h · « **Reply #10 on:** July 19, 2022, 04:01:18 pm »

I have been transferring ~2MB files to a PC and been testing the buffer numbers to see what makes any difference.

With this

#define ETH_TXBUFNB 2 // Tx buffers of size ETH_TX_BUF_SIZE

and a simple TX loop, reading a 21 Mbps FLASH file system

Code: [Select]

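	if (f_open(&fp, fname, FA_READ | FA_OPEN_EXISTING) == FR_OK)
	{
		do
		{
			/* Read the next 512-byte chunk; stop on a filesystem error */
			if ( f_read(&fp, pagebuf, 512, &numread) != FR_OK )
			{
				numread=0;
				break;
			}
			/* Skip the zero-length write at EOF; stop if the connection fails */
			if ( numread > 0 && netconn_write(conn, pagebuf, numread, NETCONN_COPY) != ERR_OK )
			{
				break;
			}
			offset+=512;
		}
		while (numread==512);	/* a short read means end of file */

		f_close(&fp);
	}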

I have found that with 2 or more TX buffers I get a TX speed of 120 kbytes/sec; with 1 it falls to half that. Well, I would expect one buffer to be poor; double buffering is great. The flash file read speed (with no ETH TX) is about 5x to 10x faster, so there is clearly a big bottleneck in LWIP. For this application this is fine, but maybe it is indicative of something?

Removing the above mentioned wait for ETH DMA to be totally finished after each transfer (the wait makes it blocking and effectively defeats the automatic transmission of chained buffers which the 32F4 ETH does; I would expect that to really slow things down) also makes no difference.

Increasing MEM_SIZE in lwipopts.h dramatically (from 6k to say 16k) improves the TX speed by only 20%, but despite zillions of posts nobody really knows how MEM_SIZE relates to TX buffering.
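Two bits of arithmetic help frame the numbers above. The first is how long each 512-byte write takes on average at the measured rate. The second is the generic TCP ceiling of bytes in flight divided by round-trip time. This is a sketch with invented function names: taking the in-flight limit as TCP_SND_BUF (4 x 1460 bytes in the posted config) and the effective round trip as 10ms, because ACKs are only seen when RX is polled at 100Hz, are both my assumptions.

```c
#include <stdint.h>

/* Average time per write call, in microseconds, at a measured throughput */
static uint32_t usec_per_write(uint32_t bytes_per_sec, uint32_t chunk_bytes)
{
    if (bytes_per_sec == 0u) {
        return 0u;                         /* guard against division by zero */
    }
    return (uint32_t)((uint64_t)chunk_bytes * 1000000u / bytes_per_sec);
}

/* Window-limited ceiling: bytes allowed in flight per round trip */
static uint32_t window_limited_bytes_per_sec(uint32_t window_bytes,
                                             uint32_t rtt_us)
{
    if (rtt_us == 0u) {
        return 0u;                         /* guard against division by zero */
    }
    return (uint32_t)((uint64_t)window_bytes * 1000000u / rtt_us);
}
```

At 120 kbytes/sec each write averages about 4.3ms, while the window-limited ceiling under those assumptions is 584 kbytes/sec. Neither figure proves where the bottleneck is, but timing a single netconn_write call against the 4.3ms average would show whether the time goes in the call itself or in waiting for ACKs.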

I haven't got onto file upload (ETH -> filesystem) but since the FS write speed is limited to just 30kbytes/sec I am not expecting much.

Might do some tests though without writing to FLASH.

