OTOH, one function that has proven (to me at least) hard to beat for general sorting is qsort() - on many platforms, for small or large datasets alike.
I have the same experience.
However, sometimes you can piggy-back the sort on top of some other operation, spending a tiny bit more CPU time but completing the task in way less wall-clock time.
One of my favourite examples of this is the common C exercise of implementing a simple sort command that sorts an input file line by line, in either ascending or descending order. (I prefer the POSIX approach, using getline() and locale-aware strcasecmp(), but that's unimportant. For a really interesting case, one could use wide-character input and wcscasecmp().)
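For concreteness, here is a rough sketch of the usual array-plus-qsort() version I have in mind, assuming POSIX getline()/strdup() and omitting all error handling:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>
    #include <locale.h>

    /* Case-insensitive comparator for qsort() over an array of char pointers. */
    static int cmp_lines(const void *a, const void *b)
    {
        const char *const *sa = a, *const *sb = b;
        return strcasecmp(*sa, *sb);
    }

    int main(void)
    {
        char  **line = NULL;
        size_t  count = 0, space = 0;
        char   *buf = NULL;
        size_t  buflen = 0;
        ssize_t len;

        setlocale(LC_ALL, "");

        /* Phase 1: read everything into an array of pointers. */
        while ((len = getline(&buf, &buflen, stdin)) > 0) {
            if (buf[len - 1] == '\n')
                buf[len - 1] = '\0';
            if (count >= space) {
                space = space ? 2 * space : 1024;
                line  = realloc(line, space * sizeof line[0]);
            }
            line[count++] = strdup(buf);
        }
        free(buf);

        /* Phase 2: only now does any sorting happen. */
        qsort(line, count, sizeof line[0], cmp_lines);

        for (size_t i = 0; i < count; i++)
            puts(line[i]);
        return 0;
    }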
I've seen programmers spend days fine-tuning their sort algorithm, testing and microbenchmarking the alternatives... and then get blown clear out of the water by a simple, crude version that, instead of reading each line into an array (of pointers), reads each line into a binary heap or similar self-sorting data structure. By the time the fine-tuned version is ready to start its optimized sort, the crude version is already emitting its first output line.
You see, unless the dataset is small (in which case the algorithm does not matter), or already resident in memory (in the page cache; which is unrealistic, because if it were already cached it would have been sorted as a side effect of whatever operation got it cached in the first place), the true bottleneck in larger operations is the I/O needed to read the input and save the output. (This is especially true if there is any kind of parsing involved, but it already shows up in simple string sorting.) If, instead of doing I/O first and computation (sorting) afterwards, you do the two in parallel, you minimize the wall-clock time taken.
A binary heap is a very good example of this. Half of the sorting work is done when inserting each line into the heap, and half when popping the lines out in sorted order. When the output is desired in ascending order, you use a binary min-heap; when in descending order, a binary max-heap; in both cases with an array representation for the heap. The (amortized) cost per operation is O(log N) for a binary heap (some other heaps perform better), so nothing spectacular; but because that cost is interleaved with the I/O operations, the machine is essentially doing I/O and computation concurrently, trading a little bit of extra CPU for getting the work done in less wall-clock time.
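A correspondingly rough sketch of the heap variant, again with error handling omitted: a min-heap in an array, ordered by strcasecmp(), producing ascending output (negate the comparisons for a max-heap and descending output):

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>
    #include <locale.h>

    static char **heap = NULL;
    static size_t used = 0, space = 0;

    static void heap_push(char *s)
    {
        if (used >= space) {
            space = space ? 2 * space : 1024;
            heap  = realloc(heap, space * sizeof heap[0]);
        }
        /* Sift up: shift larger parents down until s finds its slot. */
        size_t i = used++;
        while (i > 0 && strcasecmp(s, heap[(i - 1) / 2]) < 0) {
            heap[i] = heap[(i - 1) / 2];
            i = (i - 1) / 2;
        }
        heap[i] = s;
    }

    static char *heap_pop(void)
    {
        char  *top  = heap[0];
        char  *last = heap[--used];
        size_t i = 0;

        /* Sift down: find the right slot for the former last element. */
        for (;;) {
            size_t child = 2 * i + 1;
            if (child >= used)
                break;
            if (child + 1 < used && strcasecmp(heap[child + 1], heap[child]) < 0)
                child++;
            if (strcasecmp(last, heap[child]) <= 0)
                break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = last;
        return top;
    }

    int main(void)
    {
        char   *buf = NULL;
        size_t  buflen = 0;
        ssize_t len;

        setlocale(LC_ALL, "");

        while ((len = getline(&buf, &buflen, stdin)) > 0) {
            if (buf[len - 1] == '\n')
                buf[len - 1] = '\0';
            heap_push(strdup(buf));   /* half the sorting work happens here,
                                         interleaved with the reads */
        }
        free(buf);

        while (used > 0) {            /* the other half happens here,
                                         interleaved with the writes */
            char *s = heap_pop();
            puts(s);
            free(s);
        }
        return 0;
    }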
I discovered this for myself way back when we still used spinning disks of rust, and many practical datasets were larger than would fit in memory. It still applies, although for microbenchmarking you definitely want to make sure you clear your caches first. (On Linux, before each microbenchmark run that reads its input from a file, I do sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches; sync; echo 3 > /proc/sys/vm/drop_caches', which is safe: it does not discard unwritten data.)
Sometimes you don't want the operation to be as fast as possible, but rather to consume as few resources as possible (especially as little RAM as possible), while still keeping the wall-clock time within reasonable bounds (say, less than 10x what the fastest method could do). Such background tasks are common, but I see fewer and fewer programmers considering this at all.
For maximum computational efficiency, we need to keep the data flowing, with all cores assigned to our process doing useful calculation at all times, or we are wasting real-world time. Microbenchmarks are useful for the hot spots in the code, but they matter far less than the overall approach, the underlying algorithm for the whole task at hand.
I often harp about how MD simulations still do calculation, then communication, then calculation, and so on, instead of doing them at the same time. It really, really bugs me, and has bugged me for two decades now. Instead of fixing this fundamental issue, they're now pushing some of the calculations to GPGPUs, and machines with a terabyte or so of RAM, just because they and their programmers cannot do distributed simulations right. Money thrown to the wind.
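Just to illustrate the pattern I mean (this is not taken from any real MD package; compute_interior() and compute_boundary() are placeholder names): start the halo exchange with nonblocking MPI calls, do all the work that does not depend on the remote data while the network is busy, and only then wait for the transfers.

    #include <mpi.h>

    /* Placeholder work routines; in a real code these would be the parts of
       the force evaluation that do or do not depend on the neighbours' data. */
    void compute_interior(void);
    void compute_boundary(const double *halo_recv);

    void step(double *halo_send, double *halo_recv, int n,
              int left, int right, MPI_Comm comm)
    {
        MPI_Request req[4];

        /* Start the exchange, but do not wait for it yet. */
        MPI_Irecv(halo_recv,     n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(halo_recv + n, n, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(halo_send,     n, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(halo_send + n, n, MPI_DOUBLE, right, 0, comm, &req[3]);

        compute_interior();          /* useful work overlaps the transfers */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        compute_boundary(halo_recv); /* only this part had to wait */
    }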