OTOH, one function that has proven (to me at least) hard to beat for general sorting is qsort() - on many platforms, for small or large datasets alike.
I have the same experience.
However, sometimes you can piggy-back the sort on top of some other operation, spending a tiny bit more CPU time but completing the task in way less wall-clock time.
One of my favourite examples of this is the common C exercise of implementing a simple sort command that sorts an input file line by line, in either ascending or descending order. (I prefer the POSIX approach, using getline() and locale-aware strcasecmp(), but that's unimportant. For a really interesting case, one could use wide-character input and wcscasecmp().)
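For concreteness, here is a rough sketch of the usual array-plus-qsort() version I have in mind, assuming POSIX getline()/strdup() and omitting all error handling:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>
    #include <locale.h>

    /* Case-insensitive comparator for qsort() over an array of char pointers. */
    static int cmp_lines(const void *a, const void *b)
    {
        const char *const *sa = a, *const *sb = b;
        return strcasecmp(*sa, *sb);
    }

    int main(void)
    {
        char  **line = NULL;
        size_t  count = 0, space = 0;
        char   *buf = NULL;
        size_t  buflen = 0;
        ssize_t len;

        setlocale(LC_ALL, "");

        /* Phase 1: read everything into an array of pointers. */
        while ((len = getline(&buf, &buflen, stdin)) > 0) {
            if (buf[len - 1] == '\n')
                buf[len - 1] = '\0';
            if (count >= space) {
                space = space ? 2 * space : 1024;
                line  = realloc(line, space * sizeof line[0]);
            }
            line[count++] = strdup(buf);
        }
        free(buf);

        /* Phase 2: only now does any sorting happen. */
        qsort(line, count, sizeof line[0], cmp_lines);

        for (size_t i = 0; i < count; i++)
            puts(line[i]);
        return 0;
    }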
I've seen programmers spend days fine-tuning their sort algorithm, testing and microbenchmarking the alternatives... and then get blown clear out of the water by a simple, crude version that, instead of reading each line into an array (of pointers), reads each line into a binary heap or similar self-sorting data structure. By the time the fine-tuned version is ready to start its optimized sort, the crude version is already emitting its first output line.
You see, unless the dataset is small (in which case the algorithm does not matter), or already resident in memory (in the page cache; which is unrealistic, because if it were already cached it would have been sorted as a side effect of whatever operation got it cached in the first place), the true bottleneck in larger operations is the I/O needed to read the input and save the output. (This is especially true if there is any kind of parsing involved, but it already shows up in simple string sorting.) If, instead of doing I/O first and computation (sorting) afterwards, you do the two in parallel, you minimize the wall-clock time taken.
A binary heap is a very good example of this. Half of the sorting work is done when inserting each line into the heap, and half when popping the lines out in sorted order. When the output is desired in ascending order, you use a binary min-heap; when in descending order, a binary max-heap; in both cases with an array representation for the heap. The (amortized) cost per operation is O(log N) for a binary heap (some other heaps perform better), so nothing spectacular; but because that cost is interleaved with the I/O operations, the machine is essentially doing I/O and computation concurrently, trading a little bit of extra CPU for getting the work done in less wall-clock time.
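A correspondingly rough sketch of the heap variant, again with error handling omitted: a min-heap in an array, ordered by strcasecmp(), producing ascending output (negate the comparisons for a max-heap and descending output):

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>
    #include <locale.h>

    static char **heap = NULL;
    static size_t used = 0, space = 0;

    static void heap_push(char *s)
    {
        if (used >= space) {
            space = space ? 2 * space : 1024;
            heap  = realloc(heap, space * sizeof heap[0]);
        }
        /* Sift up: shift larger parents down until s finds its slot. */
        size_t i = used++;
        while (i > 0 && strcasecmp(s, heap[(i - 1) / 2]) < 0) {
            heap[i] = heap[(i - 1) / 2];
            i = (i - 1) / 2;
        }
        heap[i] = s;
    }

    static char *heap_pop(void)
    {
        char  *top  = heap[0];
        char  *last = heap[--used];
        size_t i = 0;

        /* Sift down: find the right slot for the former last element. */
        for (;;) {
            size_t child = 2 * i + 1;
            if (child >= used)
                break;
            if (child + 1 < used && strcasecmp(heap[child + 1], heap[child]) < 0)
                child++;
            if (strcasecmp(last, heap[child]) <= 0)
                break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = last;
        return top;
    }

    int main(void)
    {
        char   *buf = NULL;
        size_t  buflen = 0;
        ssize_t len;

        setlocale(LC_ALL, "");

        while ((len = getline(&buf, &buflen, stdin)) > 0) {
            if (buf[len - 1] == '\n')
                buf[len - 1] = '\0';
            heap_push(strdup(buf));   /* half the sorting work happens here,
                                         interleaved with the reads */
        }
        free(buf);

        while (used > 0) {            /* the other half happens here,
                                         interleaved with the writes */
            char *s = heap_pop();
            puts(s);
            free(s);
        }
        return 0;
    }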
I discovered this for myself way back when we still used spinning disks of rust, and many practical datasets were larger than would fit in memory. It still applies, although for microbenchmarking you definitely want to make sure you clear your caches first. (On Linux, before each microbenchmark run that reads its input from a file, I do sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches; sync; echo 3 > /proc/sys/vm/drop_caches', which is safe: it does not discard unwritten data.)
Sometimes you don't want the operation to be as fast as possible, but rather to consume as few resources as possible (especially as little RAM as possible), while still keeping the wall-clock time within reasonable bounds (say, less than 10x what the fastest method could do). Such background tasks are common, but I see fewer and fewer programmers considering this at all.
For maximum computational efficiency, we need to keep the data flowing, with all cores assigned to our process doing useful calculation at all times, or we are wasting real-world time. Microbenchmarks are useful for the hot spots in the code, but they matter far less than the overall approach, the underlying algorithm for the whole task at hand.
I often harp about how MD simulations still do calculation, then communication, then calculation, and so on, instead of doing them at the same time. It really, really bugs me, and has bugged me for two decades now. Instead of fixing this fundamental issue, they're now pushing some of the calculations to GPGPUs, and machines with a terabyte or so of RAM, just because they and their programmers cannot do distributed simulations right. Money thrown to the wind.
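Just to illustrate the pattern I mean (this is not taken from any real MD package; compute_interior() and compute_boundary() are placeholder names): start the halo exchange with nonblocking MPI calls, do all the work that does not depend on the remote data while the network is busy, and only then wait for the transfers.

    #include <mpi.h>

    /* Placeholder work routines; in a real code these would be the parts of
       the force evaluation that do or do not depend on the neighbours' data. */
    void compute_interior(void);
    void compute_boundary(const double *halo_recv);

    void step(double *halo_send, double *halo_recv, int n,
              int left, int right, MPI_Comm comm)
    {
        MPI_Request req[4];

        /* Start the exchange, but do not wait for it yet. */
        MPI_Irecv(halo_recv,     n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(halo_recv + n, n, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(halo_send,     n, MPI_DOUBLE, left,  1, comm, &req[2]);
        MPI_Isend(halo_send + n, n, MPI_DOUBLE, right, 0, comm, &req[3]);

        compute_interior();          /* useful work overlaps the transfers */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        compute_boundary(halo_recv); /* only this part had to wait */
    }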