Just googled the datasheets. Only got one of them. Google really goes off in the weeds with the Mega-0!
Huh, go figure...
Example part is a '3208, here's the mfg page,
https://www.microchip.com/en-us/product/ATmega3208 or see the rest of the family from there.
The PIC's gated timer, for example, doesn't appear at all on an AVR, to my knowledge, and it can be used separately as a free-running timer and a flexible-sourced interrupt trigger. Or you can use them together to measure a pulse width in hardware, and present it to the ISR as the stopped timer value. Or you can use them to add a delay to an event trigger, by connecting the source to the gate and the destination to the timer overflow, and setting the timer to overflow "soon". Etc. And that's just one peripheral!
DA's timer D has some pretty interesting features, including generating and accepting events; I'm not sure about something like masked clocks, but there are duty cycle capture modes (you still have to do the division in software, but the period and width are measured in hardware anyway).
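To make the "division in software" part concrete, here's a minimal sketch. It assumes the capture hardware has already handed you the high-time and the period in timer ticks; the function name and the 0.1% scaling are my own choices for illustration, not anything from the datasheet.

```c
#include <stdint.h>

/* Duty cycle in units of 0.1 % (0..1000), from a captured high-time
 * and period, both in timer ticks. Name and scaling are hypothetical. */
uint16_t duty_permille(uint16_t width_ticks, uint16_t period_ticks)
{
    if (period_ticks == 0)
        return 0;                     /* no valid capture yet */
    if (width_ticks >= period_ticks)
        return 1000;                  /* clamp bogus or 100 % captures */
    /* Widen to 32 bits so width * 1000 cannot overflow 16 bits. */
    return (uint16_t)(((uint32_t)width_ticks * 1000u) / period_ticks);
}
```

On an 8-bit core that divide is a library call, so it's probably something for the main loop rather than the capture ISR.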
A general-purpose DMA would be a big game-changer, but the closest that either of them comes to that is dedicated to the USB peripheral alone, and it can only talk to a user-configured block of memory, not to another peripheral.
XMEGA A-series have DMA (and some of the larger ones in lower-letter series?). I haven't used it, but it looks pretty rich, pages of registers. So, you can probably pull tricks like one channel driving the others for rich scripting behavior, or maybe even Turing completeness, I have no idea. Not sure about DA and family; at least not the 64DAxx I was looking at.
There's also the CCL (configurable custom logic), which a lot of PICs have I think; and, integrating with the event system, you can get a lot of logic and sequential functionality that way.
Something kind of unique that I've seen on a few PICs, SMPS control -- you can build a peak current mode control for example, almost trivially; similar functionality I think is constructible with timers and events. Probably at greater expense to hardware resources -- that is, you're tying up a whole timer and a couple event channels to do it, and maybe that limits what else you can do -- but it's also not too common I would guess, that you need to make use of even a fraction of the available resources.
One of my recent projects deliberately has both a PIC and an AVR in it, talking to each other. The PIC manages a digitally-controlled analog signal path that has some precise timing involved to switch things in and out of circuit, relative to an external clock, so all of that timing is done in hardware with the CPU just "playing housekeeper" around it. The AVR uses its fast instruction rate to semi-bit-bang a fast pulse-width-based serial protocol. It uses the PWM module to create the exact timing in hardware, but the protocol is so fast that about all it can do (with GCC fully optimized for speed at the expense of size and still refactored until it worked) is get the next bit, choose one of two constants, busy-wait a few cycles for the PWM interrupt flag (less than the interrupt latency), set the new duty-cycle, and repeat. Once it's done sending a frame, it can let the main loop catch up. Everything else is set to be slow enough that it can be ignored for an entire frame of this without clobbering a hardware buffer.
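For anyone following along, the inner loop described there has roughly this shape. This is only a sketch under my own assumptions: the two duty constants, the MSB-first bit order, and the helper names are all made up, and the hardware accesses are replaced with host-side stand-ins that log to an array so it runs on a PC.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical compare values for a "0" and a "1" pulse width. */
#define DUTY_ZERO 3u
#define DUTY_ONE  9u

/* Host-side stand-ins for the hardware. On the real part these would
 * be a spin on the PWM interrupt flag and a compare-register write. */
static uint8_t duty_log[64];
static size_t  duty_count;

static inline void pwm_wait_period(void)
{
    /* real hardware: busy-wait until the flag is set, then clear it */
}

static inline void pwm_set_duty(uint8_t duty)
{
    if (duty_count < sizeof duty_log)
        duty_log[duty_count++] = duty;
}

/* Send one frame, MSB first (the bit order is an assumption). */
void send_frame(const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t byte = data[i];
        for (uint8_t n = 0; n < 8; n++) {
            /* get the next bit, choose one of two constants */
            uint8_t duty = (byte & 0x80u) ? DUTY_ONE : DUTY_ZERO;
            byte = (uint8_t)(byte << 1);
            pwm_wait_period();    /* shorter than interrupt latency */
            pwm_set_duty(duty);   /* takes effect on the next period */
        }
    }
}
```

Sending the single byte 0xA5 this way logs the duty sequence 1,0,1,0,0,1,0,1 (as DUTY_ONE/DUTY_ZERO values). The real thing would of course need the loop body to fit inside one PWM period, which is where all the refactoring goes.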
Ah... interesting to note that, whereas MPLAB intentionally cripples its output, avr-gcc is just not very good at it. I've clocked it at about half the speed of hand-optimized assembly, at least for DSP (YMMV for other activities). GCC is good at optimizing, but it's done entirely on the internal representation (GIMPLE), and the target architecture is simply translated from that with little postprocessing (AFAIK?). It does very well for ARM and x86 -- of course, these are the most highly developed branches, or at least, I would guess -- but AVR and such differ pretty significantly from the IR, and so there's a lot of wasted busywork, like shifting around register allocations, extending sign into registers that aren't ultimately read, etc. And it only inlines 8x8 MULs; anything larger is sign-extended and called out to a library routine (e.g. __mulhisi3). So any kind of arithmetic can be a challenge to optimize, short of assembling it yourself.
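As a concrete case of the multiply issue, this is the kind of plain C that ends up going through the library on AVR, in my experience. The function names are mine, and the Q15 format is just an assumed example of typical DSP arithmetic; compile it with avr-gcc and look at the listing to see what your version actually does with it.

```c
#include <stdint.h>

/* 16x16 -> 32-bit signed multiply. The cast widens one operand so
 * the product is computed in 32 bits rather than truncated to int. */
int32_t mul16x16(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}

/* Q15 fixed-point multiply built on the same widening product.
 * No saturation: -1.0 * -1.0 does not fit and is left unhandled.
 * Relies on >> of a negative value being an arithmetic shift,
 * which is what GCC does. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(p >> 15);
}
```

In hand-written assembly you'd build the same product out of a few MUL/MULS/MULSU instructions and only keep the bytes you need, which is where a good part of the speed difference comes from.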
On the upside, the ISA is pretty reasonable, the main non-orthogonality being that some instructions are limited to the upper registers (e.g. r16-r31). As a load-store architecture, it's rather verbose, so maybe not all that pleasant to write pure assembly in.
I think the same is kind of true of PIC as well, just in different ways; everything has to pull through the W register, but you have fast page access and that. Dunno how adaptable it is, putting ASM with PIC-C. (What even is the compiler's ABI, how does it deal with hardware stack and page access? I haven't read any on it. Not that I care to; just that, this is critical information to be able to do that.) Instructions are "slower" too (~4 clocks/ins), but the clock is typically much higher so they're comparable in that respect. (Heck, same is true of competing ARMs -- those lacking pipelining and cache!)
I might have been able to do that with the PIC's modulation peripheral instead, which is essentially a 2:1 MUX with some extra flair, but I only thought of it just now. It's meant to take a bit stream as the selection input, and switch between two other signals, like DC and a defined frequency for On/Off Keying, or two different frequencies, or whatever. In this case, I would give it two fixed-duty PWM signals, both clocked in reference to the bit stream so that each bit gets exactly one period.
(Maybe both PWM's use the same timer, and one is also fed to the SPI's clock pin in slave mode, then the SPI's data goes to the modulator selection? Then I'd only have to keep up with bytes and not bits! If it also had a DMA, then this entire software driver would be obsolete! Just set it up and let it run, with a static array as its input.)
Sounds like something that should perhaps be expanded out into SPI frames or something instead? -- but obviously this isn't enough info to tell. Especially tricky if it needs perfect timing; these devices rarely(?) have buffered SPI so there will inevitably be dead time between frames, while the loop/interrupt latency cranks through. DMA can help, but may still suffer from latency due to bus contention. (In which case, a bus matrix may pay off -- which most of these have, I think, but you're still limited by priority access to SRAM or the like, as both DMA and CPU will be needing that from time to time.)
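By "expanded out into SPI frames" I mean something like the following, where each data bit becomes a fixed pattern of SPI bits and the SPI clock sets the timing. The 4-bit symbols (1000 for a zero, 1110 for a one) and the MSB-first order are assumptions for illustration; the real protocol's widths would dictate the patterns.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 4-bit SPI symbols: short pulse = 0, long pulse = 1. */
#define SYM_ZERO 0x8u   /* 1000 */
#define SYM_ONE  0xEu   /* 1110 */

/* Expand data bytes into SPI bytes, two data bits per SPI byte,
 * MSB first. Returns the number of SPI bytes written (4 per input
 * byte), or fewer if the output buffer runs out. */
size_t expand_to_spi(const uint8_t *in, size_t len,
                     uint8_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = in[i];
        for (int pair = 0; pair < 4; pair++) {
            if (n >= out_cap)
                return n;
            uint8_t hi = (b & 0x80u) ? SYM_ONE : SYM_ZERO;
            uint8_t lo = (b & 0x40u) ? SYM_ONE : SYM_ZERO;
            out[n++] = (uint8_t)((hi << 4) | lo);
            b = (uint8_t)(b << 2);
        }
    }
    return n;
}
```

So 0xA5 expands to E8 E8 8E 8E. It costs 4x the RAM and only works if the SPI can stream bytes back to back, which is exactly the dead-time problem.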
Siwastaja's bit-stuffed protocols are the kind of example where, while you could transmit it via SPI (I mean, given plenty of other assumptions, too), the amount of cranking required to prepare the next frame is nontrivial. And it might not be easily tamed with lookup tables -- do mind, you can easily pack a 64kB table into these devices, as AVR's PGM space has a 16-bit address and 16-bit width, so it supports 128kB without fumbling with extended addresses* (up to 384k are available).
*Except you'll still be fumbling for that extra bit because LPM (load from program memory) fetches a byte at a time. But if you place the table entirely in "high" memory, you probably don't have to touch the extended address register (RAMPZ).
Heck... y'know, they could've easily made that a load-word instruction instead, and loaded adjacent registers (e.g. LPM r16:r17, Z+), or even removed the choice and made it implicit (say r0:r1, like the destination of MUL). Or put in byte and word variants. Shrug, it is what it is. I digress...
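Back on the bit-stuffing point, here's a minimal sketch of why the per-frame cranking is nontrivial. It assumes the HDLC-style rule (insert a 0 after five consecutive 1s) and MSB-first packing; I don't know which protocol Siwastaja had in mind, so treat the rule and the names as placeholders.

```c
#include <stdint.h>
#include <stddef.h>

/* Append one bit to a packed, MSB-first buffer. Returns 0 if full. */
static int put_bit(uint8_t *out, size_t out_cap, size_t *nbits,
                   uint8_t bit)
{
    size_t byte = *nbits / 8;
    if (byte >= out_cap)
        return 0;
    uint8_t mask = (uint8_t)(0x80u >> (*nbits % 8));
    if (bit)
        out[byte] |= mask;
    else
        out[byte] &= (uint8_t)~mask;
    (*nbits)++;
    return 1;
}

/* Bit-stuff 'len' bytes into 'out'; returns the number of bits
 * written. Output length depends on the data, so frames no longer
 * line up with byte boundaries. */
size_t bit_stuff(const uint8_t *in, size_t len,
                 uint8_t *out, size_t out_cap)
{
    size_t  nbits = 0;
    uint8_t ones  = 0;   /* current run of consecutive 1 bits */
    for (size_t i = 0; i < len; i++) {
        for (int k = 7; k >= 0; k--) {
            uint8_t bit = (uint8_t)((in[i] >> k) & 1u);
            if (!put_bit(out, out_cap, &nbits, bit))
                return nbits;
            ones = bit ? (uint8_t)(ones + 1) : 0;
            if (ones == 5) {          /* stuff a 0 after five 1s */
                if (!put_bit(out, out_cap, &nbits, 0))
                    return nbits;
                ones = 0;
            }
        }
    }
    return nbits;
}
```

A single 0xFF byte comes out as 9 bits (11111 0 111). The run counter carries across byte boundaries, which is what makes a simple per-byte lookup table awkward: the table would have to be indexed by the byte plus the run state, and would have to return a variable bit count.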
This might be a case of me thinking differently from the rest of the industry, but I have also yet to see a decently-documented library. The alternative is understanding multiple 1,000+ page datasheets, some of which are hard to find because you're supposed to use the library instead...
The problem here is that no one wants to deal with the complexity, of course -- so you have a cornucopia of options to choose from, none of which is especially better than the others. If you stick to the mfg tools (e.g. ST's CubeIDE stuff), they'll handle all this for you with a few switches allocating the hardware resources and such, and zoop -- code generation, there you go, add user code and you're off. Oh, and don't mind that 50kB of bloat that you've just linked into your "blinky" application. I mean, they give 128kB+ of Flash on these things, but they also seem to do a damn good job of using it up for you.
And as I recall, not a lot of that bloat even drops off with -flto, it's not just superfluous crap, it's getting called from somewhere, somehow.
So, if you want to pare down the bloat, or speed up init, or just do things slightly differently from the official ways -- you're on your own.
One would hope for a sort of resource compiler, that doesn't just generate code snippets, but actually writes just the functions and initializers you need. But that would be a tremendous amount more work, to construct and test and debug and support, on top of the hardware selector, on top of the IDE, on top of whatever libraries and compiler support you need to get your new chips into mainline projects (GCC, libc, CMSIS, whatever). And it's only to support... people like us? Just so we can save a few bucks avoiding the upscale MCU that's more profitable for the manufacturer anyway? It feels intentional, but it's really just the coincidence that we're on the exact opposite end of mainstream development.
Speaking of profits -- these fancier AVRs tend to be rather expensive besides. Like, ATXMEGA64D3 is a bit over $4. ATMEGA3208, $1.25. AVR64DA64, $2. Ye olde ATMEGA328P is $1.50, and ATTINY402 from $0.50. They're making some improvements in the newer lines, and, I don't know offhand if XMEGAs were ever cheaper, if they're being phased out by raising prices or something; but the older classics clearly have staying power, like the MEGA and TINY. Clearly, XMEGA's been priced well above competing ARMs (e.g., ATSAME, STM32F0, etc.), and that's one of the smaller parts in the family even.
And, however that compares with PICs, you'll know better than me offhand, heh.
Tim