About those caches (it's been a few years, I'm a bit rusty):
Caches aren't handled by the OS; they're implemented in hardware on the CPU. What determines cache hits and cache misses is your access pattern. The cache has no address space of its own, it's just a cache of the RAM. If I read a value from RAM location 0x5365, the cache loads the whole block of memory containing that address (typically 64 bytes) into a cache line, along with a tag recording which addresses the line holds. The cache also keeps track of recency of access, usually just a few bits approximating least-recently-used order rather than a full timestamp. Each line also has a dirty bit recording whether it was written to or just read, so modified data can be written back to RAM when that cache line is evicted.
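A rough sketch of that bookkeeping (a toy model, not how any real CPU lays it out, and the 64-byte line size is an assumption, though it's typical on x86):

```python
from dataclasses import dataclass

LINE_SIZE = 64  # bytes per cache line; 64 is common but varies by CPU

@dataclass
class CacheLine:
    tag: int             # which block of RAM this line holds (address // LINE_SIZE)
    data: bytearray      # a copy of that whole block, not just one value
    last_used: int = 0   # recency counter, used for LRU-style eviction
    dirty: bool = False  # set on writes; triggers write-back to RAM on eviction

def tag_for(address: int) -> int:
    # Reading 0x5365 actually brings in the whole surrounding block:
    # every address from 0x5340 through 0x537F lands in the same line.
    return address // LINE_SIZE

print(hex(tag_for(0x5365) * LINE_SIZE))  # start of the block containing 0x5365
```

This is why reading neighbouring addresses is nearly free once one of them has been touched: they ride along in the same line.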
If the program doesn't access address 0x5365 for a while, that line becomes the least recently used, and it will eventually be evicted so another block of RAM can be stored in its place.
If the program does access 0x5365 frequently, it will stay in the cache and the program will run faster.
If you keep the amount of memory you access smaller than your cache, you will get cache hits for all of it and the code will run much faster. If your cache is 4 lines and you repeatedly cycle through data spread across 5 different lines, your program will start to slow down: the cache keeps evicting the line you're about to need, so accesses keep going back to RAM.
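That cliff shows up clearly in a toy LRU cache simulation (a sketch of the idea only; it ignores line granularity and associativity):

```python
from collections import OrderedDict

class ToyLRUCache:
    """Fully associative cache with LRU eviction; counts hits and misses."""
    def __init__(self, lines: int):
        self.lines = lines
        self.store = OrderedDict()  # address -> data, kept in recency order
        self.hits = 0
        self.misses = 0

    def access(self, address: int):
        if address in self.store:
            self.hits += 1
            self.store.move_to_end(address)  # mark most recently used
        else:
            self.misses += 1
            if len(self.store) == self.lines:
                self.store.popitem(last=False)  # evict least recently used
            self.store[address] = "data"

# Working set of 4 addresses in a 4-line cache: misses only on first touch.
cache_fit = ToyLRUCache(lines=4)
for _ in range(100):
    for addr in (0x00, 0x40, 0x80, 0xC0):
        cache_fit.access(addr)
print(cache_fit.hits, cache_fit.misses)  # 396 hits, 4 misses

# Cycle through a 5th address and LRU evicts exactly the line needed next:
# every single access misses.
cache_thrash = ToyLRUCache(lines=4)
for _ in range(100):
    for addr in (0x00, 0x40, 0x80, 0xC0, 0x100):
        cache_thrash.access(addr)
print(cache_thrash.hits, cache_thrash.misses)  # 0 hits, 500 misses
```

Note the failure mode: going from a working set that fits to one that's even one line too big doesn't degrade gracefully under this access pattern, it falls off a cliff.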
Again, all of this is done in hardware, and it all appears to the code as one address space. I'm not sure how multiple caches interact exactly, but you can think of each cache level as doing similar things, with each level larger and roughly an order of magnitude slower than the one before it. When accessing memory, the processor checks L1; if the address is there, use it; if not, check L2; if it's there, use it; if not, check L3, and so on all the way back to the RAM.
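That lookup chain can be sketched the same way (again a toy model; the latencies are illustrative round numbers, not measurements of any real CPU, and real hardware also promotes and evicts between levels automatically):

```python
# Toy multi-level lookup: check each level in order, falling back to RAM.
LEVELS = [("L1", 4), ("L2", 40), ("L3", 400)]  # (name, cost in made-up cycles)
RAM_COST = 4000

def load(address, caches):
    """caches: dict mapping level name -> set of cached addresses."""
    cost = 0
    for name, level_cost in LEVELS:
        cost += level_cost  # checking a level costs time even on a miss
        if address in caches[name]:
            return name, cost
    # Missed every level: go to RAM and fill the caches on the way back.
    cost += RAM_COST
    for name, _ in LEVELS:
        caches[name].add(address)
    return "RAM", cost

caches = {"L1": set(), "L2": {0x5365}, "L3": set()}
print(load(0x5365, caches))  # found in L2, after paying for the L1 check
print(load(0x9999, caches))  # misses everywhere: full trip to RAM
print(load(0x9999, caches))  # now cached, so the repeat access is cheap
```

The point of the cost accounting is that a miss is never free: you pay for every level you checked on the way down, which is why the levels are sized so that most accesses stop early.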
In multicore processors, it gets even more complicated and more dependent on which CPU you have. Cores on the same die might share an L3 cache but have their own private L2 and L1 caches, or use some other arrangement. It gets more complicated still when you have multiple CPUs in separate chips that don't share any cache at all across the chips.
What you can do to optimize in assembly is control your access patterns to RAM more directly, and control how much is kept in registers, which are faster still than cache, particularly when you're using the same location very frequently.