Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction alone can be predicted pretty well without tags and with a relatively small number of entries (the first, simple predictors only did that, I think), but target addresses are another story.
The target address is known (if the branch is taken) for everything except JALR (which isn't conditional).
So the set of things for which a branch direction prediction is needed is completely disjoint from the set of things for which a branch target prediction is needed, so they are usually handled by completely different data structures.
You probably didn't follow what I replied to hamster. (Also, they are usually, but *not always*, handled separately. While I've studied the state of the art reasonably well, I'm implementing things here with a specific set of requirements.) There are several reasons I mixed both:
- Implementing a scheme that can be potentially reused for other ISAs without major modifications;
- Handling ALL branches in the same way;
- Security (yes, checking that a prediction actually belongs to the current branch instruction and not to another one prevents a few potential exploit issues, so it doesn't just marginally improve the prediction rate, it serves another purpose; see the tag check in the sketch below);
- Simplifying logic (at the expense of more memory), which benefits design simplicity, verification, AND the length of logic paths.
As I implemented it, the branch predictor doesn't just predict branch direction (and target); it does so while acting as a kind of cache. So basically, at the fetch stage, the corresponding predictor table line is fetched, and at the decode stage we have all the info needed for branching without adding dependencies on the instruction decoding itself. This limits logic depth.
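To make that concrete, here's a minimal C sketch of the idea. Everything here is illustrative (table size, field widths, bit slicing), not my actual implementation:

```c
#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 256  /* assumed table size; must be a power of two */

/* One predictor table line; fetched as a whole during the fetch stage. */
typedef struct {
    uint32_t tag;      /* upper PC bits of the branch that owns this line */
    uint32_t target;   /* predicted target address */
    uint8_t  counter;  /* 2-bit saturating direction counter */
    bool     valid;
} pred_line_t;

static pred_line_t table[ENTRIES];

/* Looked up at fetch; by decode, direction and target are both known
 * without any dependency on decoding the instruction itself. */
bool predict(uint32_t pc, uint32_t *target, bool *taken)
{
    pred_line_t line = table[(pc >> 2) & (ENTRIES - 1)];

    /* The tag check guarantees the prediction belongs to THIS branch
     * and never to an aliasing one (the security point above). */
    if (!line.valid || line.tag != (pc >> 10))
        return false;  /* no prediction: fall back to static behavior */

    *target = line.target;
    *taken  = line.counter >= 2;
    return true;
}
```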
Implementing "gshare" should massively increase the prediction rate.
I've read a few papers, and the "massively" looks pretty overrated. A couple of papers notably find that, on a mix of benchmarks, gshare is slightly better, but nothing drastic.
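For reference, gshare just XORs a global history register with low PC bits to index a table of 2-bit counters. A minimal sketch, with the history length and table size as my own assumptions:

```c
#include <stdint.h>

#define GHR_BITS 8                 /* assumed global history length */
#define PHT_SIZE (1u << GHR_BITS)

static uint8_t  pht[PHT_SIZE];     /* 2-bit saturating counters */
static uint32_t ghr;               /* global history of recent outcomes */

/* gshare index: PC bits XOR global history. */
static inline uint32_t gshare_index(uint32_t pc)
{
    return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
}

int gshare_predict(uint32_t pc)
{
    return pht[gshare_index(pc)] >= 2;  /* predict taken? */
}

/* On branch resolution: train the counter, then shift in the outcome. */
void gshare_update(uint32_t pc, int taken)
{
    uint32_t idx = gshare_index(pc);
    if (taken  && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghr = ((ghr << 1) | (taken & 1u)) & (PHT_SIZE - 1);
}
```

Note that gshare by itself only predicts direction; targets still need their own storage.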
Given the first results I get with my approach, there doesn't seem to be real room for a "massive increase" anyway, and I'm not looking to extract the last few %. As I said, I'll leave that to Intel, AMD, ARM (maybe SiFive?). Not my area.
At the beginning, I was even considering no dynamic BP at all. With the one I implemented, I get an average speedup of +15-20%, and even over +10% in degenerate cases such as the Takeuchi function, so I'm not convinced at this point that I'm going to try to improve this further. As I said, the only thing I'm still considering is a return stack buffer, and even on that I'm not completely decided.
My doubt about the real benefit of RSBs is that, except for very small functions (or functions that are called often and return very early, thus not executing much on average), the context saving/restoring on the stack takes a significant number of cycles compared to the return itself, so the misprediction penalty on returns may be relatively negligible. Of course there are always cases where it would make a difference, but I'm not sure it's worth it.
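To make the trade-off concrete: an RSB is just a small circular stack of link addresses, pushed on calls and popped on returns. A minimal sketch, with the depth being my assumption:

```c
#include <stdint.h>

#define RSB_DEPTH 8  /* assumed depth; overflows silently wrap */

static uint32_t rsb[RSB_DEPTH];
static unsigned rsb_top;

/* On a call (link-register write), push the sequential return address. */
void rsb_push(uint32_t return_pc)
{
    rsb[rsb_top] = return_pc;
    rsb_top = (rsb_top + 1) % RSB_DEPTH;  /* oldest entry is overwritten */
}

/* On a return, pop the predicted target for the fetch stage. */
uint32_t rsb_pop(void)
{
    rsb_top = (rsb_top + RSB_DEPTH - 1) % RSB_DEPTH;
    return rsb[rsb_top];
}
```

The hardware cost is tiny; per the reasoning above, the question is whether the cycles it saves on returns are visible next to the prologue/epilogue cost of the functions being returned from.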
An illustration of this is the simple Takeuchi function test case I showed. You can see that even with a relatively poor prediction rate (still better than static), the average CPI is close to 1.1, which is also close to the average CPI I get on a mix of benchmarks. Not only do I consider 1.1 a decent figure here (for my requirements, anyway), but it seems to show that branch prediction for function calls and returns specifically may, in many cases, be marginal compared to prediction of all the other kinds of branches. Don't hesitate to prove me wrong here, but my tests and the rationale above seem to hold.
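For context, the Takeuchi function is roughly the following (this is the common "tak" benchmark variant; the exact variant I tested may differ in details). Nearly everything it executes is a call, a return, or a compare-and-branch, which is what makes it degenerate for a predictor:

```c
/* Heavily recursive: three nested recursive calls per invocation,
 * so call/return traffic dominates everything else. */
int tak(int x, int y, int z)
{
    if (x <= y)
        return z;
    return tak(tak(x - 1, y, z),
               tak(y - 1, z, x),
               tak(z - 1, x, y));
}
```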
Don't store the actual PC for the tag. Waste of time. It just forces branches that alias to fight each other every time instead of half the time.
My rationale is explained above.
As to performance, I've done some more comparative testing today. Without tags, the direction prediction rate itself, not surprisingly, increases very slightly (marginally, something on the order of +0.1%), but the target prediction rate decreases significantly (by several %).
Again, this has to be considered in light of a specific set of requirements and a specific pipeline implementation.
Of course, I'll be interested to see real tests of different approaches and how they compare, but raw performance is not the only criterion here.