Particular attention is needed on the last page and power supplies - I'm starting to think about connections between the SDRAM and the FPGA and I suspect I need to connect up some termination resistors somewhere, somehow?
As a rule of thumb, you only need dedicated termination if your traces are "electrically long", meaning the trace length is a significant fraction of the signal wavelength (the "magic number" is 10-15%). For 400 MHz, wavelength is 60 cm, so you will need to worry about termination if your traces are longer than 60-90 mm. In practice, if you only use a single DDR3 memory device and it is physically close to the FPGA, your traces will most likely be shorter than that.
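To make the arithmetic behind that rule of thumb explicit, here is a minimal Python sketch. The function names are mine, and the 600 mm wavelength is simply the figure quoted above; the real wavelength on your board depends on the dielectric, so treat the input as an assumption to replace with your own number.

```python
def electrically_long_range_mm(wavelength_mm, low_fraction=0.10, high_fraction=0.15):
    """Return the (low, high) trace lengths in mm where termination becomes a concern.

    The 10-15% fractions are the rule-of-thumb "magic number", not a hard limit.
    """
    return (wavelength_mm * low_fraction, wavelength_mm * high_fraction)


def is_electrically_long(trace_length_mm, wavelength_mm, fraction=0.10):
    """True if the trace reaches the chosen fraction of a wavelength."""
    return trace_length_mm >= wavelength_mm * fraction


# Assumption: 600 mm wavelength, as quoted in the post for 400 MHz.
low_mm, high_mm = electrically_long_range_mm(600.0)
print(f"Start worrying about termination at roughly {low_mm:.0f}-{high_mm:.0f} mm")

# Hypothetical trace lengths, for illustration only.
print(is_electrically_long(25.0, 600.0))
print(is_electrically_long(75.0, 600.0))
```

A 25 mm trace falls well under the threshold while a 75 mm trace does not, which is why a single memory device placed right next to the FPGA usually gets away without dedicated termination.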
That said, treat these rules of thumb for what they are. If in doubt, I normally run board-level simulations to figure out whether I need termination and, if so, what the best value for it is. Even if you don't have any termination on the address/control lines, you should still have differential termination on the clock line (typically a 100 Ohm resistor between the + and - traces) as close to the receiver pins (the memory device) as possible. I always use a 0402 resistor placed under the memory chip, right next to the breakout vias that go up to the clock balls on the top layer. This way I minimize the stub length.
What makes or breaks a DDR3 interface is the layout. Here are a few rules that need to be observed:
1. Adjacent traces are at least 2H away from each other (where H is the distance between a trace and its reference plane), and at least 1H in breakout regions (under BGAs there is typically not enough space for 2H spacing).
2. Reference plane(s) MUST be contiguous and should not have any cutouts under any DDR3-related traces. Again, this is not always possible under a BGA, but do your best to satisfy it. A break in the reference plane leads to a BIG impedance discontinuity in the trace (~50 Ohm to 80-90 Ohm) and causes signal reflections.
3. Clock lines have to be the longest traces among all DDR3 traces. See attached picture ("borrowed" from IMX6 datasheet) for all length matching rules. If the datasheet/user guide/application note for your FPGA calls for different tolerances, use them instead of what I posted.
4. Each additional transition between signal layers causes impedance breaks too, so minimize layer changes for data traces. Try your best to limit them to a single via on each end (in breakout regions of memory device and FPGA).
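The rules above can be turned into a few quick sanity checks. The sketch below is illustrative only: the net names, trace lengths and the 0.1 mm dielectric height are hypothetical, the function names are mine, and the reflection estimate uses the standard formula (Z2 - Z1) / (Z2 + Z1) for a step between two impedances.

```python
def min_spacing_mm(h_mm, in_breakout=False):
    """Rule 1: 2H spacing in open areas, relaxed to 1H in breakout regions."""
    return (1.0 if in_breakout else 2.0) * h_mm


def clock_is_longest(lengths_mm, clock_nets):
    """Rule 3: every clock net is at least as long as every other net."""
    clock = [lengths_mm[name] for name in clock_nets]
    others = [length for name, length in lengths_mm.items() if name not in clock_nets]
    return min(clock) >= max(others)


def reflection_coefficient(z_line, z_break):
    """Rule 2: fraction of the incident wave reflected at an impedance step."""
    return (z_break - z_line) / (z_break + z_line)


# Hypothetical stackup: trace sits 0.1 mm above its reference plane.
print(min_spacing_mm(0.1))
print(min_spacing_mm(0.1, in_breakout=True))

# Hypothetical routed lengths in mm.
lengths = {"CK_P": 32.0, "CK_N": 32.0, "A0": 30.5, "DQ0": 28.0}
print(clock_is_longest(lengths, {"CK_P", "CK_N"}))

# 50 Ohm trace crossing a plane break that raises it to 80-90 Ohm.
for z_break in (80.0, 90.0):
    print(f"{z_break:.0f} Ohm: {reflection_coefficient(50.0, z_break):.2f}")
```

For the 50 Ohm to 80-90 Ohm jump mentioned in rule 2, this works out to roughly 23-29% of the incident signal being reflected at the break. It does not replace the length-matching tolerances from your FPGA vendor's documentation.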
Also, the DDR3 needs a reference voltage - how would be best to supply this? Will it mean another power supply on the board?
There are devices designed specifically for this task, for example the TPS51200. But in your case you can likely get away with a simple resistive divider: two 1K resistors, each with a 10 nF filter cap in parallel, plus a single 1 uF decoupling cap between the midrail and ground. At least that's how I've been doing it, and so far all my DDR2/3/3L interfaces have worked from the first revision. Just make sure this midrail is kept away from noisy traces.
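For a feel of the numbers in the divider approach, here is a small Python sketch. It assumes the divider hangs off the memory supply rail (1.5 V nominal for DDR3, 1.35 V for DDR3L), which the post does not state explicitly, and the function name is mine.

```python
def divider(vdd, r_top, r_bottom):
    """Return (midrail voltage, standing current in A, source resistance in Ohm)."""
    total = r_top + r_bottom
    vref = vdd * r_bottom / total          # unloaded midrail voltage
    current = vdd / total                  # current burned in the divider
    r_source = r_top * r_bottom / total    # Thevenin resistance seen by the load
    return vref, current, r_source


# Assumption: nominal supply rails, two 1K resistors as suggested above.
for name, vdd in (("DDR3", 1.5), ("DDR3L", 1.35)):
    vref, current, r_source = divider(vdd, 1000.0, 1000.0)
    print(f"{name}: VREF = {vref:.3f} V, "
          f"divider current = {current * 1e3:.3f} mA, "
          f"source resistance = {r_source:.0f} Ohm")
```

With equal resistors the midrail sits at half the supply and tracks it, at the cost of under a milliamp of standing current. The 500 Ohm source resistance is the reason this only suits a lightly loaded reference input; if the midrail also has to source or sink termination current, a dedicated part like the TPS51200 is the safer route.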