At 7" diagonal and a 480:320 ratio, the visible area should be about 148mm wide and 99mm tall, and the pixel pitch about 0.308mm ≃ 82 DPI.
I wouldn't go below 8, because then it has to be implemented in VHDL, and if you use numbers that are not powers of two, you have to use real mul units rather than trivial shifts.
Multipliers? Whatever for? You should be using counters and additions in the display generator, not muls. A state machine!
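A minimal sketch of the counter idea, written in C as a stand-in for the HDL state machine. The names `scan_line_counters`, `GLYPH_W` and `LINE_PIXELS` are made up for illustration, and the line length of 384 pixels (64 cells of 6 pixels) is just an assumption. The glyph width is not a power of two, yet the loop needs only increments and one compare:

```c
#include <stdint.h>

#define GLYPH_W     6    /* assumed glyph width in pixels (not a power of two) */
#define LINE_PIXELS 384  /* assumed scan line length: 64 cells of 6 pixels */

/* Walk one scan line the way a hardware scanner would: two counters,
 * one compare, no multiply or divide. For each pixel x, col[x] receives
 * the character column and sub[x] the pixel position within that cell. */
static void scan_line_counters(uint8_t col[LINE_PIXELS], uint8_t sub[LINE_PIXELS])
{
    uint8_t char_col = 0;  /* which character cell we are in */
    uint8_t pix = 0;       /* pixel within the current cell */

    for (int x = 0; x < LINE_PIXELS; x++) {
        col[x] = char_col;
        sub[x] = pix;
        if (++pix == GLYPH_W) {  /* end of cell: reset one counter, bump the other */
            pix = 0;
            char_col++;
        }
    }
}
```

In hardware, `char_col` would drive the character buffer address and `pix` would select the bit within the fetched font byte.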
Consider using a framebuffer rotated 90 degrees, so that you have a 320×480 framebuffer and 8×6 fonts instead. It is quite possible that the native refresh order is that anyway – it is so in most EastRising/BuyDisplay displays, for example, although most controllers allow reprogramming it. This fixes the character height at 8 pixels, and thus 40 rows. However, the scanning is from left to right, so you could support both 6- and 8-pixel (wide) fonts, the same way other devices support multiple font heights.
The video display controller character buffer I'd make 64×64, and use 8×8 "font" storage, for a total of 4096 + 2048 = 6144 bytes of RAM for the two. Your display generator should be a state machine, where your character lookup is some dozen cycles ahead of the font data lookup, which is three or four cycles ahead of the output pixel scanner. If you use 8×8 font storage, even the character buffer to font storage lookup is just multiplying by 8 (a shift left by three bits) either way. With 6×8 fonts, the two extra bytes in each 8-byte slot could be used for character set type information, for example.
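As a rough sketch of that lookup in C (the name `font_offset` and the "slice" terminology are my own, and I'm assuming 8-bit character codes with one 8-byte slot per glyph, as in the 2048-byte figure above):

```c
#include <stdint.h>

#define GLYPH_SLOT_SHIFT 3  /* 8 bytes per glyph slot, so code << 3 */

/* Byte offset into font storage for one slice of a glyph.
 * ch:    value read from the character buffer (8-bit here)
 * slice: which of the 8 bytes in the glyph slot (0..7)
 * The "multiply by 8" is a shift, and the slice index is simply the
 * three low address bits, so no adder is needed either. */
static inline uint16_t font_offset(uint8_t ch, uint8_t slice)
{
    return (uint16_t)(((uint16_t)ch << GLYPH_SLOT_SHIFT) | (slice & 7u));
}
```

In an FPGA this is pure wiring: the character code forms the upper address bits of the font RAM and the slice counter the lower three.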
If you have plenty of RAM, I'd warmly suggest considering more bits per character, and thus a larger character set. Unused bits in the character buffer might come in handy: for example, the character buffer could be 64×64 with 16-bit entries while you only support 256/512/1024/2048/4096 distinct glyphs (by only using the 8/9/10/11/12 low bits of the character buffer values). Then, the two "extra" bytes in the 8×8 storage for 6×8 fonts could be used for a Unicode glyph type mask, and you could support UTF-8 (especially with a bit of help from userspace, reprogramming say the upper half of the character set depending on the glyphs (not just characters, but character combinations) needed to display at any point in time). If there is any ghosting, smooth scrolling (a pixel per frame) will yield more readable results than displaying a static frame for eight frames and then moving by an entire character, because of the way the two characters mix in the latter case. (But ghosting is nasty, and I suspect will be the deciding factor with your displays.)
I wonder what the experience is like using a font smaller than 8x16 on a 7" LCD.
That's why I included the link to a starting point: it includes several bitmap fonts on a web page where you can test how they look. You can screengrab bitmap images to be displayed on your LCD as soon as you get a picture generated.
To make a good decision, you have to look at the real-world display displaying real-world contents.
As to the ANSI escape code parser, having the display buffer width be 64 (characters) means you need only one shift and one add to obtain the character buffer address for any screen coordinates. Row down is adding 64, and row up is subtracting 64 from the current character buffer offset. If you use only 12 character buffer address bits, you get a very useful rollover at the boundaries, too: if you ever decide to make character-mode games (with custom character set), you can make very smooth endless horizontal and vertical scrolling with almost no "CPU" overhead – you need to replace fewer than 11 characters on average per pixel scrolled. Even so, I'd fix the character buffer to a fixed 64×64 size.
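A small C sketch of that addressing, assuming one entry per character cell in a 64×64 buffer (the helper names `cell_offset`, `cell_down` and `cell_up` are hypothetical):

```c
#include <stdint.h>

#define BUF_SHIFT 6        /* 64 columns per row, so row << 6 */
#define BUF_MASK  0x0FFFu  /* 12 address bits: 64 x 64 cells */

/* Character buffer offset for (row, col): one shift and one add.
 * Masking to 12 bits gives the rollover at the buffer boundaries. */
static inline uint16_t cell_offset(uint8_t row, uint8_t col)
{
    return (uint16_t)((((uint16_t)row << BUF_SHIFT) + col) & BUF_MASK);
}

/* Move one row down or up from an existing offset, wrapping around. */
static inline uint16_t cell_down(uint16_t off)
{
    return (uint16_t)((off + 64u) & BUF_MASK);
}

static inline uint16_t cell_up(uint16_t off)
{
    return (uint16_t)((off - 64u) & BUF_MASK);
}
```

Moving up from row 0 lands in row 63 of the same column, and moving down from row 63 lands back in row 0, which is what makes the endless scrolling cheap.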
If the FPGA has the grunt, or it only does the display generation, and you have enough memory for a 19200-byte framebuffer, I'd definitely use a two-plane approach, where you have a character buffer on top and a graphics buffer at the bottom, with the actual displayed value looked up from a 2×2-bit configuration register. That way, enabling only one or the other is just manipulating that 2×2-bit config register; with AND/OR/XOR/NAND/etc. modes corresponding to different register values.
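One way to read the configuration register is as a 4-entry truth table indexed by the text-plane bit and the graphics-plane bit. A sketch in C, where the bit ordering (index = 2·text + graphics) and the names `combine_pixel` and `MODE_*` are my assumptions, not anything a particular controller defines:

```c
#include <stdint.h>

/* 4-bit truth tables; bit index = (text << 1) | gfx (assumed ordering). */
enum plane_mode {
    MODE_GFX_ONLY  = 0xA,  /* 1010: output follows the graphics plane */
    MODE_TEXT_ONLY = 0xC,  /* 1100: output follows the text plane */
    MODE_AND       = 0x8,  /* 1000 */
    MODE_OR        = 0xE,  /* 1110 */
    MODE_XOR       = 0x6,  /* 0110 */
    MODE_NAND      = 0x7   /* 0111 */
};

/* Look up the displayed pixel from the two plane bits. */
static inline unsigned combine_pixel(unsigned cfg, unsigned text, unsigned gfx)
{
    unsigned index = ((text & 1u) << 1) | (gfx & 1u);
    return (cfg >> index) & 1u;
}
```

All sixteen register values are valid modes, so there is nothing to decode: in hardware this is a single 4:1 multiplexer per pixel.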
I am firmly of the opinion that this kind of device should have a (re)programmable part; I wouldn't implement it all in hard FPGA. Actually, I'd probably only implement the display scanning/generator part in FPGA, and leave the display buffer manipulation to something written in a higher-level programming language. C would be optimal, but assembly would be okay too. I personally would not want to write an ANSI escape code parser in VHDL or similar languages. Perhaps a small softcore?