As Marco just pointed out, you can only save a few multiply-adds on some of the butterflies in the early stages of the FFT. I had the same problem using 2D FFTs with overlap-save image convolution. There is just no easy way to untangle the signal flow graph, and the extra overhead needed just to remove a few butterfly calculations would probably slow things down.
You might get some savings using the Prime Factor Algorithm (PFA), as there are no interstage twiddle factors, but the index mapping is modulo arithmetic that depends on the coprime factors of the length. If you don't have hardware address generators available, the modulo addressing would require extra overhead.
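To show what the modulo indexing looks like, here is a rough sketch of the Good-Thomas index maps for a length N = N1*N2 with N1 and N2 coprime. The sizes 3 and 5, the test input and all the variable names are just my own choices for illustration, and it leans on the built-in fft for the short transforms rather than hand-written small-N modules.

```matlab
% Prime factor (Good-Thomas) index mapping, N = N1*N2 with gcd(N1,N2) = 1
N1 = 3; N2 = 5; N = N1*N2;            % example sizes (assumed)
x = (1:N) + 1i*(N:-1:1);              % arbitrary test input

% Multiplicative inverses found by brute-force search
a = find(mod(N2*(1:N1), N1) == 1, 1); % N2*a = 1 (mod N1)
b = find(mod(N1*(1:N2), N2) == 1, 1); % N1*b = 1 (mod N2)

% Modulo index maps for input and output
[i1, i2] = ndgrid(0:N1-1, 0:N2-1);
nin  = mod(N2*i1 + N1*i2, N);         % input index n(n1,n2)
kout = mod(N2*a*i1 + N1*b*i2, N);     % output index k(k1,k2), CRT map

% N1-point FFTs down the columns, then N2-point FFTs along the rows,
% with no twiddle multiplications in between
X2 = fft(fft(x(nin+1), [], 1), [], 2);

% Unscramble the 2D result back into natural order
X = zeros(1, N);
X(kout+1) = X2;

err = max(abs(X - fft(x)));           % should be at rounding-error level
```

The two mod() maps are where the cost goes if you have no address generators. In a real implementation they would normally be precomputed tables or incremental counters with wraparound.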
I think your best bet, because your data is real, is to use the double-length algorithm, which computes a length-N real FFT using a length-N/2 complex FFT. I found a bit of Matlab code I'd written a while back. It's only half completed because it only covers FFT lengths that are multiples of 4.
% Function CFFT_Double_Length(R,N)
% Calculates FFT of real signal R(1:N) using length N/2 FFT
% N must be a multiple of 4 as only the 2nd case from Smith & Smith is used.
% Return complex result z(1:N)
% Reference: Handbook of Real-Time Fast Fourier Transforms
% Winthrop W. Smith and Joanne M. Smith
function z = CFFT_Double_Length(R,N)
if mod(N,4) ~= 0, error('N must be a multiple of 4'); end
N2 = N/2;
N4 = N/4;
% Pack real signal R(1:N) into complex array a(1:N/2)
for i = 0:N2-1
a(i+1) = complex(R(2*i+1),R(2*i+2));
end
% Calculate length N/2 FFT
b = fft(a);
% Split real and imag parts of b into even-symmetric (rp, ip)
% and odd-symmetric (rm, im) components
for i = 1:N4-1
rp(i+1) = 0.5*(real(b(i+1)) + real(b(N2-i+1)));
rp(N2-i+1) = rp(i+1);
rm(i+1) = 0.5*(real(b(i+1)) - real(b(N2-i+1)));
rm(N2-i+1) = -rm(i+1);
ip(i+1) = 0.5*(imag(b(i+1)) + imag(b(N2-i+1)));
ip(N2-i+1) = ip(i+1);
im(i+1) = 0.5*(imag(b(i+1)) - imag(b(N2-i+1)));
im(N2-i+1) = -im(i+1);
end
rp(1) = real(b(1));
ip(1) = imag(b(1));
rm(1) = 0.0;
im(1) = 0.0;
rm(N4+1) = 0.0;
im(N4+1) = 0.0;
rp(N4+1) = real(b(N4+1));
ip(N4+1) = imag(b(N4+1));
% Compute FFT output
for i = 1:N2-1
ar(i+1) = rp(i+1) + cos(i*pi/N2)*ip(i+1) - sin(i*pi/N2)*rm(i+1);
ar(N-i+1) = ar(i+1);
ai(i+1) = im(i+1) - cos(i*pi/N2)*rm(i+1) - sin(i*pi/N2)*ip(i+1);
ai(N-i+1) = -ai(i+1);
end
ar(1) = rp(1) + ip(1);
ai(1) = im(1) - rm(1);
ar(N2+1) = rp(1) - ip(1);
ai(N2+1) = 0.0;
z = complex(ar,ai);
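For checking the result, here is a compact vectorized sketch of the same pack / half-length FFT / split idea, compared against the built-in fft. This is not the Smith & Smith listing, just my own restatement with assumed names (E, O, W and so on), and written this way it should work for any even N rather than only multiples of 4.

```matlab
% Real FFT of length N via one complex FFT of length N/2 (vectorized sketch)
N = 16;                            % any even length (assumed for this sketch)
x = randn(1, N);                   % real test signal
M = N/2;

a = x(1:2:end) + 1i*x(2:2:end);    % pack even/odd samples into complex array
b = fft(a);                        % length N/2 complex FFT

k  = 0:M-1;
bc = conj(b(mod(M - k, M) + 1));   % conj(b[(M-k) mod M])
E  = 0.5*(b + bc);                 % FFT of the even-indexed samples
O  = -0.5i*(b - bc);               % FFT of the odd-indexed samples
W  = exp(-1i*pi*k/M);              % twiddles exp(-j*2*pi*k/N)

X = zeros(1, N);
X(1:M)   = E + W.*O;               % bins 0 .. N/2-1
X(M+1)   = E(1) - O(1);            % Nyquist bin N/2
X(M+2:N) = conj(X(M:-1:2));        % upper half by conjugate symmetry

err = max(abs(X - fft(x)));        % should be at rounding-error level
```

The same comparison, max(abs(z - fft(R))), is a quick way to test CFFT_Double_Length itself for a few lengths that are multiples of 4.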