OK, I found the problem, and the new code is in the GitHub repository. The sample code in the datasheet for calculating a0-a2 and e0-e2 needs the second SHA256 chunk, which of course you can only process once you have processed the first chunk, and you need that for the midstate anyway. So here is a test vector for checking whether an Avalon chip works, clock settings included, for 0.3 A:
22600017, 00000000, f1fc122b, c7f5d74d, f2b9441a, 2eb9cc59, ddc34729, db8ae7e1, 6cb85659, 9a431c20, 9524c593, 05c56713, 16e669ba, 2d2810a0, 07e86e37, 2f56a9da, cd5bce69, 7a78da2d, 41bda849, 42a14495
Each 32-bit word is sent lowest byte first and, within each byte, lowest bit first. The vector was calculated from the following getwork "data":
00000001ab02cd818b9e567ee21793cddef299feb29ad444a41b85b8000008a300000000c2b620e3758dfcff8bdb2304ae42b91e1e950e71aff797d7b09288fc2b12fcf14dd7f5c71a44b9f200000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000080020000
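To make the byte and bit order concrete, here is a minimal Python sketch. It only shows two things that can be read off the numbers above: the third to fifth words of the test vector are the byte-swapped first three words of the second 64-byte chunk of the getwork "data", and a word sent lowest byte and lowest bit first goes out as bits 0 to 31 in ascending order. The names swap32 and word_to_wire_bits are made up for this example, and I am not claiming anything about the meaning of the other words in the vector.

```python
import struct

# Hex characters 128..151 of the getwork "data" above, i.e. the first
# three 32-bit words of the second 64-byte chunk.
SECOND_CHUNK_HEAD = "2b12fcf14dd7f5c71a44b9f2"

# Third, fourth and fifth word of the test vector above.
VECTOR_WORDS = [0xF1FC122B, 0xC7F5D74D, 0xF2B9441A]


def swap32(word):
    """Reverse the byte order of one 32-bit word (hypothetical helper)."""
    return struct.unpack("<I", struct.pack(">I", word))[0]


def word_to_wire_bits(word):
    """Bits of one word in sending order: lowest byte first and lowest bit
    first within each byte, which is simply bit 0 up to bit 31."""
    return [(word >> i) & 1 for i in range(32)]


head_words = [int(SECOND_CHUNK_HEAD[i:i + 8], 16) for i in range(0, 24, 8)]
swapped = [swap32(w) for w in head_words]

print([f"{w:08x}" for w in swapped])      # ['f1fc122b', 'c7f5d74d', 'f2b9441a']
print(word_to_wire_bits(0x22600017)[:8])  # [1, 1, 1, 0, 1, 0, 0, 0]
```

The first eight bits printed are the low byte 0x17 of the first vector word, least significant bit first.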
The Avalon chip should respond immediately with 0x42a14815 (and after 3 seconds and 17 seconds with two other values). To get the expected nonce, subtract 0x180 (giving 0x42a14695) and then swap the byte order between little-endian and big-endian, which yields 0x9546A142.
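The conversion from the chip's response to the nonce can be sketched in a few lines of Python. The function name chip_response_to_nonce is made up for this example, and the offset 0x180 is simply the value from this test, so treat it as an assumption for other setups.

```python
def chip_response_to_nonce(response, offset=0x180):
    """Subtract the fixed offset, then reverse the four bytes of the result.

    The default offset is the value used in this post; the function name is
    hypothetical and not taken from any Avalon code.
    """
    adjusted = (response - offset) & 0xFFFFFFFF
    return int.from_bytes(adjusted.to_bytes(4, "little"), "big")


print(hex(chip_response_to_nonce(0x42A14815)))  # 0x9546a142
```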