I remember that in my implementation, immediate decoding was second on the critical path, after the pipeline hazard handling. I have not looked into the details, but I'd guess most of it comes from decoding the instruction to pick the control signals, not from the muxing itself.
If you take the approach of early microprocessors such as the 6502 and Z80, which are designed to assume they are fed valid code and don't try to diagnose invalid instructions, then you can minimize instruction decoding a lot. SeRV falls into that category, and I'd suggest doing the same if you're designing an embedded processor for minimal size that will only ever run a limited set of predetermined code.
In that case, RV32I instruction formats are entirely determined by bits 6:2, i.e. 5 bits. Bits 1:0 are always 11. Funct3 (bits 14:12) determines a specific function within an opcode class but not the instruction format. Funct7 (bits 31:25) is all 0s when it isn't part of a literal, except that bit 30 distinguishes sub from add (it's effectively the carry in, and indicates whether to invert rs2) and logical right shifts from arithmetic right shifts (i.e. it determines whether the high bits of the result get 0 or copies of the original bit 31). So bits 31:25 play no part in determining the instruction format or the format of any literal. The rd, rs1, and rs2 fields similarly play no part in determining the instruction format or literal decoding (except as being part of a literal).
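To make that concrete, here's a rough C illustration (purely software, not HDL, and the names are mine) of how the literal format falls directly out of bits 6:2, and where each literal bit then comes from:

    #include <stdint.h>

    /* Rough C illustration (not HDL): the RV32I literal format is fully
       determined by instruction bits 6:2; bits 1:0 are always 11. */
    enum imm_fmt { FMT_NONE, FMT_I, FMT_S, FMT_B, FMT_U, FMT_J };

    static enum imm_fmt imm_format(uint32_t insn)
    {
        switch ((insn >> 2) & 0x1f) {      /* bits 6:2 */
        case 0x00:                         /* LOAD   */
        case 0x04:                         /* OP-IMM */
        case 0x19: return FMT_I;           /* JALR   */
        case 0x08: return FMT_S;           /* STORE  */
        case 0x18: return FMT_B;           /* BRANCH */
        case 0x05:                         /* AUIPC  */
        case 0x0d: return FMT_U;           /* LUI    */
        case 0x1b: return FMT_J;           /* JAL    */
        default:   return FMT_NONE;        /* OP (R-type) and anything else
                                              we don't decode a literal for */
        }
    }

    /* Which instruction bits land in which literal bits, per format.
       Assumes arithmetic >> on signed ints for the sign extension. */
    static int32_t decode_imm(uint32_t insn)
    {
        switch (imm_format(insn)) {
        case FMT_I: return (int32_t)insn >> 20;
        case FMT_S: return (((int32_t)insn >> 20) & ~0x1f)
                         | ((insn >> 7) & 0x1f);
        case FMT_B: return (((int32_t)insn >> 19) & ~0xfff)
                         | ((insn << 4) & 0x800)
                         | ((insn >> 20) & 0x7e0)
                         | ((insn >> 7) & 0x1e);
        case FMT_U: return (int32_t)(insn & 0xfffff000);
        case FMT_J: return (((int32_t)insn >> 11) & ~0xfffff)
                         | (insn & 0xff000)
                         | ((insn >> 9) & 0x800)
                         | ((insn >> 20) & 0x7fe);
        default:    return 0;
        }
    }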
So any signal that depends only on the instruction format can be generated from the original instruction bits (bits 6:2) using a single LUT.
E.g. suppose you want a signal for "should bit 11 of the instruction's literal come from instruction bit 31, bit 20, bit 7, or always be 0?" That is a single Xilinx LUT outputting two select signals from bits 6:2 of the instruction as inputs. Those two signals can then feed into a second LUT that also takes bits 31, 20, and 7 as inputs and outputs bit 11 of the 32-bit constant.
So that's two LUTs and two LUT delays to get bit 11 of the constant/offset.
And that's the WORST case in decoding RV32I literals.
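In rough C terms, that two-level structure looks something like the sketch below (a software mirror only, not HDL; the function and enum names are mine):

    #include <stdint.h>

    /* First "LUT": 5 inputs (bits 6:2) -> a 2-bit select. */
    enum src11 { SRC_ZERO, SRC_BIT31, SRC_BIT20, SRC_BIT7 };

    static enum src11 sel_bit11(uint32_t insn)
    {
        switch ((insn >> 2) & 0x1f) {
        case 0x18: return SRC_BIT7;   /* BRANCH (B-type)      */
        case 0x1b: return SRC_BIT20;  /* JAL (J-type)         */
        case 0x05:
        case 0x0d: return SRC_ZERO;   /* AUIPC, LUI (U-type)  */
        default:   return SRC_BIT31;  /* I/S-type: sign bit   */
        }
    }

    /* Second "LUT": 5 inputs (2-bit select + insn bits 31, 20, 7),
       1 output: literal bit 11. */
    static uint32_t imm_bit11(uint32_t insn)
    {
        switch (sel_bit11(insn)) {
        case SRC_BIT31: return (insn >> 31) & 1;
        case SRC_BIT20: return (insn >> 20) & 1;
        case SRC_BIT7:  return (insn >> 7)  & 1;
        default:        return 0;
        }
    }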
Bits 10:5 of the literal are either constant 0 or else come from bits 30:25 of the instruction. So to generate those 6 bits, each bit needs a single LUT that inputs bits 6:2 plus the corresponding bit from 30:25.
That's assuming no simplification at all from some of bits 6:2 of the instruction perhaps being "don't care" for some of the literal fields. The fields whose bits come from either two places, or from two places or constant 0, could have their bits generated directly from the instruction bits by a single LUT per bit *if* it turns out you only need 4 (or fewer) of bits 6:2, since that leaves room in a 6-input LUT for the two data bits.
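Bits 10:5 are actually a case where that simplification happens. The "zero the field" condition only fires for LUI (01101 in bits 6:2) and AUIPC (00101), which differ only in bit 5, so the select genuinely needs just 4 of the 5 bits. A rough C mirror of one such per-bit LUT (not HDL, names mine, and assuming invalid opcodes are don't-cares as above):

    #include <stdint.h>
    #include <stdbool.h>

    /* "Zero literal bits 10:5" is true only for LUI/AUIPC; bit 5 of
       the opcode is a don't care, so only 4 opcode bits are needed. */
    static bool imm_10_5_is_zero(uint32_t insn)
    {
        uint32_t op = (insn >> 2) & 0x1f;    /* bits 6:2 */
        return (op & 0x17) == 0x05;          /* bits 6,4,3,2 == 0,1,0,1 */
    }

    /* One "LUT" per bit i in 0..5: inputs are those 4 opcode bits plus
       insn[25+i]; output is literal bit 5+i. */
    static uint32_t imm_bit_10_5(uint32_t insn, int i)
    {
        return imm_10_5_is_zero(insn) ? 0 : (insn >> (25 + i)) & 1;
    }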
That's for minimum-delay decoding. Minimum-LUT-count decoding would say to use 1 LUT for, say, decoded literal bits 31:20 that outputs an "instruction bits 31:20, or replicated bit 31" select signal, and then 6 dual-output LUTs to generate output bit pairs 31:30, 29:28, 27:26, 25:24, 23:22, and 21:20. (Bit 31 can of course be done with 0 LUTs, but whatever.) So 7 LUTs for 12 bits, with a 2-LUT delay. Half of the first LUT is available to generate some other useful signal.
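A C mirror of that sharing (again not HDL; the packing into dual-output LUTs is only implied by the comments, and the names are mine):

    #include <stdint.h>
    #include <stdbool.h>

    /* Shared select LUT (from bits 6:2): take insn[31:20] directly only
       for U-type (LUI, AUIPC).  J-type puts insn[31] into literal bit 20
       anyway, so the sign-replication branch already covers it. */
    static bool take_upper_bits(uint32_t insn)
    {
        uint32_t op = (insn >> 2) & 0x1f;
        return op == 0x0d || op == 0x05;     /* LUI, AUIPC */
    }

    /* One dual-output LUT per bit pair n+1:n (n = 20, 22, ..., 30);
       inputs: the shared select, insn[31], insn[n], insn[n+1]. */
    static uint32_t imm_bit_hi(uint32_t insn, int n)  /* literal bit n, 20..31 */
    {
        return take_upper_bits(insn) ? (insn >> n) & 1 : (insn >> 31) & 1;
    }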
Actually ... regarding invalid opcode detection: that calculation can be done in parallel with the minimised generation of the possible literal (and other things), and its delay doesn't really matter as it only has to be valid in time to prevent register writeback or a memory write. So it's purely a size thing.
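If you do keep the check, the idea in rough C (not HDL, and only the major-opcode part; a real check would also look at funct3/funct7 and the shift-immediate encodings):

    #include <stdint.h>
    #include <stdbool.h>

    /* Partial legality check, computed in parallel with literal decode;
       its result only needs to arrive in time to squash the register
       writeback / memory write strobes. */
    static bool major_opcode_known(uint32_t insn)
    {
        if ((insn & 3) != 3)                 /* bits 1:0 must be 11 in RV32I */
            return false;
        switch ((insn >> 2) & 0x1f) {        /* bits 6:2 */
        case 0x00: case 0x03: case 0x04: case 0x05:
        case 0x08: case 0x0c: case 0x0d:
        case 0x18: case 0x19: case 0x1b: case 0x1c:
            return true;                     /* LOAD, MISC-MEM, OP-IMM, AUIPC,
                                                STORE, OP, LUI, BRANCH, JALR,
                                                JAL, SYSTEM */
        default:
            return false;
        }
    }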