Author Topic: CUDA Autorouter  (Read 15326 times)


Offline gaddra (Topic starter)

  • Contributor
  • Posts: 14
  • Country: us
CUDA Autorouter
« on: April 24, 2015, 07:41:22 pm »
I've been working on writing an autorouter with massive parallelization using CUDA. More of a learning exercise, but could potentially get somewhat decent with brute force ( :-DD).

The algorithm works like this:
1. The initial board map is loaded
2. The net start and ending locations are marked on the map
3. Each cell (right now each pixel) checks its neighbors. If a neighbor is the source end of the net or already holds a direction vector, the cell points to it.
4. After a cell points to an adjacent cell, it checks the cell on the opposite side. If that cell is the destination of the net, the location is recorded, which triggers termination of the search.
5. Back on the CPU, the net is traced out starting at the destination by following the direction vectors back to the start.
6. Reset the direction vectors, and repeat.

So the shortest path is guaranteed, with optional preferred directions.
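
Roughly, one expansion step looks like this. This is a stripped-down sketch rather than the exact code in the repo - the cell markers and the 1-4 direction encoding are made up for illustration:

Code:
// Assumed cell encoding for this sketch (the real code may differ):
//   EMPTY = unvisited, BLOCKED = obstacle/keepout,
//   SOURCE = net start, TARGET = net end,
//   1..4   = cell points back toward its N/E/S/W neighbor.
enum { EMPTY = 0, BLOCKED = -1, SOURCE = -2, TARGET = -3 };

__global__ void expandWavefront(const int* in, int* out,
                                int width, int height,
                                int* foundIdx)          // set when the target is reached
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx  = y * width + x;
    int cell = in[idx];
    out[idx] = cell;                     // default: carry the state forward
    if (cell != EMPTY) return;           // only unvisited cells look around

    const int dx[4] = { 0, 1, 0, -1 };   // N, E, S, W
    const int dy[4] = { -1, 0, 1, 0 };

    for (int d = 0; d < 4; ++d) {
        int nx = x + dx[d], ny = y + dy[d];
        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
        int n = in[ny * width + nx];
        if (n == SOURCE || (n >= 1 && n <= 4)) {   // neighbor already reached
            out[idx] = d + 1;                      // step 3: point back at it
            // Step 4: check the cell on the opposite side; if it is the
            // destination, record this location so the host can stop iterating.
            int ox = x - dx[d], oy = y - dy[d];
            if (ox >= 0 && oy >= 0 && ox < width && oy < height &&
                in[oy * width + ox] == TARGET)
                atomicExch(foundIdx, idx);
            return;
        }
    }
}

The host launches that kernel repeatedly (ping-ponging the in/out buffers) until foundIdx gets set or the board stops changing.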

Here are some examples with 6 routes (starting points are random, not actually from a particular pad).

Single Layer:



Multilayer with a low via cost:



Multilayer with a high via cost:



Multilayer with annular rings:



Each route on this board (equivalent to a 5cm square board with 6mil clearance) takes ~0.2 seconds on a Tesla C2050, and ~3 seconds on my laptop's NVS5400M.

Source:
https://github.com/gaddra/Router/tree/Multilayer
 

Offline free_electron

  • Super Contributor
  • ***
  • Posts: 8520
  • Country: us
    • SiliconValleyGarage
Re: CUDA Autorouter
« Reply #1 on: April 24, 2015, 08:35:43 pm »
0.2 seconds for a single route ? on a board that has nothing routed ?
time to rethink your algorithm ....
Professional Electron Wrangler.
Any comments, or points of view expressed, are my own and not endorsed , induced or compensated by my employer(s).
 

Offline c4757p

  • Super Contributor
  • ***
  • Posts: 7799
  • Country: us
  • adieu
Re: CUDA Autorouter
« Reply #2 on: April 25, 2015, 02:32:57 am »
0.2 seconds for a single route ? on a board that has nothing routed ?
time to rethink your algorithm ....

meh, it runs circles around FreeRouting
No longer active here - try the IRC channel if you just can't be without me :)
 

Offline Bud

  • Super Contributor
  • ***
  • Posts: 7003
  • Country: ca
Re: CUDA Autorouter
« Reply #3 on: April 25, 2015, 04:46:07 am »
0.2 seconds for a single route ?
Well,  because it is ...
Quote
...autorouter with massive parallelization using CUDA
most of the time is spent deciding which CUDA core to assign the task to
 :D
Facebook-free life and Rigol-free shack.
 

Offline gaddra (Topic starter)

  • Contributor
  • Posts: 14
  • Country: us
Re: CUDA Autorouter
« Reply #4 on: April 25, 2015, 05:26:44 am »
I think the important thing to realize is that the time to solve has nothing to do with the complexity of the board; it is directly related to the length of the best possible route for the net. If it takes significantly longer, that's a problem with placement, not routing (after all, every possible path is explored at once). So 0.2 s might sound high, but that time changes very little from net to net. It would be the same with blind vias, high layer counts, and dense layouts, provided a reasonable solution actually exists.
In addition, this method finds the shortest distance from any point on the map to the source, so if multiple pads are part of the same net, every connection is computed at once.
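
To give an idea of step 5, the host-side trace-back is roughly this (again just a sketch using the same made-up direction encoding as in the kernel sketch above, not the actual repo code). Because the whole map ends up holding direction vectors, the same function can be called once per pad of a multi-pad net:

Code:
#include <vector>
#include <utility>

// Walk from a destination cell back to the source by following the stored
// direction codes (1..4 = points toward N/E/S/W). Stops when it reaches a
// non-directional cell, i.e. the SOURCE marker.
std::vector<std::pair<int,int>> backtrace(const std::vector<int>& grid,
                                          int width, int startIdx)
{
    const int dx[4] = { 0, 1, 0, -1 };
    const int dy[4] = { -1, 0, 1, 0 };

    std::vector<std::pair<int,int>> path;
    int idx = startIdx;
    while (grid[idx] >= 1 && grid[idx] <= 4) {
        int x = idx % width, y = idx / width;
        path.emplace_back(x, y);
        int d = grid[idx] - 1;                    // direction this cell points in
        idx = (y + dy[d]) * width + (x + dx[d]);  // step toward the source
    }
    return path;
}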
I'm not qualified to comment on KiCad; this is a learning project in both C++ and CUDA. No idea how KiCad works  :P

As for deciding which CUDA core is assigned the task, the CUDA kernel has an array of data describing the state of the board and each thread operates on one piece of that data.

There are lots of optimizations to be made, primarily related to memory. Memory is a big bottleneck. My kernel is not written too well right now either.

« Last Edit: April 25, 2015, 05:29:02 am by gaddra »
 

Offline benjamintmiller

  • Newbie
  • Posts: 2
Re: CUDA Autorouter
« Reply #5 on: April 26, 2015, 03:40:54 am »
You're solving the easiest problem in autorouting -- how to get from point A to point B with minimal cost when there are obstacles in the way.  This sort of problem is often represented using a directed graph, though the pixel approach you've chosen will work.

Now try five points all on the same net - A, B, C, D, and E.  Find points to connect them that will minimize total distance.  (Hint: you will not be able to brute force this: http://en.wikipedia.org/wiki/Steiner_tree_problem)

Now that you've come to a good approximation of the interconnects for a single net, you need to add all other nets in.  Every trace that you add has to route around every existing trace, and every future trace must route around the current trace.

Circuit autorouting is one of the hardest problems in all of computer science.  It has been mathematically proven to be intractable to brute force (with classical computers, at least).  The reason all circuit autorouters use heuristics is that they sometimes give a solution that is good enough and won't run forever.
 

Offline amyk

  • Super Contributor
  • ***
  • Posts: 8334
Re: CUDA Autorouter
« Reply #6 on: April 26, 2015, 08:45:33 am »
Solving a theoretical problem, which essentially means arbitrary-precision coordinates for each point, is difficult; but in reality, points are constrained to a grid and that reduces the search space significantly.

"In theory, there is no difference between theory and practice. In practice, there is."
 

Offline awallin

  • Frequent Contributor
  • **
  • Posts: 694
Re: CUDA Autorouter
« Reply #7 on: April 26, 2015, 10:02:59 am »
Solving a theoretical problem, which essentially means arbitrary-precision coordinates for each point, is difficult; but in reality, points are constrained to a grid and that reduces the search space significantly.

I think the complexity comes from a combinatorial consideration, i.e. with an increasing number of nodes the number of possible topologies or connection patterns of the network grows exponentially. This has little or nothing to do with how you represent the node coordinates (pixels, integers, floats, whatever).

 

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3818
  • Country: de
Re: CUDA Autorouter
« Reply #8 on: April 26, 2015, 11:43:27 am »
It isn't clear from your description, but to me it looks like you are using a naive forward-going search. Dijkstra's algorithm, or even better the A* (a-star) algorithm, would be a much better choice - they are optimal.

On the other hand, these are not easily parallelizable. I would strongly suggest forgetting CUDA for the time being, focusing on the graph search first and then trying to optimize by running things in parallel where it makes sense, because the data encoding/decoding overhead for the CUDA calls is very significant and makes little sense for small tasks.

 
 

Offline DanielS

  • Frequent Contributor
  • **
  • Posts: 798
Re: CUDA Autorouter
« Reply #9 on: April 26, 2015, 02:40:12 pm »
The demonstration would be more convincing if it showcased routing of an actual design instead of routing between arbitrary points.

In a real design, you would also need to consider delay-matching for parallel busses and their clocks/strobes, pairing of differential pairs, and a bunch of other things. But as a starting point, most people would likely be more interested in seeing how well your auto-router handles more realistic 10-12 lane point-to-(multi-)point busses between digital ICs, a handful of miscellaneous serial interfaces, or some arbitrary analog layouts in an actual design - can your router make sense of a sensibly placed real-world design? Can it do so comparably well and fast compared to other auto-routers?

As others have said, tree searches are among the more efficient algorithms for this sort of application, and those do not parallelize very well, since every route has to consider every current and future route. Outperforming them with a brute-force search might not be anywhere near as easy or effective as it sounds, even before considering the routing needs of modern designs.
 

Offline alexanderbrevig

  • Frequent Contributor
  • **
  • Posts: 700
  • Country: no
  • Musician, developer and EE hobbyist
    • alexanderbrevig.com
Re: CUDA Autorouter
« Reply #10 on: April 26, 2015, 02:57:48 pm »
I would strongly suggest forgetting CUDA for the time being [...]

Probably not the best advice given the OP's intention with the project.

[...] More of a learning exercise, but could potentially get somewhat decent [...]

Nice project @gaddra! I recently had a conversation with someone on here about using Genetic Algorithms (with CUDA) for these kinds of problems. Interesting approach with the pixel map though :)
 

Offline Mechanical Menace

  • Super Contributor
  • ***
  • Posts: 1288
  • Country: gb
Re: CUDA Autorouter
« Reply #11 on: April 26, 2015, 03:09:56 pm »
Just out of interest why the choice of CUDA and not OpenCL?
Second sexiest ugly bloke on the forum.
"Don't believe every quote you read on the internet, because I totally didn't say that."
~Albert Einstein
 

Offline Alex Eisenhut

  • Super Contributor
  • ***
  • Posts: 3386
  • Country: ca
  • Place text here.
Re: CUDA Autorouter
« Reply #12 on: April 26, 2015, 03:15:51 pm »
How did they do it in the '60s?
Hoarder of 8-bit Commodore relics and 1960s Tektronix 500-series stuff. Unconventional interior decorator.
 

Offline tonyarkles

  • Regular Contributor
  • *
  • Posts: 118
Re: CUDA Autorouter
« Reply #13 on: April 27, 2015, 03:29:08 am »
How did they do it in the '60s?

Very carefully! Probably on stacks of transparency/tracing paper.
 

Offline gaddra (Topic starter)

  • Contributor
  • Posts: 14
  • Country: us
Re: CUDA Autorouter
« Reply #14 on: April 27, 2015, 03:52:04 am »
Thanks for the feedback.
Just out of interest why the choice of CUDA and not OpenCL?
I have access to an Nvidia Tesla cluster, and my laptop has Nvidia graphics. I also preferred to use something more fully developed.
There's a lot to learn.  My advice is to understand the hardware, then code around it.   
Definitely - that's again why I chose to stick with the Nvidia architecture rather than trying to generalize.
The demonstration would be more convincing if it showcased routing of an actual design instead of routing between arbitrary points.

In a real design, you would also need to consider delay-matching for parallel busses and their clocks/strobes, pairing of differential pairs, and a bunch of other things. But as a starting point, most people would likely be more interested in seeing how well your auto-router handles more realistic 10-12 lane point-to-(multi-)point busses between digital ICs, a handful of miscellaneous serial interfaces, or some arbitrary analog layouts in an actual design - can your router make sense of a sensibly placed real-world design? Can it do so comparably well and fast compared to other auto-routers?

As others have said, tree searches are among the more efficient algorithms for this sort of application, and those do not parallelize very well, since every route has to consider every current and future route. Outperforming them with a brute-force search might not be anywhere near as easy or effective as it sounds, even before considering the routing needs of modern designs.
The image is from a board I made, so the pads are from an actual design. I'm using pictures to import the data right now so I have no easy way of importing a netlist, hence the random points. I have been thinking about how to route multiple nets well in a parallelized way, and will be exploring some options in the future.
You're solving the easiest problem in autorouting -- how to get from point A to point B with minimal cost when there are obstacles in the way.  This sort of problem is often represented using a directed graph, though the pixel approach you've chosen will work.

Now try five points all on the same net - A, B, C, D, and E.  Find points to connect them that will minimize total distance.  (Hint: you will not be able to brute force this: http://en.wikipedia.org/wiki/Steiner_tree_problem)

Now that you've come to a good approximation of the interconnects for a single net, you need to add all other nets in.  Every trace that you add has to route around every existing trace, and every future trace must route around the current trace.

Circuit autorouting is one of the hardest problems in all of computer science.  It has been mathematically proven to be intractable to brute force (with classical computers, at least).  The reason all circuit autorouters use heuristics is that they sometimes give a solution that is good enough and won't run forever.
Thanks for the link; it's good to have a name for the problem. I'll think about a decent way to go about this in this implementation and see how well (or not) it performs. The furthest I've gotten is short nets first, long nets last. Though I suppose that's the hard part  :P
Nice project @gaddra! I recently had a conversation with someone on here about using Genetic Algorithms (with CUDA) for these kinds of problems. Interesting approach with the pixel map though :)
Interesting - there seems to be a big push to move neural net systems to the GPU, and Nvidia recently released a deep neural network library. I'll take a look at that.

I've also cleaned up the code a bit and gained some speed. Previously I had propagation from both ends of the net, but removed that to implement multilayer support. I'm currently working on re-implementing it, which should give a decent performance bump.
« Last Edit: April 27, 2015, 03:56:07 am by gaddra »
 

Offline gaddra (Topic starter)

  • Contributor
  • Posts: 14
  • Country: us
Re: CUDA Autorouter
« Reply #15 on: April 27, 2015, 04:03:22 am »
It isn't clear from your description, but to me it looks like you are using a naive forward-going search. Dijkstra's algorithm, or even better the A* (a-star) algorithm, would be a much better choice - they are optimal.

On the other hand, these are not easily parallelizable. I would strongly suggest forgetting CUDA for the time being, focusing on the graph search first and then trying to optimize by running things in parallel where it makes sense, because the data encoding/decoding overhead for the CUDA calls is very significant and makes little sense for small tasks.

This is actually pretty similar to both of those searches (Dijkstra's is a type of A* as far as I'm aware). In A* you head toward the goal. The advantage of parallelization is that you don't have to wait until you hit a block; you just explore every path at once. This shows it well:
http://nullprogram.com/blog/2014/06/22/

Data encoding and decoding is not much of an issue; the CUDA kernel launch is very intuitive in that you can launch a 1D array of threads for 1D problems (adding two vectors), a 2D array (like this), or a 3D array (also like this, for multilayer boards). You can go as far as a 3-dimensional grid of blocks, with each block containing a 3-dimensional thread array. In addition, the data can be spatially mapped to texture memory (not used right now), so data one row above is physically close (fast access). In contrast, a linear memory mapping (i.e. an array of WIDTH*HEIGHT length) puts that data WIDTH locations away.
http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html
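
For example, the launch side looks roughly like this - the kernel body is just a stand-in and the block size is illustrative, not a tuned value from the actual code:

Code:
#include <cuda_runtime.h>

// Toy kernel just to show the thread-to-cell mapping; the real kernel does
// the neighbor checks described earlier.
__global__ void touchCell(int* board, int width, int height, int layers)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;                              // board layer
    if (x >= width || y >= height || z >= layers) return;

    // Row-major flat index: the cell one row up sits 'width' elements away,
    // which is the locality problem a texture / 2D CUDA array avoids.
    int idx = (z * height + y) * width + x;
    board[idx] += 1;
}

void launchOverBoard(int* d_board, int width, int height, int layers)
{
    dim3 threads(16, 16, 1);                         // illustrative block size
    dim3 blocks((width  + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y,
                layers);
    touchCell<<<blocks, threads>>>(d_board, width, height, layers);
}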
« Last Edit: April 27, 2015, 04:11:49 am by gaddra »
 

Offline Alex Eisenhut

  • Super Contributor
  • ***
  • Posts: 3386
  • Country: ca
  • Place text here.
Re: CUDA Autorouter
« Reply #16 on: April 27, 2015, 04:15:30 am »
How did they do it in the '60s?

Very carefully! Probably on stacks of transparency/tracing paper.

No, they knew they had to automate routing back in the 1950s.

For example
http://dl.acm.org/citation.cfm?id=320896&dl=ACM&coll=DL&CFID=506231791&CFTOKEN=15993635

In the '60s IBM was auto-routing the backplanes for their 360, and they said it took them half an hour. The attached images start off talking about routing ICs, but the last paragraph on the first page talks about the board.

It's just tough to find all these papers and magazine articles. I mean, they made 6-layer auto-routed boards filled with flatpacks edge-to-edge. I guess unlimited government budgets for military toys do that. But all I have are pretty pictures, never mind the source code...
« Last Edit: April 27, 2015, 04:20:24 am by Alex Eisenhut »
Hoarder of 8-bit Commodore relics and 1960s Tektronix 500-series stuff. Unconventional interior decorator.
 

Offline eneuro

  • Super Contributor
  • ***
  • Posts: 1528
  • Country: 00
Re: CUDA Autorouter
« Reply #17 on: April 27, 2015, 05:39:22 am »
The demonstration would be more convincing if it showcased routing of an actual design instead of routing between arbitrary points.
Even routing between arbitrary points in the OP's single-layer demo shows a useless circuit with traces between IC pins, since he was not able to define such a basic thing as trace width  :popcorn:

I've almost finished my cuGA (CUDA genetic algorithm) implementation and am testing it with real-world problems.

Autorouting combined with automatic IC placement (when parts are not constrained to preselected positions) is much more interesting, since the most difficult and challenging part of PCB design is placing elements in orientations and positions relative to each other that allow smart routing.
Once those elements are nicely placed on the PCB, it is quite easy to draw the tracks manually.

So, the autorouting shown in the OP's demos is useless in practice, since it doesn't do any placement of elements on the PCB  ;)

BTW: How powerful is the OP's Nvidia CUDA GPU? 2000 cores or more?
12oV4dWZCAia7vXBzQzBF9wAt1U3JWZkpk
“Let the future tell the truth, and evaluate each one according to his work and accomplishments. The present is theirs; the future, for which I have really worked, is mine”  - Nikola Tesla
-||-|-
 

Offline Mechanical Menace

  • Super Contributor
  • ***
  • Posts: 1288
  • Country: gb
Re: CUDA Autorouter
« Reply #18 on: April 27, 2015, 10:41:36 am »
Just out of interest why the choice of CUDA and not OpenCL?
I have access to an Nvidia Tesla cluster, and my laptop has Nvidia graphics. I also preferred to use something more fully developed.

That makes a lot of sense. I like CL, and Nvidia have been very good at not artificially crippling its performance, but the toolchain (or lack thereof) still needs a lot of work.

Thanks for the response.
Second sexiest ugly bloke on the forum.
"Don't believe every quote you read on the internet, because I totally didn't say that."
~Albert Einstein
 

Offline gaddra (Topic starter)

  • Contributor
  • Posts: 14
  • Country: us
Re: CUDA Autorouter
« Reply #19 on: April 27, 2015, 04:12:28 pm »
Even routing between arbitrary points in the OP's single-layer demo shows a useless circuit with traces between IC pins, since he was not able to define such a basic thing as trace width  :popcorn:

I've almost finished my cuGA (CUDA genetic algorithm) implementation and am testing it with real-world problems.

Autorouting combined with automatic IC placement (when parts are not constrained to preselected positions) is much more interesting, since the most difficult and challenging part of PCB design is placing elements in orientations and positions relative to each other that allow smart routing.
Once those elements are nicely placed on the PCB, it is quite easy to draw the tracks manually.

So, the autorouting shown in the OP's demos is useless in practice, since it doesn't do any placement of elements on the PCB  ;)

BTW: How powerful is the OP's Nvidia CUDA GPU? 2000 cores or more?
The trace width can be changed. I hacked together a way to read/visualize the results so that's why you don't see that. People seem to have higher expectations for this learning exercise that's a few weeks in than for products already on the shelf  :-//
I test primarily on an NVS5400M (96 cores), which is pretty low end. I also test on a Tesla C2050, which is a higher-end, older-generation GPGPU (448 cores).
 

Offline eneuro

  • Super Contributor
  • ***
  • Posts: 1528
  • Country: 00
Re: CUDA Autorouter
« Reply #20 on: April 27, 2015, 06:08:09 pm »
I hacked together a way to read/visualize the results so that's why you don't see that.
Maybe because to make a prototype from a PCB design I need something that looks usable to transfer to the copper layer, not an abstract random 1-pixel-wide path without source schematics?  :-DD

BTW: This is my 1 inch² Joule Thief circuit prototype for THT ;)
« Last Edit: April 27, 2015, 06:09:42 pm by eneuro »
12oV4dWZCAia7vXBzQzBF9wAt1U3JWZkpk
“Let the future tell the truth, and evaluate each one according to his work and accomplishments. The present is theirs; the future, for which I have really worked, is mine”  - Nikola Tesla
-||-|-
 

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3818
  • Country: de
Re: CUDA Autorouter
« Reply #21 on: April 27, 2015, 08:06:29 pm »
This is actually pretty similar to both of those searches (Dijkstra's is a type of A* as far as i'm aware). In A* you head toward the goal. The advantage of parallelization is you don't have to wait until you hit a block, you just explore every path at once. This shows it well:
http://nullprogram.com/blog/2014/06/22/


Actually A* is an improved version of Dijkstra's in the sense that it uses a heuristic to choose which path to explore first. In that way it achieves better average-case complexity, but the worst case is the same as Dijkstra's.

The problem with the approach in the blog post is that it is really very inefficient. You are doing a ton of calculation to achieve the same speed that A* achieves on a CPU - sequentially. The cleverness of A* (and Dijkstra's) is that it avoids exploring paths that can't possibly produce a shorter path, saving a lot of work. That is still well worth the occasional need to backtrack out of a dead end. Those algorithms are optimal for a reason ...

A more interesting algorithm to implement on a GPU could be Floyd-Warshall; there the parallelism could be put to good use if you use an adjacency matrix as the graph representation.

http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
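
Something along these lines - an untested sketch just to show where the parallelism goes. One thread handles one (i, j) pair, while the k loop has to stay sequential on the host because each step depends on the previous one:

Code:
#include <cuda_runtime.h>

// dist is an n x n adjacency matrix of edge weights, with a very large
// value (e.g. 1e30f) standing in for "no edge".
__global__ void fwStep(float* dist, int n, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n) return;

    float viaK = dist[i * n + k] + dist[k * n + j];
    if (viaK < dist[i * n + j])
        dist[i * n + j] = viaK;     // relax the i -> j path through pivot k
}

void floydWarshall(float* d_dist, int n)
{
    dim3 threads(16, 16);
    dim3 blocks((n + threads.x - 1) / threads.x,
                (n + threads.y - 1) / threads.y);
    for (int k = 0; k < n; ++k)     // sequential pivot loop
        fwStep<<<blocks, threads>>>(d_dist, n, k);
    cudaDeviceSynchronize();
}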



 

Offline kxenos

  • Frequent Contributor
  • **
  • Posts: 284
  • Country: gr
Re: CUDA Autorouter
« Reply #22 on: April 27, 2015, 11:07:28 pm »
I think your algorithm should find a solution for an undirected 2-dimensional weighted graph. You would find all the 2D undirected graphs of the problem and select the ones that fit in the number of layers you use. Then you would sort them according to each solution's average weight or total weight, or something in between - but I believe it's more complicated than that, because you have to account for rotating parts or changing their layer and position if they're not locked. I think you should take an existing open-source algorithm and adapt it to use the CUDA engine.
 

Offline gaddra (Topic starter)

  • Contributor
  • Posts: 14
  • Country: us
Re: CUDA Autorouter
« Reply #23 on: April 27, 2015, 11:15:43 pm »
Actually A* is an improved version of Dijkstra's in the sense that it uses a heuristic to choose which path to explore first. In that way it achieves better average-case complexity, but the worst case is the same as Dijkstra's.

The problem with the approach in the blog post is that it is really very inefficient. You are doing a ton of calculation to achieve the same speed that A* achieves on a CPU - sequentially. The cleverness of A* (and Dijkstra's) is that it avoids exploring paths that can't possibly produce a shorter path, saving a lot of work. That is still well worth the occasional need to backtrack out of a dead end. Those algorithms are optimal for a reason ...

A more interesting algorithm to implement on a GPU could be Floyd-Warshall; there the parallelism could be put to good use if you use an adjacency matrix as the graph representation.

http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
I'm failing to see how this would produce results at the same speed as a sequential algorithm. A* avoids paths that can't produce a shorter path, but it can still fail. Once it fails, it has to explore a new path (so this would be worse and worse for graphs with more unwalkable nodes). In a parallel situation, all paths are explored. I found a short paper on comparing A* on the CPU vs GPU:
http://www.idi.ntnu.no/~elster/tdt24/tdt24-f12/presentations/lars-espen-presentation.pdf

I'll look at Floyd-Warshall, thanks for the link.
 

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3818
  • Country: de
Re: CUDA Autorouter
« Reply #24 on: April 28, 2015, 01:29:36 pm »
I'm failing to see how this would produce results at the same speed as a sequential algorithm. A* avoids paths that can't produce a shorter path, but it can still fail. Once it fails, it has to explore a new path (so this would be worse and worse for graphs with more unwalkable nodes).

A* (and Dijkstra's) are guaranteed to find the shortest path, if it exists, so no failure is possible there. It may need to backtrack from a dead end; that is normal. On the other hand, the parallel code will explore tons of paths that A* would not even consider, discarding 99.9% of the data it has searched. And in really pathological cases (such as the "spiral" graph from that website) the parallel solver will not be better than A*/Dijkstra either, because there is only a single solution and both solvers have to search the entire map sequentially.

I am only going from the website you have linked - the author himself says that his parallel implementation is actually *slower* than sequential CPU A* and is only now approaching the speed of A*.



In a parallel situation, all paths are explored. I found a short paper on comparing A* on the CPU vs GPU:
http://www.idi.ntnu.no/~elster/tdt24/tdt24-f12/presentations/lars-espen-presentation.pdf

I'll look at Floyd-Warshall, thanks for the link.

Right, but that is A* both on the GPU and CPU. You didn't implement A* on the GPU, did you? You are just launching a set of "brute force" searches in parallel and hoping that one will actually reach the goal point.
 

