Author Topic: dependency manager, better loop detector  (Read 1447 times)

0 Members and 2 Guests are viewing this topic.

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
dependency manager, better loop detector
« on: June 21, 2024, 02:12:32 pm »
This is the second time I have to write something like this.
(part of a small packet manager)

Basically you have a list of libraries (written in C/myC), and each library may have dependency with other libraries in the list.
Code: [Select]
+ (0001) lib_1
    (0002) <lib_2> success
    (0003) <lib_3> success
+ (0002) lib_2
    (0003) <lib_3> success
+ (0003) lib_3
    (0004) <lib_4> success
+ (0004) lib_4

Code: [Select]
0001: | 0 1 1 0 |
0002: | 0 0 1 0 |
0003: | 0 0 0 1 |
0004: | 0 0 0 0 |
(here it is the dependency matrix, with the libraries on the rows, and the dependencies with other libraries on the columns)

The problem to be solved is a "simple" topographic ordering, which is  possible and sensible though, only if there are no loops!
So you need a loop detector!

Code: [Select]
traversed_deps[1, 2]={ 1 2 3 4 } step5 is_EOL1=0 is_stop=1 is_loop=0
traversed_deps[1, 3]={ 1 3 4 } step4 is_EOL1=0 is_stop=1 is_loop=0
traversed_deps[2, 3]={ 2 3 4 } step4 is_EOL1=0 is_stop=1 is_loop=0
traversed_deps[3, 4]={ 3 4 } step3 is_EOL1=0 is_stop=1 is_loop=0
loop test: success, no loops detected

Code: [Select]
you should complile your libraries in this order:
lib_4
lib_3
lib_2
lib_1

Simple example, where there are no problems!

-

In this example ... there are loops!

Code: [Select]
+ (0001) lib_1
    (0002) <lib_2> success
+ (0002) lib_2
    (0003) <lib_3> success
+ (0003) lib_3
    (0004) <lib_4> success
+ (0004) lib_4
    (0001) <lib_1> success
Code: [Select]
0001: | 0 1 0 0 |
0002: | 0 0 1 0 |
0003: | 0 0 0 1 |
0004: | 1 0 0 0 |
Code: [Select]
traversed_deps[1, 2]={ 1 2 3 4 } step5 is_EOL1=0 is_stop=0 is_loop=1
traversed_deps[2, 3]={ 2 3 4 1 } step5 is_EOL1=0 is_stop=0 is_loop=1
traversed_deps[3, 4]={ 3 4 1 2 } step5 is_EOL1=0 is_stop=0 is_loop=1
traversed_deps[4, 1]={ 4 1 2 3 } step5 is_EOL1=0 is_stop=0 is_loop=1
loop test: failure, several loops detected!

-

The problem is how to write the loop detector!

I have always written it with the "brute force" method, that is, it does not make any "algebraic consideration" or "algebraic calculation" on the dependency matrix as "matrix", on the contrary, for every single cell that it does not find equal to "0" (or False if you prefer, cells are boolean_t in my case), it checks that the path is not closed, that is, starting from X position .... you don't go back to that X position.

"brute force path test"

It works, but if the matrix is ​​1.000 x 1.000 (1.000 libraries in the list) ... there are ~40% 1.000.000 = 400.000 path-tests, which is not even remotely sub-optimal but a huge waste of cpu cycles!

So ...  :-//

-> doesn't someone who studies graphs, and these things in university, have a better idea?
-> are there any algebraic considerations that can be made?
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #1 on: June 21, 2024, 02:16:28 pm »
(
this algorithm and approach can also be used to plan the sequence of services to run when starting and shutting down a UNIX-like system!
What /sbin/init should execute, in what order, when what it should execute has dependencies on what comes before it (dependency)
)
« Last Edit: June 21, 2024, 02:42:18 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6827
  • Country: fi
    • My home page and email address
Re: dependency manager, better loop detector
« Reply #2 on: June 21, 2024, 03:33:47 pm »
Ah yes, the old dependency graph order problem.

Instead of tackling the problem all at once, I prefer the approach where at each iteration, you obtain the package (or a set of packages if you have a dependency loop) you need to build/install next.

You describe each package as a node, with its unfulfilled dependencies listed as a set of edges.  Whenever a package is installed, it and all edges leading to it will be removed from the working data set.

Normally, you pick the next one among the set of packages with no unfulfilled dependency edges.
If there are no such packages left, then you have a loop, and that loop is removed as one, as a set of packages that the user needs to build in tandem somehow –– or fix the dependency database.

While this obviously has O(N²) time complexity at best, you can make it sufficiently fast for even large N by keeping the node set sorted by the number of unfulfilled dependencies.  This way, you quickly resolve what to build/install next, even though doing the full ordering for every package can take significant time from a human perspective.  (I believe this is one of the cases where reducing the latency to get the next result in sequence is more important than minimizing the overall total time taken to compute the order for all packages.)

One way to resolve loops in practice is to split dependencies into required and optional edges.  The logic is the same as above for required edges only, except that as long as a node has optional unfulfilled dependencies left, it is not removed from the set (nor are the dependecies on it pruned out from other nodes).  In building from sources, this can lead to having to rebuild the same package, whenever a new optional dependency is fulfilled (but otherwise would have lead to a circular dependency).
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #3 on: June 21, 2024, 03:47:12 pm »
Yup, something like this  :D

Code: [Select]
0001: | 0 1 0 0 |
0002: | 0 0 1 0 |
0003: | 1 0 0 0 |
0004: | 0 0 0 0 |

Code: [Select]
traversed_deps[1, 2]={ 1 2 3 } step4 is_EOL1=0 is_stop=0 is_loop=1
traversed_deps[2, 3]={ 2 3 1 } step4 is_EOL1=0 is_stop=0 is_loop=1
traversed_deps[3, 1]={ 3 1 2 } step4 is_EOL1=0 is_stop=0 is_loop=1

Without the loop-detect, you can try to remove a dependency

Code: [Select]
step1: removing dep4 ... success
0001: | 0 1 0 * |
0002: | 0 0 1 * |
0003: | 1 0 0 * |
0004: | * * * * |
step2: failure

but then the loop prevents you from continuing.

Code: [Select]
    // is_loop = dep_matrix_cyclical_loop_detect(dep_matrix, item_n);
    is_loop=False; /* Let's try to ignore the loop_detector, to see what happens */
    if (is_loop isEqualTo False)
    {
        dep_topological_sort(dep_matrix, item_n, sequence);
    }

So in the end you understand that there is a loop also from the failure of the Topological sorting method  :o :o :o
« Last Edit: June 21, 2024, 04:25:01 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #4 on: June 21, 2024, 04:01:25 pm »
One way to resolve loops in practice is to split dependencies into required and optional edges.  The logic is the same as above for required edges only, except that as long as a node has optional unfulfilled dependencies left, it is not removed from the set (nor are the dependecies on it pruned out from other nodes).  In building from sources, this can lead to having to rebuild the same package, whenever a new optional dependency is fulfilled (but otherwise would have lead to a circular dependency).

Talking about building from sources, which is my case, since 2005, with Gentoo, there has been a way to "reduce" dependencies by e.g. reconfiguring use-flags ad-hoc to avoid loops; in practice you "emerge" libraries and tools with the smallest possible number of dependencies, or you deliberately try to avoid "certain" problematic dependencies (Python, Perl, ... LaTex-core, usually, but there are many others), and then, later, trying to somehow fix everything, since the next sync& system-update.

It works, but it always takes a long time (days, weeks, ...) to fix these things.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3878
  • Country: us
Re: dependency manager, better loop detector
« Reply #5 on: June 21, 2024, 08:33:19 pm »
gnu tsort is part of coreutils:

Code: [Select]
tsort << EOF
l1 l2
l1 l3
l2 l3
l3 l4
l4 l1
EOF
tsort: -: input contains a loop:
tsort: l1
tsort: l3
tsort: l4
l1
l2
l3
l4

It detects the loop and prints a loop.
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #6 on: June 22, 2024, 08:29:33 am »
CoreUtils/tsort looks interesting.
In my case I'm writing an integrated application that resolves dependencies, compiles and installs things, all in one, but surely tsort can be used in a bash script.
I'll look at the source code to see how it detects loops
« Last Edit: June 22, 2024, 11:58:13 am by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15234
  • Country: fr
Re: dependency manager, better loop detector
« Reply #7 on: June 22, 2024, 09:01:35 am »
A topo sort works well for this. https://en.wikipedia.org/wiki/Topological_sorting

I have implemented a variant of depth-first search for handling dependencies in a pool of threads (for distributing processing across multiple threads).
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #8 on: June 22, 2024, 12:03:01 pm »
I have implemented a variant of depth-first search for handling dependencies in a pool of threads (for distributing processing across multiple threads).

Do you have a loop-detector or have you modified the depth-first search algorithm so that it fails when it encounters a loop
(that's what I have currently done)
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #9 on: June 22, 2024, 12:10:11 pm »
Meanwhile, the dependency matrix has been replaced with a bitmap. This saves a lot of memory since the cells in the array were boolean_t , so 8 bits at most in the best case, whereas now it's 1 bit in a bitmap.

8 times less memory

Sure, managing a bitmap costs more CPU cycles than accessing the cells of an array, but not that much and it's worth it.

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #10 on: June 22, 2024, 01:13:47 pm »
-> are there any algebraic considerations that can be made?

So, this seems to be the answer to my second question about the loop-detector.
(from math-stackexchange, questions 513288)

In summary, yes, some algebraic considerations can be made, but it is not worth it compared to the empirical approach :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6827
  • Country: fi
    • My home page and email address
Re: dependency manager, better loop detector
« Reply #11 on: June 22, 2024, 04:33:09 pm »
Note that the matrix bitmap uses P*P/8 bytes of memory, whereas an array-based approach uses 4*(M+1)+4*P+4*D bytes, where P is the number of packages, D is the total number of dependencies (summed across all packages), and M is the maximum number of dependecies per package.

The matrix approach is good for small P, but inefficient for larger number of packages.
For example, for P=1024 packages, the bitmap matrix uses 131,072 bytes of memory.
For P=1024, D=16*P, and M=256, the array approach uses only 70660 bytes.

The array approach describes a set of M+1 pools, as in
    uint32_t  pool_size[M+1]; // number of entries in each pool
    uint32_t *pool_list[M+1]; // pool
where pool_list[p] consists of pool_size[p] groups of p+1 consecutive words.
The first word is the package ID, and the rest are the IDs of uninstalled packages it depends on.
Thus, pool 0 consists of packages without unfulfilled dependencies, and is basically a list of IDs for such packages.
Pool 1 consists of packages with a single unfulfilled dependency, and thus two words per entry; pool 2 has three words per entry; and so on, up to pool M which has M+1 words per entry.

The array-based approach uses a single array, with pool 0 starting at the beginning of the array, pool 1 at offset pool_size[0], pool 2 at offset pool_size[0]+2*pool_size[1], pool 3 at offset pool_size[0]+2*pool_size[1]+3*pool_size[2], and so on, with pool p at offset sum((k+1)*pool_size[k], k=0..p-1).  No dynamic memory allocation needed.  If you have a fixed-size array, and you don't know M beforehand, just put it at the top of the array growing down.

If the first pool is empty, but not all pools are empty, you have a dependency loop.  This representation is not very good for identifying the IDs involved in a loop, so you might wish to handle that using some other representation, optionally after remapping the ID's left to 1..N.
That is, this method will not tell you whether you have loops or not up front, nor will it easily tell you the ID's involved in a loop.  It will only tell you if there are no immediately installable packages, because there is a dependency loop.

To find the next installable package, you only need to grab it from pool 0.  (You usually pick the first one, decrementing pool_size[0], and moving the array contents one word towards smaller indexes.) You then process each of the other pools one at a time, to remove the ID from the unfulfilled dependecies in other packages.

You iterate through each pool once, starting with f=0.  When an entry has the ID as its dependency, swap that ID with the final dependency, and swap the entry with f, and increment f.  This way, the affected entries will be at the beginning of the pool.  Next, you move the first f entries to the previous pool by iterating k=1..f-1, inclusive, moving the entry (without the to-be-removed ID in the last word) k words towards smaller offsets.  Finally, you move the rest of the data in the array by f words to smaller offsets, increment the previous pool size by f, and decrement the current pool size by f.

(You can also maintain input index and output index, to postpone moving the data (but not swapping the entries), so that you don't move already moved data.  It is a bit trickier to implement, so I suggest implementing the above first.)

To extract entire loops, you'll use some other representation to find the IDs involved.  Applying the above scheme for each ID to-be-removed in turn will be sufficiently efficient; removing them all at once in a single loop over the entire array might be faster, but the code needed to do it robustly (especially since packages may jump several pools down) is much more complicated, and likely not worth the effort.
« Last Edit: June 22, 2024, 04:36:36 pm by Nominal Animal »
 
The following users thanked this post: DiTBho

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15234
  • Country: fr
Re: dependency manager, better loop detector
« Reply #12 on: June 23, 2024, 08:52:37 am »
I have implemented a variant of depth-first search for handling dependencies in a pool of threads (for distributing processing across multiple threads).

Do you have a loop-detector or have you modified the depth-first search algorithm so that it fails when it encounters a loop
(that's what I have currently done)

It does detect circular dependencies specifically, as is shown in the "Depth-first search" section of the Wikipedia article. You use a 'mark' and 'temporary mark' per node, and if the temp mark is set when visiting a node, then it's a circular dependency. So, it doesn't just "fail", but you know for sure there was a loop.
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #13 on: June 23, 2024, 01:32:48 pm »
So, it doesn't just "fail", but you know for sure there was a loop.

If it fails, that's the only reason: there is at least one loop  :D
but which one? And how many are there?  :o :o :o

It can stop here, or use a more refined analysis.

In Gentoo/Portage I always wanted to have this functionality: that is, knowing "clearly" which dependencies are in the loop.
Yes, the Portage somehow shows something like it, unfortunately... the information is often confused, and you have to go by trial and error, which then gets worse due to the fact that there are also slops (not considered as dependencies, but they are to all intents and purposes dependencies) to further complicate the scenario of alternative.

In the application I'm writing, if the dependencies are in a loop, I also need to know how to break the loop, possibly in the best possible way to take as little time as possible.

The exhaustive approach I was talking about (dep_matrix_cyclical_loop_detect()) uses and cycles through an "adjacency matrix", which consumes more memory than the list approach, and it also consumes more CPU cycles but it succeeds in answering this question.

If you notice the examples I brought above, they have a variable (boolean_t is_loop) that becomes True precisely if there are loops, so you also know which nodes the loop is made up of, and you also know how many it found for each iteration.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3878
  • Country: us
Re: dependency manager, better loop detector
« Reply #14 on: June 23, 2024, 05:35:30 pm »
So, it doesn't just "fail", but you know for sure there was a loop.

If it fails, that's the only reason: there is at least one loop  :D
but which one? And how many are there?  :o :o :o

It can stop here, or use a more refined analysis.

When the toposort fails, you have two elements of the loop and one edge-- the node you are visiting and the pending node it depends on.  From there you just need to use a shortest path algorithm to find the path from the pending node to the visiting node and you have your loop.

Finding all cycles generally requires a bit more specificity as there can be multiple interlocking cycles.  There are tons of graph algorithms built around various forms of cycle analysis.  A simple one is "strongly connected components".  This breaks a graph into nodes that can all teach each other.  All interlocking cycles get grouped together but non overlapping cycles are separate.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15234
  • Country: fr
Re: dependency manager, better loop detector
« Reply #15 on: June 23, 2024, 11:05:45 pm »
So, it doesn't just "fail", but you know for sure there was a loop.

If it fails, that's the only reason: there is at least one loop  :D
but which one? And how many are there?  :o :o :o

You get the node that is part of the first loop it finds. So you get the loop. You'll just need to traverse the graph again from this node until it gets back to it, to list the other nodes that are part of this loop.
Store the loop as a list, and continue:
If you want to list all disjoint loops, then once you're found a loop, you continue by visiting the next node that isn't part of an already found loop, and go on until all nodes have been visited.

Assuming finding *all* loops is useful, make sure you really want that as it's obivously going to be more costly.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6827
  • Country: fi
    • My home page and email address
Re: dependency manager, better loop detector
« Reply #16 on: June 24, 2024, 08:44:35 am »
The array approach is based on Kahn's algorithm, and the array does contain the packages sorted according to increasing number of unfulfilled dependencies.  It is the fact that finding a specific package by its ID is slow that makes it not very good for identifying packages in a loop.  It is easy to iterate over all packages, though.

Because of that, I think I personally would prefer a different representation (but same logic and algorithm), perhaps
     typedef  struct pkg_deps {
         pkg_deps   *next;    // Main list
         pkg_deps   *stack;   // Mutable stack list during searches
         uint32_t    id;      // Numeric identifier
         uint32_t    tag:8;   // Loop detection mutable tag
         uint32_t    deps:24; // Number of dependencies left in dep[]
         pkg_deps   *dep[];   // Pointers to pkg_deps
     } pkg_deps;

This is pointer-size word aligned, and can be stored in an array or pool of such words; there is no need to allocate each one dynamically.  The length does vary: the dep[] member is a C99 or later flexible array member.  I'm not sure how myC would represent the same structure, or whether it is possible, though.  The pointer to each such structure, or its location, will not change after it has been added, though.

You use two lists to maintain the graph: one with packages with deps==0, and the other with deps>0.

Overall, for P packages with D dependencies total, this uses 16*P+4*D+8 bytes on 32-bit architectures, and 24*P+8*D+16 bytes on 64-bit architectures.  For P=1024, D<28670 on 32-bit architectures this takes less memory than the bitmap dependency matrix, and D<13310 on 64-bit architectures.  That means if you have 1024 packages or more, and each package has fewer than 28 (32-bit) or 13 (64-bit) dependencies on average, this is more compact than the bitmap matrix. For P=2048, D<59390, so on 64-bit, with 2048 packages or more with less than 29 dependencies on average, this is more compact.  So, if you have a thousand or more packages, this is likely to be more compact than the bitmap dependency matrix is.

(I might replace id with pkg_desc *desc;, a pointer to the package description; and change tag and deps to uint32_t, reducing the cost of clearing tag but increasing the memory use a small bit.)

Applying Kahn's algorithm, you can pick any package from the deps==0 list to install.  When it is empty, but the other list is not, you have a loop.

The stack member is used during a search to form the last-in-first-out or LIFO stack of nodes to visit.  The handle to this list is kept in one or two global variables, or local variables in a loop-based iterator.

The tag member is used for the seen and visited flags.  You have two global variables, seen and visited, initialized to 253 and 254, respectively.  Before starting a new search, you increment both by two.  If visited==256 or 0 (the larger overflows or wraps), you reset them to 1 and 2, respectively, and traverse both lists to clear all tag members to zero.  This way, the cost of clearing the two flags for all packages is amortized among many (127) searches.

For a depth-first search, you use a LIFO stack, inserting all dep[] nodes neither seen nor visited yet to the beginning of the stack list, of course marking each seen.

For a breadth-first search, you use a FIFO stack, appending all dep[] nodes neither seen nor visited yet to the end of the stack list, marking each seen.  In addition to the handle to the stack list, you maintain a pointer to the final node in the stack list, to avoid having to traverse the stack list for appending.

When you pop off the next node in the stack list, you first need to check if a previous iteration has visited it already.  With the stack list, all searches start at a specific node, which you put as the only element to the stack, and tag it seen.  You then iterate over a loop, popping off a node from the stack.  If it has been visited already, you skip the iteration and continue with the next node off the stack.  You tag the node visited, and check all its dep[]s.  A visited dep[] indicates a loop, so a loop-finding iterator can just return that node.  A set-finding iterator adds the popped-off node to the result stack, and tags all non-visited and non-seen dep[]s seen and adds them to the stack.



Technically, a simple loop is not what we want to extract.

What we really want, is the (preferably smallest) interconnected set left in the graph: not just the packages in the smallest loop, but all their dependencies too.

We know that there are no immediately fulfillable packages left when the m=0 list is empty, but you can still have connected rings, like A-B-C depending on each other, D-E-F depending on each other, but C and D also depending on each other –– i.e. A:B B:C C:AD D:CE, E:F, F:A.

If you only report A-B-C or D-E-F, the user is still hosed, because the loop itself is not self-contained.  Because of this, we actually need a two-part process: one is to find the/a loop, and the other is to collect all the dependencies for the packages involved in the loop.

Using the structure referred to above, a DFS will detect a loop when any of the nodes in dep[] are already visited.  That particular node is then known to be part of a loop, and is the starting point for the search for the whole connected set.  Such a search has to be done for each package left, until a loop is found.

(If you consider the above letter example, we might have started from package G that depends only on one of the packages in the interconnected set.  However, because we detect the loop because something in the loop already depends on the particular node, we know it has to be part of a loop.)

The entire dependency tree is then traversed starting at that particular node.  The stack member then has double use: when popping off nodes from the stack when visiting them, they are added to the resulting interconnected set list.

Detaching the entire interconnected set from the graph is done normally, one package at a time.  The package itself is removed from its list, and all lists are traversed removing it from the dep[] lists (by swapping the pointer with the final one, and decrementing deps).  Whenever deps changes, the package is moved to a different list (unless deps >= M).

(Note that in the A-G example, G would not be considered part of the interconnected set, because nothing in the loop directly or indirectly depends on G.)

I have not verified this suggestion works, but I do believe it should.  Let me know if I have forgotten or mistaken a detail, though!
« Last Edit: June 24, 2024, 08:51:58 am by Nominal Animal »
 
The following users thanked this post: DiTBho

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6827
  • Country: fi
    • My home page and email address
Re: dependency manager, better loop detector
« Reply #17 on: June 24, 2024, 01:45:04 pm »
As a very crude Proof-of-Concept, the following C (C99 or later) program seems to handle interconnected loops.  I've only written it and tested with a handful of different inputs, so expect bugs in it.  Please don't mind the ugly code, it's just a POC.

It reads the specified input file twice, expecting format
    LWS* package-name [ WS+ depends-on ] LWS*
where LWS* refers to zero or more ASCII whitespace characters, and WS+ to one or more spaces or tabs.  Each package should be listed only once.
Code: [Select]
// SPDX-License-Identifier: CC0-1.0
// Nominal Animal, 2024
#define  _POSIX_C_SOURCE  200809L
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

typedef  struct pkg_desc  pkg_desc;
typedef  struct pkg_deps  pkg_deps;

struct pkg_deps {
    pkg_deps  *next;            // Linked list of nodes
    pkg_deps  *keep;            // Mutable local lists
    pkg_desc  *desc;            // Package description
    uint32_t   tag;             // Mutable search tag
    uint32_t   deps;            // Number of unfulfilled dependencies
    pkg_deps  *dep[];           // C99 flexible array member
};

struct pkg_desc {
    pkg_desc  *next;            // Linked list of nodes
    pkg_deps  *deps;            // Link to package dependency graph
    size_t     hash;            // DJB2 Xor hash of the name
    char       name[];          // C99 flexible array member
};

pkg_deps  *deps_none = NULL;    // Dependency nodes with deps == 0
pkg_deps  *deps_some = NULL;    // Dependency nodes with deps > 0
uint32_t   tag_seen    = 1;
uint32_t   tag_visited = 2;

static inline void new_tags(void) {
    tag_seen = tag_visited + 1;
    tag_visited = tag_seen + 1;
    if (tag_seen < 1 || tag_visited < 1) {
        tag_seen = 1;
        tag_visited = 2;
        for (pkg_deps *d = deps_none; d != NULL; d = d->next)
            d->tag = 0;
        for (pkg_deps *d = deps_some; d != NULL; d = d->next)
            d->tag = 0;
    }
}

static inline void remove_deps(pkg_deps *d) {
    pkg_deps **ref;
    pkg_deps *curr;

    if (!d)
        return;

    // Remove d from the no-dependencies list, if in there
    if (!d->deps && deps_none) {
        ref = &deps_none;
        curr = deps_none;

        while (curr) {
            if (curr == d) {
                curr = *ref = curr->next;
                d->next = NULL;
                break;
            } else {
                ref = &(curr->next);
                curr = curr->next;
            }
        }
    }

    // Remove d and dependencies on d from the some-dependencies list.
    ref = &deps_some;
    curr = deps_some;

    while (curr) {
        // d itself?
        if (curr == d) {
            curr = *ref = curr->next;
            d->next = NULL;
            continue;
        }

        // Dependencies on d
        size_t  n = curr->deps;
        size_t  i = 0;
        while (i < n) {
            if (curr->dep[i] == d) {
                curr->dep[i] = curr->dep[--n];
            } else {
                i++;
            }
        }
        curr->deps = n;

        // Dependencies left?
        if (n) {
            ref = &(curr->next);
            curr = curr->next;
            continue;
        }

        // No, move this to the no-dependency list.
        pkg_deps *temp = curr;
        curr = *ref = curr->next;
        temp->next = deps_none;
        deps_none = temp;
    }
}

typedef struct {
    pkg_deps *head;
    pkg_deps *tail;
} deps_fifo;
#define  DEPS_FIFO_INIT  { NULL, NULL }

static inline void deps_fifo_push(deps_fifo *fifo, pkg_deps *d) {
    d->keep = NULL;
    if (!fifo->tail) {
        fifo->head = d;
        fifo->tail = d;
    } else {
        fifo->tail->keep = d;
        fifo->tail = d;
    }
}

static inline pkg_deps *deps_fifo_pop(deps_fifo *fifo) {
    if (!fifo->head)
        return NULL;

    pkg_deps *d = fifo->head;
    fifo->head = d->keep;
    if (fifo->tail == d)
        fifo->tail = NULL;

    d->keep = NULL;
    return d;
}

// Garbles keep and tag.
static pkg_deps *find_loop(pkg_deps *root) {
    deps_fifo  stack = DEPS_FIFO_INIT;

    if (!root) {
        errno = EINVAL;
        return NULL;
    }

    // A new search, thus new visited and seen tags.
    new_tags();

    // Stack contains only the root element.
    root->tag = tag_seen;
    deps_fifo_push(&stack, root);

    while (1) {
        // Pop d off the stack.
        pkg_deps *d = deps_fifo_pop(&stack);
        if (!d) {
            errno = 0; // No loop found
            return NULL;
        }

        // If already visited, ignore it.
        if (d->tag == tag_visited)
            continue;

        // Mark it visited.
        d->tag = tag_visited;

        // Push all dependencies not seen yet to the stack.
        for (size_t i = 0; i < d->deps; i++) {
            if (d->dep[i]->tag == tag_visited)
                return d;   // Loop detected at this node
            if (d->dep[i]->tag == tag_seen)
                continue;

            d->dep[i]->tag = tag_seen;
            deps_fifo_push(&stack, d->dep[i]);
        }
    }
}

static pkg_deps *find_a_loop(void) {
    for (pkg_deps *d = deps_some; d != NULL; d = d->next) {
        pkg_deps *result = find_loop(d);
        if (result)
            return result;
    }
    // No loops found.
    errno = 0;
    return NULL;
}

static pkg_deps *interdependent(pkg_deps *root) {
    pkg_deps  *seen = NULL;
    pkg_deps  *keep = NULL;

    if (!root) {
        errno = ENOENT;
        return NULL;
    }

    new_tags();

    root->tag = tag_seen;
    root->keep = seen;
    seen = root;

    while (seen) {
        pkg_deps *next = seen->keep;

        // Prepend this to keep list
        seen->keep = keep;
        keep = seen;

        // Advance seen to the next
        seen = next;

        // Add dependencies to seen.
        for (size_t i = 0; i < keep->deps; i++) {
            if (keep->dep[i]->tag != tag_seen && keep->dep[i]->tag != tag_visited) {
                keep->dep[i]->tag = tag_seen;
                keep->dep[i]->keep = seen;
                seen = keep->dep[i];
            }
        }
    }

    return keep;
}

// Package descriptions

size_t     pkg_total = 0;       // Number of known packages
size_t     pkg_size  = 0;       // Size of the hash table
pkg_desc **pkg_hash  = NULL;    // Hash table

#define  PKG_HASH_MIN   255     // Initial number of hash table entries
#define  PKG_HASH_MAX   131071  // Maximum number of hash table entries
#define  PKG_HASH_LEN   16      // Preferred maximum chain length

static void  pkg_hash_resize(size_t new_size) {
    pkg_desc **new_hash = calloc(new_size, sizeof (pkg_desc *));
    if (!new_hash) {
        fprintf(stderr, "pkg_hash_resize(): Out of memory.\n");
        exit(EXIT_FAILURE);
    }

    for (size_t i = 0; i < pkg_size; i++) {
        pkg_desc *next = pkg_hash[i];

        while (next) {
            pkg_desc *curr = next;
            next = next->next;

            size_t  k = curr->hash % new_size;
            curr->next = new_hash[k];
            new_hash[k] = curr;
        }
    }

    free(pkg_hash);
    pkg_hash = new_hash;
    pkg_size = new_size;
}

pkg_desc *new_package(const char *name, size_t deps) {
    // Package name cannot be empty or NULL.
    if (!name || !*name) {
        errno = EINVAL;
        return NULL;
    }

    // Keep this in sync with find_package()!
    size_t  name_len = 0;
    size_t  name_hash = 5381;
    while (name[name_len])
        name_hash = (name_hash * 33) ^ (unsigned char)(name[name_len++]);

    // Check if already known
    if (pkg_size > 0)
        for (pkg_desc *p = pkg_hash[name_hash % pkg_size]; p != NULL; p = p->next)
            if (p->hash == name_hash && !strcmp(p->name, name)) {
                errno = EEXIST;
                return NULL;
            }

    // Allocate and clear to all zeros
    const size_t  desclen = sizeof (pkg_desc) + name_len + 1;
    pkg_desc     *desc = malloc(desclen);
    if (!desc) {
        fprintf(stderr, "new_package(): Out of memory.\n");
        exit(EXIT_FAILURE);
    }
    memset(desc, 0, desclen);

    // Initialize
    memcpy(desc->name, name, name_len + 1);
    desc->hash = name_hash;

    // Check if hash table needs resizing.
    if (!pkg_size) {
        // Initial hash table size is 255 entries
        pkg_hash_resize(255);
    } else
    if (pkg_size < PKG_HASH_MAX) {
        while (1) {
            size_t    n = 0;
            pkg_desc *p = pkg_hash[name_hash % pkg_size];
            while (p) {
                p = p->next;
                n++;
            }

            if (n < PKG_HASH_LEN)
                break;

            size_t  new_size = (pkg_size + 255) * 3 / 2;
            if (new_size > PKG_HASH_MAX) {
                pkg_hash_resize(PKG_HASH_MAX);
                break;
            }

            pkg_hash_resize(new_size);
        }
    }

    // Prepend
    const size_t  i = name_hash % pkg_size;
    desc->next = pkg_hash[i];
    pkg_hash[i] = desc;

    pkg_total++;

    return desc;
}

pkg_desc *find_package(const char *name) {
    if (!name || !*name) {
        errno = EINVAL;
        return NULL;
    } else
    if (!pkg_size) {
        errno = ENOENT;
        return NULL;
    }

    // Keep this in sync with new_package()!
    size_t  name_len = 0;
    size_t  name_hash = 5381;
    while (name[name_len])
        name_hash = (name_hash * 33) ^ (unsigned char)(name[name_len++]);

    for (pkg_desc *p = pkg_hash[name_hash % pkg_size]; p != NULL; p = p->next)
        if (p->hash == name_hash && !strcmp(p->name, name))
            return p;

    errno = ENOENT;
    return NULL;
}

// DJB2 Xor hash
static inline size_t hash(const char *buf) {
    size_t  result = 5381;
    if (!buf)
        return 0;

    while (*buf)
        result = (result * 33) ^ (unsigned char)(*(buf++));

    return result;
}

int tokenize(char *buf, size_t len) {
    char *      i = buf;
    char *const iend = buf + len;
    char *      o = buf;

    // Skip leading whitespace
    while (i < iend && (*i == '\0' || *i == '\t' || *i == '\n' || *i == '\v' || *i == '\f' || *i == '\r' || *i == ' ')) i++;

    // Nothing?
    if (i >= iend)
        return 0;

    // We have at least one token.
    int  tokens = 0;
    while (i < iend) {
        tokens++;

        while (i < iend && (*i != '\0' && *i != '\t' && *i != '\n' && *i != '\v' && *i != '\f' && *i != '\r' && *i != ' '))
            *(o++) = *(i++);

        while (i < iend && (*i == '\0' || *i == '\t' || *i == '\n' || *i == '\v' || *i == '\f' || *i == '\r' || *i == ' ')) i++;

        *(o++) = '\0';
    }

    // clear out the rest of the buffer.
    if (o < iend)
        memset(o, 0, (size_t)(iend - o));

    return tokens;
}

static void dot(FILE *out) {
    if (!out || ferror(out))
        return;

    fprintf(out, "digraph {\n");
    fprintf(out, "    rankdir = LR;\n");

    // List all nodes first.  Note: We should escape " and \ characters.
    for (pkg_deps *d = deps_none; d != NULL; d = d->next)
        fprintf(out, "    \"%p\" [ label=\"%s\" ];\n", (void *)d, d->desc->name);
    for (pkg_deps *d = deps_some; d != NULL; d = d->next)
        fprintf(out, "    \"%p\" [ label=\"%s\" ];\n", (void *)d, d->desc->name);

    // List all edges next
    for (pkg_deps *d = deps_some; d != NULL; d = d->next)
        for (size_t i = 0; i < d->deps; i++)
            fprintf(out, "    \"%p\" -> \"%p\";\n", (void *)d, (void *)(d->dep[i]));

    fprintf(out, "}\n");
}

int main(int argc, char *argv[]) {
    if (argc != 2 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        const char *arg0 = (argc > 0 && argv && argv[0] && argv[0][0]) ? argv[0] : "(this)";
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", arg0);
        fprintf(stderr, "       %s FILENAME\n", arg0);
        fprintf(stderr, "\n");
        fprintf(stderr, "This reads the specified file (twice).\n");
        fprintf(stderr, "Each line consists of a name and its requisites,\n");
        fprintf(stderr, "    NAME  [ REQUIRES ]*\n");
        fprintf(stderr, "separated by spaces or tabs.\n");
        fprintf(stderr, "\n");
        return (argc == 1 || argc == 2) ? EXIT_SUCCESS : EXIT_FAILURE;
    }

    FILE *in = fopen(argv[1], "re");
    if (!in) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        exit(EXIT_FAILURE);
    }
    if (fseek(in, 0, SEEK_SET) == -1) {
        fprintf(stderr, "%s: Not a seekable file: %s.\n", argv[1], strerror(errno));
        exit(EXIT_FAILURE);
    }

    char   *line_buf = NULL;
    size_t  line_max = 0;
    while (1) {
        ssize_t  line_len = getline(&line_buf, &line_max, in);
        if (line_len < 1)
            break;

        int  fields = tokenize(line_buf, line_len);
        if (fields < 1)
            continue;   // Ignore empty lines.

        pkg_desc *desc = new_package(line_buf, fields - 1);
        if (!desc) {
            if (errno == EEXIST)
                fprintf(stderr, "%s: %s: Duplicate package name.\n", argv[1], line_buf);
            else
                fprintf(stderr, "%s: %s: %s.\n", argv[1], line_buf, strerror(errno));
            exit(EXIT_FAILURE);
        }

        pkg_deps *deps = malloc(sizeof (pkg_deps) + (size_t)(fields - 1) * sizeof (pkg_deps *));
        if (!deps) {
            fprintf(stderr, "%s: Out of memory for package %s.\n", argv[1], line_buf);
            exit(EXIT_FAILURE);
        }

        deps->next = NULL;
        deps->keep = NULL;
        deps->desc = desc;
        deps->tag  = 0;
        deps->deps = fields - 1;

        size_t  i = (size_t)(fields - 1);
        while (i-->0) deps->dep[i] = NULL;

        desc->deps = deps;

        // Prepend to the proper list
        if (deps->deps) {
            deps->next = deps_some;
            deps_some  = deps;
        } else {
            deps->next = deps_none;
            deps_none  = deps;
        }
    }
    if (ferror(in) || !feof(in)) {
        fprintf(stderr, "%s: Read error.\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    if (pkg_total < 1) {
        fprintf(stderr, "%s: No packages listed.\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    if (fseek(in, 0, SEEK_SET) == -1) {
        fprintf(stderr, "%s: Not a seekable file!\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    while (1) {
        ssize_t  line_len = getline(&line_buf, &line_max, in);
        if (line_len < 1)
            break;

        int  fields = tokenize(line_buf, line_len);
        if (fields < 2)
            continue;   // Ignore empty lines and packages without dependencies.

        --fields;

        pkg_desc *desc = find_package(line_buf);
        if (!desc) {
            fprintf(stderr, "BUG: Lost package %s.\n", line_buf);
            exit(EXIT_FAILURE);
        }
        pkg_deps *deps = desc->deps;
        if (!deps) {
            fprintf(stderr, "BUG: Lost package %s dependencies.\n", line_buf);
            exit(EXIT_FAILURE);
        }

        const char *name = line_buf;
        size_t      i = 0;
        while (fields-->0) {
            // Skip to start of next name
            while (*name)
                name++;
            name++;

            // Find the named package
            pkg_desc *p = find_package(name);
            if (!p) {
                fprintf(stderr, "%s: Unknown package %s as a dependency for package %s.\n", argv[1], name, desc->name);
                exit(EXIT_FAILURE);
            }

            // Self-dependency is a bug
            if (p == desc) {
                fprintf(stderr, "%s: Package %s depends on itself.\n", argv[1], desc->name);
                exit(EXIT_FAILURE);
            }

            pkg_deps *d = p->deps;
            if (!d) {
                fprintf(stderr, "BUG: Lost dependency graph for package %s.\n", desc->name);
                exit(EXIT_FAILURE);
            }

            // If already listed, ignore
            {   size_t  k = 0;
                for (k = 0; k < i; k++)
                    if (deps->dep[k] == d)
                        break;
                if (k < i)
                    continue;
            }

            // Append
            deps->dep[i++] = d;
        }

        deps->deps = i;

        // A paranoid check
        if (!i) {
            fprintf(stderr, "BUG: Somehow lost all dependencies for package %s.\n", desc->name);
            exit(EXIT_FAILURE);
        }
    }
    if (ferror(in) || !feof(in)) {
        fprintf(stderr, "%s: Read error.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (fclose(in)) {
        fprintf(stderr, "%s: Error closing file: %s.\n", argv[1], strerror(errno));
        return EXIT_FAILURE;
    }
    free(line_buf);
    line_buf = NULL;
    line_max = 0;

    // For now, print the constructed graph in Graphviz DOT language.
    dot(stdout);

    // Pop off packages or interdependent groups of packages, one by one.
    while (deps_none || deps_some) {
        if (deps_none) {
            pkg_deps *d = deps_none;
            remove_deps(d);
            printf("// Package:  %s\n", d->desc->name);
        } else {
            pkg_deps *d = find_a_loop();
            if (!d) {
                fprintf(stderr, "BUG: There is a loop, but I couldn't find it.\n");
                exit(EXIT_FAILURE);
            }
            pkg_deps *k = interdependent(d);
            if (!k) {
                fprintf(stderr, "BUG: There is a loop containing package %s, but the interdependent set escaped me.\n", d->desc->name);
                exit(EXIT_FAILURE);
            }
            printf("// Packages:");
            while (k) {
                printf(" %s", k->desc->name);
                remove_deps(k);
                k = k->keep;
            }
            printf("\n");
        }
    }

    return EXIT_SUCCESS;
}

The output consists of a Graphviz DOT language directed graph of the package dependencies, followed by comment lines (starting with // that Graphviz tools will ignore) of format
    // Package:  package-name
for immediately-installable packages, and
    // Packages: package1 package2 ... packageN
for interconnected sets that must be installed concurrently, as they depend on each other.

When given input
Code: [Select]
A   B
B   C
C   A D
D   E C
E   F
F   D
G   A D
H   F G
J   K
K

which contains two loops, A-B-C-A and D-E-F-D, connected via C-D-C, a package G depending on A and D, package H depending on F and G, package J depending on K, and K being the only immediately installable package, the commented lines part of the output is
Code: [Select]
// Package:  K
// Package:  J
// Packages: B A C D F E
// Package:  G
// Package:  H

as one would expect: K can be immediately installed, then J.  The interdependent set A B C D E F must be installed next.  Finally, H can be installed, and then G.

The memory use is not horrible, no recursive functions (as the search tags and stacks are inherent in the data structure), and it seems to untangle anything I threw at it (only a handful!).  When there are multiple interdependent sets, it returns them in a semi-random order.

If you want to limit the dynamic memory allocations at run time, you need to do three passes over the input.
On the first pass, you count the number of packages P, the number of characters N in each package name (including the terminating NUL character, rounded up to the native word size for each name), and the number of edges/dependencies total D.
On a 32-bit architecture, you need to allocate a total of 12*P+N bytes for the pkg_descs, a total of 20*P+4*D bytes for the pkg_deps, and an array of pointers to pkg_descs for the hash table to speed up the package name lookups.
On the second pass, you construct the pkg_descs, and record the number of dependencies each package has, like the current code does on the first pass.
On the third pass, you fill in the pkg_deps structures by looking up the pointer to the dependencies of each named package.

As the code does no recursive function calls –– the search stacks and node tagging inherent in the pkg_deps structures ––, the memory usage is very well controlled.

On a 32-bit architecture, with a 1024-entry hash table, 1024 packages that have on average 22-character names, if each package has less than 11 dependencies on average, all of this fits in less memory than the dependency bit matrix uses; for 2048 packages, if less than 43 dependencies on average.

Of course, one can use a sparse bitmap to represent the dependency bit matrix, too.  My point is that even though this uses a pointer-based graph, its memory use is very well defined and controlled, and being more efficient than a plain bitmap adjacency matrix for thousand or more packages or so, gives a good intuitive picture of its memory demands.
« Last Edit: June 24, 2024, 01:54:41 pm by Nominal Animal »
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #18 on: June 25, 2024, 06:46:34 am »
Assuming finding *all* loops is useful

they're useful for what Nominal Animal brilliantly exemplified in its "interdependent" example: things you need to plan for, and plan carefully if you're on a slow machine(1), in order to resolve and move forward.

edit
(1) e.g. { rb532(mips32r2@400Mhz), RSP(mips32r2@600Mhz), ... }
« Last Edit: June 25, 2024, 07:12:14 am by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #19 on: June 25, 2024, 07:11:11 am »
I wrote myC so in the form of application that relies on different modules provided by different libraries

Code: [Select]
lib_file_utils_v2
lib_filename_filter_v0.3
lib_filename_utils_v11
lib_lexer_reserved_word_language_mycc_v7b
lib_lexer_v7b
lib_lexer_tokenize_v7b
lib_mycc1_parser_v3
lib_mycc1_machine_level_mips32r2_v1
lib_mycc1_machine_level_mips5pp_v2
...
lib_basen2_v3
lib_debug_v2
lib_mem_xstatic_v2
lib_mystring_v2
lib_bitmap_matrix_v5
lib_string_fuzzy_match_v1

So I develop all the tools and applications, enriching the libraries more and more, they all evolve and can bell all reused for new tools and applications.

There are around 1000 of them and they are starting to suffer from interdependence problems, which, if until now I managed manually because before there were many fewer libraries (~100) and it was possible to do it with pen and paper in a couple of minutes. Now it becomes unmanageable without wasting a couple of days!

This is pointer-size word aligned, and can be stored in an array or pool of such words; there is no need to allocate each one dynamically.  The length does vary: the dep[] member is a C99 or later flexible array member.  I'm not sure how myC would represent the same structure, or whether it is possible, though.  The pointer to each such structure, or its location, will not change after it has been added, though.

flexible array members are not yet supported in myC because the underlying monad of the array class doesn't know exactly how to react when seeing a dynamic sizing. Indeed, as the monad is now(1), if it sees that the size-changes, it is strongly tempted (default beahvior to prevent out-by-one) to raise a panic (lib_debug2), spit out the call-trace history up to the point, and plant everything in an ICE-break, so you are forced to open the debugger and see exactly where it crashed.

it's not a problem, I just have to think about it for a moment(1), or write a temporary thing in gcc  :-+

(1) can be extended, but ... if in theory it could be done, never done, and I suspect it won't work on the first try
« Last Edit: June 25, 2024, 07:16:00 am by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6827
  • Country: fi
    • My home page and email address
Re: dependency manager, better loop detector
« Reply #20 on: June 25, 2024, 09:47:31 am »
flexible array members are not yet supported in myC because the underlying monad of the array class doesn't know exactly how to react when seeing a dynamic sizing.
That's not a problem, as you don't need it if you use array-based storage, and replace pointers with offsets/indexes in the target array.  The target array is always a specific, known one, and doesn't need to be identified.

This kind of structures-in-arrays is very common in Fortran code.  Essentially, you replace each pointer with an index or offset in the array it resides in.

Let's say you have four dynamically allocated arrays:
    char  *text;  // Package name strings, NUL-separated
    word  *desc;  // Package descriptors
    word  *hash;  // Hash table for package name lookups
    word  *deps;  // Package dependencies

Here, word denotes an integer type of native size.  You also have two initial offsets to the deps[],
    word   deps_none;  // First package with no unfulfilled dependencies
    word   deps_some;  // First package with some unfulfilled dependencies

You'd also probably want to record the maximum valid offset to each of the arrays, noting that the offset is always to the start of each entry.

Each package descriptor in desc consists of four consecutive words:
    next    Offset of the next package descriptor in the same hash table entry
    hash    DJB2 Xor hash of the package name
    deps    Offset of the dependencies in the deps[] array
    name    Offset of the package name in the text[] array

The hash table in hash contains either the offset (in desc) to a package with corresponding hash, or NONE (WORD_MAX if unsigned, WORD_MIN if signed) to indicate empty slot.  Package names hashing to the same slot form a singly-linked list.

The interesting bit is the deps[] array, where each entry is 5 or more words long:
    next    Offset of the next package, in deps[] array
    keep    Offset of the next package in any internal stack or list
    desc    Offset of the package descriptor in desc[] array
    tag     Traversal tag, preferably unsigned
    deps    Number of dependencies that follow
  [ dep0 ]  Offset of the first dependency, in deps[] array
  [    : ] 

When you delete a dep, you decrement deps by one, and swap the deleted offset and the one at offset deps relative to dep0.
This way, nothing is deleted, and if you want to be able to restore the graph to its original state afterwards, just add a word (actual_deps) to each entry containing the original number of dependencies.  Then, the restoration involves going through the deps[] array, chaining next to either deps_none or deps_some, clearing keep to zero, setting deps=actual_deps, and advancing to the next structure by incrementing the offset by 6+actual_deps.  It is a linear pass, with the same cost as clearing all tags (which one might have to do in some corner cases).

You never move the entries, they always stay put, to keep the indexes/offsets working; just like the structures never move after they've been allocated in the pointer version I showed.

For maintainability and robustness, I'd use helper accessor functions, but as you can see, the code itself changes very, very little (compared to the "real" pointer-based version).  The tags are easier to maintain if unsigned, but all the other fields can be either signed or unsigned.  Because your word size is at least 4 bytes, it means the nonnegative values a signed native word can represent cover all possible offsets/indexes in an array that fits in memory.

If you use helper accessor functions, you can add very simple checks to verify the offset or index is within the valid range.  This makes the approach quite memory-safe, too.

If you run out of memory, note that only deps is heavily used during traversal, the other arrays could be file-based at that point.

However, on 32-bit, we're still only at 40*P+4*D+N bytes of RAM use, where P is the number of packages, D is the total number of dependencies, and N is the total number of bytes needed for the package names including their terminating NUL bytes, assuming you use a hash table with P slots.  (It is a good guess for the size, because collisions are in linked lists.)

For 10,000 packages with 10 dependencies each on average and 23-character package names, we're only at around one megabyte.  Memory use is linear, so ten megabytes will suffice for about 100,000 packages.  Debian and Ubuntu have around 60,000 in the main repos, for comparison.

To read the data with only four dynamic memory allocations, you need three passes:
  • pass: Discover the total number of packages P, total number of dependencies D, and the number of characters in package names N including terminating NULs.

    After this pass, you allocate all arrays.

    Note that the number of characters in package names is easy to calculate only if you have format like mine, with all dependencies a package has listed in one go.  If you have package - depends-on pairs, it becomes much, much harder, and has to be done dynamically.
     
  • pass: Fill in desc[], text[], and deps[] including the dependency count but excluding the actual dependencies.

    After or during this pass, you construct the hash table, for fast package name to desc to deps lookup.
     
  • pass: Fill in deps[] by looking up each dependency in the hash table.
I definitely like the option of generating the dependency graph in Graphviz DOT format, as that helped me verify my code generated the graph I expected it to, and it is much more intuitive than any list form.  I suggest you consider the same, at least as a debugging tool.
« Last Edit: June 25, 2024, 09:50:33 am by Nominal Animal »
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #21 on: July 26, 2024, 11:01:30 pm »
Code: [Select]
bitmap: 131072 byte, 32768 units of 32 bit,
bitmap: 131072 byte, 32768 units of 32 bit,
fifo_dep_data_clean
fifo_dep_init
stack_dep_data_clean
stack_dep_init
matrix(11 x 11)
using not_recursive method
+ dep0000 as lib_none (internal use)
+ dep0001 as lib_A
    looking for dep0002 aka lib_B ... success (1-2)
+ dep0002 as lib_B
    looking for dep0003 aka lib_C ... success (2-3)
+ dep0003 as lib_C
    looking for dep0001 aka lib_A ... success (3-1)
    looking for dep0004 aka lib_D ... success (3-4)
+ dep0004 as lib_D
    looking for dep0005 aka lib_E ... success (4-5)
    looking for dep0003 aka lib_C ... success (4-3)
+ dep0005 as lib_E
    looking for dep0006 aka lib_F ... success (5-6)
+ dep0006 as lib_F
    looking for dep0004 aka lib_D ... success (6-4)
+ dep0007 as lib_G
    looking for dep0001 aka lib_A ... success (7-1)
    looking for dep0004 aka lib_D ... success (7-4)
+ dep0008 as lib_H
    looking for dep0006 aka lib_F ... success (8-6)
    looking for dep0007 aka lib_G ... success (8-7)
+ dep0009 as lib_J
    looking for dep0010 aka lib_K ... success (9-10)
+ dep0010 as lib_K
------- adjacency matrix -------
      |                     |
0001: | 0 1 0 0 0 0 0 0 0 0 |
0002: | 0 0 1 0 0 0 0 0 0 0 |
0003: | 1 0 0 1 0 0 0 0 0 0 |
0004: | 0 0 1 0 1 0 0 0 0 0 |
0005: | 0 0 0 0 0 1 0 0 0 0 |
0006: | 0 0 0 1 0 0 0 0 0 0 |
0007: | 1 0 0 1 0 0 0 0 0 0 |
0008: | 0 0 0 0 0 1 1 0 0 0 |
0009: | 0 0 0 0 0 0 0 0 0 1 |
0010: | 0 0 0 0 0 0 0 0 0 0 |
step1: removing lib_K(dep10) ...
step2: removing lib_J(dep9) ...
loop detected, solving  ...
((9,10))
  (9,10)-> solution
((8,7))
  (8,7)-> ...
    (7,1)-> ...
      (1,2)-> ...
        (2,3)-> ...
          (3,1)-> ...
            (1,2)-> djvu!
          (3,4)-> ...
            (4,3)-> ...
              (3,1)-> djvu!
              (3,4)-> djvu!
            (4,5)-> ...
              (5,6)-> ...
                (6,4)-> ...
                  (4,3)-> djvu!
                  (4,5)-> djvu!
    (7,4)-> ...
      (4,3)-> djvu!
      (4,5)-> djvu!
((8,6))
  (8,6)-> ...
    (6,4)-> djvu!
((7,4))
  (7,4)-> djvu!
((7,1))
  (7,1)-> djvu!
((6,4))
  (6,4)-> djvu!
((5,6))
  (5,6)-> djvu!
((4,3))
  (4,3)-> djvu!
((4,5))
  (4,5)-> djvu!
((3,4))
  (3,4)-> djvu!
((3,1))
  (3,1)-> djvu!
((2,3))
  (2,3)-> djvu!
((1,2))
  (1,2)-> djvu!
------- djvu -------
      | + + + + + +         |
0001: | 0 1 0 0 0 0 0 0 0 0 |
0002: | 0 0 1 0 0 0 0 0 0 0 |
0003: | 1 0 0 1 0 0 0 0 0 0 |
0004: | 0 0 1 0 1 0 0 0 0 0 |
0005: | 0 0 0 0 0 1 0 0 0 0 |
0006: | 0 0 0 1 0 0 0 0 0 0 |
0007: | 1 0 0 1 0 0 0 0 0 0 |
0008: | 0 0 0 0 0 1 1 0 0 0 |
0009: | 0 0 0 0 0 0 0 0 0 0 |
0010: | 0 0 0 0 0 0 0 0 0 0 |
step3: removing { lib_A(dep1) lib_B(dep2) lib_C(dep3) lib_D(dep4) lib_E(dep5) lib_F(dep6) } ...
step4: removing lib_G(dep7) ...
step5: removing lib_H(dep8) ...

Nominal Animal's testcase  :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 
The following users thanked this post: Nominal Animal

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #22 on: July 26, 2024, 11:04:31 pm »
Code: [Select]
       is_ok = matrix_topological_ordering(p_dep_man);
       if (is_ok isEqualTo False)
       {
           printf("loop detected, solving  ...\n");
           matrix_loop_solve(p_dep_man);
           is_ok = matrix_topological_ordering(p_dep_man);
        }
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6827
  • Country: fi
    • My home page and email address
Re: dependency manager, better loop detector
« Reply #23 on: July 27, 2024, 12:02:54 am »
Nice to see you got the matrix-based approach to work for that extremely pathological case!  :-+

It is always interesting to find multiple ways of solving the same or similar problems, because that's more tools and options in ones toolbox.
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: dependency manager, better loop detector
« Reply #24 on: July 27, 2024, 12:38:28 am »
Code: [Select]
private boolean_t dep_man_loop_detect_recursive
(
    p_dep_man_t p_dep_man,
    p_fifo_dep_data_t p_data_in
)
{
    boolean_t         ans;
    p_fifo_dep_data_t p_data_next;
    fifo_dep_data_t   data_next;
    p_bitmatrix_t     p_matrix0;
    p_bitmatrix_t     p_matrix1;
    uint32_t          i0; /* icol */
    uint32_t          i1; /* irow */
    uint32_t          iblock;
    boolean_t         cell;
    uint32_t          item_n;
    boolean_t         is_djvu;
    boolean_t         is_loop;
    boolean_t         is_at_least_one;
    boolean_t         is_special_path;

    p_matrix0 = p_dep_man->context.matrix.adjacency;
    p_matrix1 = p_dep_man->context.matrix.crossedge;

    item_n      = p_matrix0->row_n;
    p_data_next = get_address(data_next);

    i1      = p_data_in->lib_id; /* curr */
    i0      = p_data_in->dep_id; /* curr */
    iblock  = bitmatrix_iblock_get(p_matrix1, i1, i0);
    is_djvu = bitmatrix_cell_get(p_matrix1, iblock);
    bitmatrix_cell_set(p_matrix1, iblock);

    dep_man_level_inc(p_dep_man);
    dep_man_level_show(p_dep_man);
    printf("(%lu,%lu)-> ", i1, i0);

    i1 = p_data_in->dep_id; /* curr to be inspected */

    is_at_least_one = False;
    is_special_path = False;

    if (is_djvu)
    {
        printf("djvu!");
        printf("\n");

        iblock = bitmatrix_iblock_get(p_matrix1, 0, p_data_in->dep_id);
        cell   = bitmatrix_cell_get(p_matrix1, iblock);
        if (cell isEqualTo False)
        {
            p_dep_man->context.item.djvu.n++;
        }

        matrix_dep_del(p_matrix1, p_data_in->dep_id);
    }
    else
    {
        for (i0 = 1; i0 < item_n; i0++)
        {
            iblock = bitmatrix_iblock_get(p_matrix0, i1, i0);
            cell   = bitmatrix_cell_get(p_matrix0, iblock);

            if (cell)
            {
                p_data_next->lib_id = p_data_in->dep_id; /* this line is right */
                p_data_next->dep_id = i0;

                if (is_at_least_one isEqualTo False)
                {
                    printf("... \n");
                }

                is_loop = dep_man_loop_detect_recursive(p_dep_man, p_data_next);

                if (is_loop isEqualTo False)
                {
                    dep_man_level_show(p_dep_man);
                    printf("clean (%lu,%lu) ", p_data_in->lib_id, p_data_in->dep_id);
                    printf("\n");

                    iblock = bitmatrix_iblock_get(p_matrix1, p_data_in->lib_id, p_data_in->dep_id);
                    bitmatrix_cell_clr(p_matrix1, iblock);
                }

                is_at_least_one = (is_at_least_one logicalOr cell);
                is_special_path = (is_special_path logicalOr is_loop);
            }
        }

        if (is_at_least_one isEqualTo False)
        {
            i1     = p_data_in->lib_id;
            i0     = p_data_in->dep_id;
            iblock = bitmatrix_iblock_get(p_matrix1, i1, i0);
            bitmatrix_cell_clr(p_matrix1, iblock);

            printf("solution ");
            printf("\n");
        }
    }

    ans = ((is_special_path logicalAnd is_at_least_one) logicalOr is_djvu);

    dep_man_level_dec(p_dep_man);
    return ans;
}

Code: [Select]
private boolean_t dep_man_loop_detect_not_recursive
(
    p_dep_man_t p_dep_man
)
{
    boolean_t          ans;
    p_stack_dep_data_t p_data;
    boolean_t          is_done0;
    boolean_t          is_done1;
    boolean_t          is_djvu;

    p_data = get_address(p_dep_man->context.fsm.data);

    do_l0_init(p_dep_man);
    is_done1 = False;
    while (is_done1 isEqualTo False)
    {
        is_djvu = do_l1_prologue(p_dep_man);
        if (is_djvu)
        {
            do_l2_djvu(p_dep_man);
        }
        else
        {
            is_done0 = False;
            while (is_done0 isEqualTo False)
            {
                is_done0 = do_l2_scan(p_dep_man);
            }
            do_l4_conclusion(p_dep_man);
        }
        is_done1 = do_l5_epilogue(p_dep_man);
    }

    ans = p_data->is.loop;
    return ans;
}

I started from the above recursive algorithm.
Once it started working, I wrote it in a non-recursive form.
Going to segregate the various processing areas { prologue, scan, deep, conclusion, epilogue }
and for each of them I created a function()
handled by a state machine that helps to simulate the recursive-call (scan->deep), with the saving and restoring of the context.

Algorithms use
  • adjacency matrix, bitboolean cell (1bit)
  • crossedge matrix, bitboolean cell (1bit) <--- for djvu-s
  • {lib_id, dep_id} FIFO
  • fsm-data stack (only for the not-recursive Algorithm)
« Last Edit: July 27, 2024, 05:40:43 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf