2016-12-24

Resolving directed acyclic graph with ordered edges in SQL

Suppose you have a table holding the adjacency set of a directed acyclic graph (DAG). It has two columns, parent and child, each row indicating an edge from one node to another. With a few temporary tables, it's easy enough to resolve any set of nodes into a fuller set of those nodes and all their descendants. You can use it to represent group membership, and find rules that match a given node directly or indirectly. You can also create a deterministic function to do the resolution for you, and use it in triggers to prevent cycles.

Now add another column, seqno, that orders the direct children of a node. Now you want to resolve a node to a sequence with depth-first traversal. SQL doesn't do recursion, so how do you go about it? Here's some example data:

parent child seqno
1 2 0
1 3 1
1 4 2
2 4 0
4 5 0
4 6 1

So 1 imports 2, 3 and 4, but 2 also imports 4, which also imports 5 and 6. Although 1 directly imports 4 after 3, 4 should appear before 3 in the result. The expected sequence is 1, 2, 4, 5, 6, 3, 4, 5, 6, from which we remove the later duplicates to give 1, 2, 4, 5, 6, 3.

Here's a technique that scans the graph in two passes. In the first, we descend level by level, looking for the highest seqno of any node's outgoing edges (the maximum fanout) at that level. In our example, that gives us 3, 2, 2, 0. We then work back, starting with a counter of zero, multiplying the number in the table by the counter and adding one, which becomes the new counter: 0×0+1=1; 2×1+1=3; 2×3+1=7. Note that we don't use the first fanout 3. We reverse the result 1, 3, 7, and associate these factors with each level of the graph, so level 1 maps to 7, 2 to 3, and 3 to 1.

Now we start the second pass, this time keeping a computed sequence number. The root element 1 is assigned 0 as its sequence number. Then we find its children are 2, 3 and 4 in sequence 0, 1 and 2. We assign their sequence numbers as the root's assigned number 0, plus 1, plus the original sequence number times the factor at the current level: so 2 is placed at 0+1+0×7=1, 3 is placed at 0+1+1×7=8, and 4 is placed at 0+1+2×7=15.

When we move to the next level, using factor 3, 2 at 1 imports 4 with seqno 0, so 4 is placed at 1+1+0×3=2. We now have 4 at both 2 and 15, so we can lose the later one at 15. That also puts 4 (at 2) before 3 (at 8).

Essentially, what we're doing is leaving a space at each level for a full tree (but with potentially different fanout at each level). We then use the highest factor at the top level to spread the first generation out, leaving enough gap between two siblings for the descendants of the 'older' sibling.

Here's the body of a stored function that takes one node id, and resolves it to a sequence that represents depth-first traversal, with later duplicates removed. Let's hope I've got it right!:

BEGIN
DECLARE idlist TEXT DEFAULT '';
DECLARE scl INT UNSIGNED;
DECLARE lyr INT UNSIGNED;
DECLARE prv INT UNSIGNED;

-- Create a table to hold fan-outs for each level of the graph.
DROP TEMPORARY TABLE IF EXISTS fanouts;
CREATE TEMPORARY TABLE fanouts
   (layer INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    max_fanout INT UNSIGNED,
    factor INT UNSIGNED DEFAULT NULL);

-- Create a table to hold candidates, and populate it with the root
-- node.
DROP TEMPORARY TABLE IF EXISTS cands;
CREATE TEMPORARY TABLE cands (id INT UNSIGNED);
INSERT INTO cands (id) VALUES (root);

-- Create a table to hold parents of the current candidates.
DROP TEMPORARY TABLE IF EXISTS parents;
CREATE TEMPORARY TABLE parents (id INT UNSIGNED);

-- Work out the fanouts for each level.
WHILE EXISTS (SELECT id FROM cands) DO
   -- Get the maximum fan-out at this level.
   INSERT INTO fanouts
      SELECT NULL AS layer,
             MAX(seqno) + 1 AS max_fanout,
             NULL AS factor
         FROM edges
         WHERE parent IN (SELECT id FROM cands)
         GROUP BY parent
         ORDER BY max_fanout DESC
   LIMIT 1;

   -- Work out what's at the next level.
   INSERT INTO parents
      SELECT edges.child AS id
         FROM cands
   INNER JOIN edges
     ON edges.parent = cands.id;

   -- Make the next level the current level, and reset the next level.
   DELETE FROM cands;
   INSERT INTO cands
      SELECT DISTINCT id FROM parents;
   DELETE FROM parents;
END WHILE;

-- Work out the scale factor for each level.
SET scl = 0;
SET prv = 0;
SELECT MAX(layer) FROM fanouts INTO lyr;
WHILE lyr > 0 DO
   UPDATE fanouts
      SET factor = scl * prv + 1
      WHERE layer = lyr;
   SELECT max_fanout, factor
      FROM fanouts
      WHERE layer = lyr
      INTO prv, scl;
   SET lyr = lyr - 1;
END WHILE;



-- SECOND PHASE


-- This table holds the final result.  seqno defines the order.
DROP TEMPORARY TABLE IF EXISTS done;
CREATE TEMPORARY TABLE done
   (id INT UNSIGNED, seqno INT UNSIGNED);

-- This table lists nodes that we have to add to the result.
DROP TEMPORARY TABLE IF EXISTS cands;
CREATE TEMPORARY TABLE cands
   (id INT UNSIGNED, seqno INT UNSIGNED);

DROP TEMPORARY TABLE IF EXISTS parents;
CREATE TEMPORARY TABLE parents
   (id INT UNSIGNED, seqno INT UNSIGNED);

INSERT INTO cands (id, seqno) VALUES (root, 0);

SET lyr = 1;
WHILE EXISTS (SELECT id FROM cands) DO
   SELECT factor FROM fanouts WHERE layer = lyr INTO scl;
   SET lyr = lyr + 1;

   INSERT INTO parents
      SELECT child AS id,
             cands.seqno + 1 + scl * edges.seqno AS seqno FROM cands
   JOIN edges
     ON edges.parent = cands.id
   ORDER BY cands.seqno ASC, edges.seqno ASC;

   -- Acknowledge that we have the parents of the current candidates.
   INSERT INTO done
      SELECT id, seqno FROM cands;
   DELETE FROM cands;

   -- Keep the earliest entry from done and parents.
   DELETE FROM parents
      WHERE id IN (SELECT id FROM done WHERE done.seqno < parents.seqno);
   DELETE FROM done
      WHERE id IN (SELECT id FROM parents WHERE done.seqno > parents.seqno);

   -- Make the parents the new candidates.
   INSERT INTO cands
      SELECT id, seqno FROM parents;
   DELETE FROM parents;
END WHILE;

SELECT GROUP_CONCAT(id ORDER BY seqno ASC SEPARATOR ',')
    FROM done
    INTO idlist;
RETURN idlist;
END;

Happy now?

No comments:

Post a Comment