0

Detecting cycles in tag parent-child relationship in sqlite3 text

epilys wrote :

Problem statement

Story tags can have multiple parents. This establishes a hierarchy where a tag can have a tree of descendants. We want this directed graph to be acyclic (DAG) because cycles don't make sense in a topical hierarchy, and when we query the graph we would have to be extra cautious to not fall in infinite loops.

Database schema

Here are the tables created by django:

CREATE TABLE IF NOT EXISTS "sic_tag" (
  "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
  "name" varchar(40) NOT NULL UNIQUE,
  "created" datetime NOT NULL,
  "hex_color" varchar(7) NULL
);

CREATE TABLE IF NOT EXISTS "sic_tag_parents" (
  "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
  "from_tag_id" integer NOT NULL REFERENCES "sic_tag" ("id") DEFERRABLE INITIALLY DEFERRED,
  "to_tag_id" integer NOT NULL REFERENCES "sic_tag" ("id") DEFERRABLE INITIALLY DEFERRED
);

The table sic_tag_parents contains one row for each directed edge.

Recursive Common Table Expression (CTE) to the rescue

sqlite3 provides us with the ability to perform recursive queries with Common Table Expressions. We can use this to write a query that returns all paths within the graph. Then, when we insert a new parent-child relationship, we check if there already exists a directed path from the child to its parent.

Unfortunately it's not possible to make a table CHECK CONSTRAINT since the constraint must refer to the table, which exists only after the CREATE TABLE statement is finished. We can instead make the check in a BEFORE INSERT trigger and RAISE(ABORT) if a cycle is detected, thus preventing the insertion.

Firstly, the trigger shall execute a SELECT query for each inserted row. In order to raise an exception, we have to SELECT it:

CREATE TRIGGER cycle_check
BEFORE INSERT ON sic_tag_parents
FOR EACH ROW
BEGIN
  SELECT RAISE(ABORT, 'Cycle detected');
END

Secondly, in order to abort only when a specific condition is met, we can use a WHERE EXISTS:

CREATE TRIGGER cycle_check
BEFORE INSERT ON sic_tag_parents
FOR EACH ROW
BEGIN
  SELECT RAISE(ABORT, 'Cycle detected') WHERE EXISTS (...);
END

In the EXISTS expression we shall put a recursive CTE that returns something (anything) if a cycle exists. If anything is returned, EXISTS evaluates to true and RAISE is selected.

This is the CTE query:

WITH RECURSIVE w(parent, last_visited, already_visited, cycle) AS (
      SELECT DISTINCT to_tag_id AS parent, from_tag_id AS last_visited, to_tag_id AS already_visited, 0 AS cycle FROM sic_tag_parents

      UNION ALL

      SELECT t.to_tag_id AS parent, t.from_tag_id AS last_visited, already_visited || ', ' || t.to_tag_id, already_visited LIKE '%'||t.to_tag_id||"%" FROM sic_tag_parents AS t JOIN w ON w.last_visited = t.to_tag_id
      WHERE NOT cycle
)
SELECT already_visited, cycle FROM w WHERE last_visited = NEW.to_tag_id AND already_visited LIKE '%'||NEW.from_tag_id||'%'

Line-by-line:

WITH RECURSIVE w(parent, last_visited, already_visited, cycle) AS..

This defines a temporary table/view called w with these columns as an expression. This is similar to defining regular views as the result of a SELECT statement.

SELECT DISTINCT to_tag_id AS parent, from_tag_id AS last_visited, to_tag_id AS already_visited, 0 AS cycle FROM sic_tag_parents

Here we select every edge as the start of a potential path. These edges are paths of length 1, and are also acyclic, hence we select 0 as cycle. This line is performed one time at the beginning of the CTE, it's the basis of the recursion.

UNION ALL

The base select is united with the select that follows:

SELECT t.to_tag_id AS parent, t.from_tag_id AS last_visited, already_visited || ', ' || t.to_tag_id, already_visited LIKE '%'||t.to_tag_id||"%" FROM sic_tag_parents AS t JOIN w ON w.last_visited = t.to_tag_id
WHERE NOT cycle

Here we JOIN sic_tag_parents with w itself, by adding edges to the already existing paths in the previous iteration of w. The path is built as a string of comma separated ids. cycle is true if to_tag_id, the parent, already exists in the previous path. The WHERE NOT cycle is necessary to prevent infinite recursion.

SELECT already_visited, cycle FROM w WHERE last_visited = NEW.to_tag_id AND already_visited LIKE '%'||NEW.from_tag_id||'%'

Finally, we select the paths that begin with the parent and contain the child.

The entire trigger:

CREATE TRIGGER cycle_check
BEFORE INSERT ON sic_tag_parents
FOR EACH ROW
BEGIN
    SELECT RAISE(ABORT, 'Cycle detected') WHERE EXISTS (
      WITH RECURSIVE w(parent, last_visited, already_visited, cycle) AS (
            SELECT DISTINCT to_tag_id AS parent, from_tag_id AS last_visited, to_tag_id AS already_visited, 0 AS cycle FROM sic_tag_parents

            UNION ALL

            SELECT t.to_tag_id AS parent, t.from_tag_id AS last_visited, already_visited || ', ' || t.to_tag_id, already_visited LIKE '%'||t.to_tag_id||"%" FROM sic_tag_parents AS t JOIN w ON w.last_visited = t.to_tag_id
            WHERE NOT cycle
      )
      SELECT already_visited, cycle FROM w WHERE last_visited = NEW.to_tag_id AND already_visited LIKE '%'||NEW.from_tag_id||'%'
    );
END;

Appendix

We can visualise the paths with a CTE.

First, create a utility view that associates edges with their corresponding tag names:

CREATE TEMPORARY VIEW sic_tag_parents_name AS select f.name AS from_name, from_tag_id, t.name AS to_name, to_tag_id from sic_tag_parents join sic_tag AS f on f.id = from_tag_id join sic_tag AS t on t.id = to_tag_id;

Sample output:

sqlite> SELECT * FROM sic_tag_parents_name LIMIT 2;
from_name    from_tag_id  to_name           to_tag_id
-----------  -----------  ----------------  ---------
programming  57           computer science  102
debugging    137          programming       57

Then write a similar CTE as a view that creates a path string out of the tag names:

CREATE TEMPORARY VIEW sic_tag_parents_paths AS 
WITH RECURSIVE w(parent, last_visited, parent_name, last_visited_name, already_visited, cycle, length) AS (
      SELECT DISTINCT to_tag_id AS parent, from_tag_id AS last_visited, to_name AS parent_name, from_name AS last_visited_name, to_name AS already_visited, 0 AS cycle, 1 as length FROM sic_tag_parents_name

      UNION ALL

      SELECT t.to_tag_id AS parent, t.from_tag_id AS last_visited, t.to_name, t.from_name,  t.to_name || ' > ' || already_visited, already_visited LIKE '%'||t.to_name||"%", length+1 AS length FROM sic_tag_parents_name AS t JOIN w ON w.last_visited = t.to_tag_id
      WHERE NOT cycle
)
SELECT * FROM w;

The output:

sqlite> select distinct(last_visited_name || ' > ' || already_visited) as path, length from sic_tag_parents_paths order by length desc limit 25;
path                                                                 length
-------------------------------------------------------------------  ------
c > programming languages > programming > computer science           3
c++ > programming languages > programming > computer science         3
clojure > programming languages > programming > computer science     3
elixir > programming languages > programming > computer science      3
erlang > programming languages > programming > computer science      3
go > programming languages > programming > computer science          3
haskell > programming languages > programming > computer science     3
javascript > programming languages > programming > computer science  3
java > programming languages > programming > computer science        3
lisp > programming languages > programming > computer science        3
lua > programming languages > programming > computer science         3
python > programming languages > programming > computer science      3
ruby > programming languages > programming > computer science        3
rust > programming languages > programming > computer science        3
swift > programming languages > programming > computer science       3
emacs > commandline > unix > operating systems                       3
emacs > commandline > unix                                           2
debugging > programming > computer science                           2
programming languages > programming > computer science               2
vcs > programming > computer science                                 2
commandline > unix > operating systems                               2
linux > unix > operating systems                                     2
openbsd > unix > operating systems                                   2
dragonflybsd > unix > operating systems                              2
sqlite3 > databases > computer science                               2

You must be vouched for by a vouched user to participate.