Tuesday, September 3, 2013

Common table expressions and circular references




   I was recently asked to do a bit of hierarchical analysis on Active Directory groups. No problem, I thought. I can extract group memberships into a table, then chuck the data through a CTE to give the hierarchy. In fact, it all went swimmingly until I hit a very basic problem; my CTE was failing when it hit the maxrecursion value. This indicated strongly that I had one or more circular references - groups that were members of themselves either directly or through inheritance.
Whilst defining the problem was easy, finding how to weed out the culprits was less so. Google seemed to give remarkably few results on how to actually find a circular reference. In fact, the solution is very straightforward. Hopefully my quick and dirty answer will be of some use.
I'll start off with a simple table containing two columns; child and parent.


CREATE TABLE ChildrenAndParents (Child [varchar](255) NOT NULL,
 Parent [varchar](255) NOT NULL)

INSERT INTO ChildrenAndParents (Child, Parent) values ('a', 'b')
INSERT INTO ChildrenAndParents (Child, Parent) values ('b', 'c')
INSERT INTO ChildrenAndParents (Child, Parent) values ('c', 'd')
INSERT INTO ChildrenAndParents (Child, Parent) values ('e', 'b')


The CTE I want to use to show the hierarchy of memberships in the table is one containing four columns; child, parent, level and hierarchypath. The definition is below:


WITH GroupMembers (Child, ParentGroup, Level, hierarchypath)
AS
(
-- Anchor member definition
SELECT g.child, g.parent, 0 AS Level, convert(varchar(max), g.child + '->' + g.parent) AS hierarchypath
FROM childrenandparents AS g
WHERE child not in (select parent from childrenandparents)
UNION ALL
-- Recursive member definition
SELECT g.child, g.parent, Level + 1, hierarchypath + '->' + g.parent
FROM childrenandparents as g
INNER JOIN GroupMembers AS gm
ON gm.parentgroup = g.child
)

select left(hierarchypath, charindex('->', hierarchypath) - 1) as child, parentgroup, level, hierarchypath
from groupmembers
option(maxrecursion 5);


This gives the result below. As you can see, it shows that, for instance, a is a member of d as a result of the path a->b->c->d.


child      parentgroup level       hierarchypath
---------- ----------- ----------- --------------------
a          b           0           a->b
e          b           0           e->b
e          c           1           e->b->c
e          d           2           e->b->c->d
a          c           1           a->b->c
a          d           2           a->b->c->d


Now, I'll put in a circular reference.

INSERT INTO ChildrenAndParents (Child, Parent) values ('d', 'b')


Of course, rerunning the CTE in this form fails, complaining with the error message


Msg 530, Level 16, State 1, Line 1



The statement terminated. The maximum recursion 5 has been exhausted before statement completion.
This is because b is a child of c, c is a child of d and d is a child of b, hence creating a hierarchy that would go on infinitely. Of course, in reality this is the point at which I encountered the problem - having a circular reference within several thousand rows and needing to find it.
The first step I took was to find all the hierarchy unaffected by the circular references. In order to do this, all I need to do is add a "where" clause to the recursive bit of the CTE so that it excludes any records where the parent is already in the hierarchical path.


WITH GroupMembers (Child, ParentGroup, Level, hierarchypath)
AS
(
-- Anchor member definition
SELECT g.child, g.parent, 0 AS Level, convert(varchar(max), g.child + '->' + g.parent) AS hierarchypath
FROM childrenandparents AS g
WHERE child not in (select parent from childrenandparents)
UNION ALL
-- Recursive member definition
SELECT g.child, g.parent, Level + 1, hierarchypath + '->' + g.parent
FROM childrenandparents as g
INNER JOIN GroupMembers AS gm
ON gm.parentgroup = g.child


--Exclude if the hierarchypath text contains the 'parent' value with a '->' before and after it
where gm.hierarchypath not like '%->' + g.parent + '->%'
)

select left(hierarchypath, charindex('->', hierarchypath) - 1) as child, parentgroup, level, hierarchypath
from groupmembers
order by level desc
option(maxrecursion 5);


I can now rerun the CTE and order the results by Level descending. This gives the following results.

child      parentgroup level       hierarchypath
---------- ----------- ----------- --------------------
e          d           2           e->b->c->d
a          d           2           a->b->c->d
e          c           1           e->b->c
a          c           1           a->b->c
a          b           0           a->b
e          b           0           e->b

Now, the whole problem with circular references is that it doesn't matter how many levels you go down, they'll still keep adding to the hierarchy path. The results above tell us that the "legitimate" data only goes down to level value 2, so anything that uses up more levels than that is data containing a circular reference.
In the CTE definition earlier, you'll see I set a maxrecursion level of 5 (i.e. it'll query recursively five times before giving up and erroring). Therefore I can now change the CTE's definition a bit to make use of this. Instead of excluding records where the parent has already appeared in the hierarchy path, I now leave them in. However, instead, I use a different "where" clause to stop the recursion above a set value. The value I choose is greater than the maximum "legitimate" level I discovered above, but below the maxrecursion value.

WITH GroupMembers (Child, ParentGroup, Level, hierarchypath)
AS
(
-- Anchor member definition
SELECT g.child, g.parent, 0 AS Level, convert(varchar(max), g.child + '->' + g.parent) AS hierarchypath
FROM childrenandparents AS g
WHERE child not in (select parent from childrenandparents)
UNION ALL
-- Recursive member definition
SELECT g.child, g.parent, Level + 1, hierarchypath + '->' + g.parent
FROM childrenandparents as g
INNER JOIN GroupMembers AS gm
ON gm.parentgroup = g.child

where level < 5
)

select left(hierarchypath, charindex('->', hierarchypath) - 1) as child, parentgroup, level, hierarchypath
from groupmembers
order by level desc
option(maxrecursion 5);

This gives the following results.

child      parentgroup level       hierarchypath
---------- ----------- ----------- --------------------
e          d           5           e->b->c->d->b->c->d
a          d           5           a->b->c->d->b->c->d
a          c           4           a->b->c->d->b->c
e          c           4           e->b->c->d->b->c
e          b           3           e->b->c->d->b
a          b           3           a->b->c->d->b
a          d           2           a->b->c->d
e          d           2           e->b->c->d
e          c           1           e->b->c
a          c           1           a->b->c
a          b           0           a->b
e          b           0           e->b


Remember that the "legitimate" data only goes up to level 2, so the top rows in this result are the ones containing the circular references. I can simply examine the string in the hierarchypath column to see where the circular reference is occurring, enabling me to see which records in my source data need to be corrected. In this case, it shows clearly the pattern of b -> c -> d occurring twice, so it must be the d -> b child/parent relationship causing the problem. Looking at the contents of the ChildrenAndParents table confirms there is a record with d as the child and b as the parent, so I've now found my problem record.

No comments:

Post a Comment