Jolt.NET Development Blog: 2009/08

In my last post, I sketched out the algorithms dealing with XML equality and equivalency assertions. During the testing of my implementations, I discovered a fundamental problem with the XML equivalency algorithm that made it a nuisance to use: the reporting of the inequivalent element while ignoring element sequence can potentially be too generic and not very usable. In this post, I will describe some of the problems I encountered and the currently-implemented remediation. For conciseness, use of the the term inequivalent will imply that only element sequence ordering is ignored.

XML Equivalency Assertion Algorithm

When we attempt to determine if a list of nodes is equivalent to another list of nodes, we generally perform a set-equality operation on two list-like data structures, comparing each element in one list is to every element in the other list. Furthermore, given that the elements are not ordered, we can’t stop searching for an element once we detect the first mismatch since the element may exist further ahead in the list of elements to search. Reporting the precise element causing the inequivalency thus becomes difficult since we need to store extra information: should we stop the search because the mismatch is indeed the mismatch we are looking for, or should we ignore the mismatch so we can continue searching for the inequivalency in the rest of the tree?

Here is an algorithm that attempts to address the requirements of the XML equivalency assertion, as noted above.


void AreEquivalent(XElement expected, XElement actual)
{
  HashSet<XElement> expectedElements = new HashSet<XElement>(expected.Elements());
  HashSet<XElement> actualElements = new HashSet<XElement>(actual.Elements());

  if (expectedElements.Count != actualElements.Count)
  {
    // Report 'actual' as offending element.
    throw new Exception();
  }

  foreach (XElement expectedChild in expectedElements)
  {
    XElement equivalentChild = actualElements.FindFirstOrDefault(e =>
    {
      try { AreEquivalent(expectedChild, e); }
      catch(Exception) { return false; }
      return true;
    });

    if (equivalentChild == null)
    {
      // Report 'expectedChild' as offending element.
      throw new Exception();
    }

    actualElements.Remove(equivalentChild);
  }
}

This algorithm, hereon mentioned as AreEquvialent, runs in quadratic time (expected for set comparison of two lists), and leverages the optimization that only the first inequivalent element need to be reported. If all inequivalent elements were to be reported, we would need to implement one of the many tree-diff algorithms, all of which have greater runtime complexity.

Reporting Offending Elements

Consider the following XML snippets which are clearly inequivalent due to the presence of an unexpected child element from the root node.


<!-- Expected XML -->
<parent>
  <descendant xmlns="ns"/>
  <descendant xmlns="ns"/>
  <childNode/>
</parent>


<!-- Actual XML -->
<parent>
  <bogusNode/>
  <descendant xmlns="ns"/>
  <descendant xmlns="ns"/>
</parent>

We want to report that bogusNode is the offending element, but at the same time we want to avoid categorizing the following related XML snippets as invalid.


<!-- Expected XML -->
<root>
  <parent>
    <descendant xmlns="ns"/>
    <descendant xmlns="ns"/>
    <childNode/>
  </parent>
  <parent>
    <descendant xmlns="ns"/>
    <descendant xmlns="ns"/>
    <bogusNode/>
  </parent>
</root>


<!-- Actual XML -->
<root>
  <parent>
    <bogusNode/>
    <descendant xmlns="ns"/>
    <descendant xmlns="ns"/>
  </parent>
  <parent>
    <descendant xmlns="ns"/>
    <childNode/>
    <descendant xmlns="ns"/>
  </parent>
</root>

At a first glance, AreEquvialent solves the problem of reporting bogusNode from the first pair of XML documents. However, although AreEquvialent detects the offending element, it fails to report this element when the recursive calls unwind back the the original caller. If you trace AreEquvialent for the examples above, you will notice that it always reports the root element of the XML as the offending element, for any inequivalent pair of XML documents.

Algorithm Improvement

The problem with AreEquvialent is if the condition that detects the offending element evaluates to true in one stack frame, it will also evaluate to true again in the preceding stack frame, until the recursive call completely unwinds. In other words, reporting that bogusNode is inequivalent also causes its parent to be reported as inequivalent through the same condition – see lines 9, 17, and 24..

I spent a considerable amount of time modifying AreEquvialent in an attempt to address this problem. Unfortunately, many of my attempts failed as they either reintroduced the condition issue in another manner, or failed to address the issue of premature search termination. I ultimately realized that while a single-pass through the XML is desirable, it was simply insufficient for what the problem demanded.

As you may recall, the algorithm for strict equality is a linear-time algorithm that positionally compares elements from two documents. If the elements at a given position are not equal, the algorithm terminates immediately and classifies the documents as not equal. Ideally, AreEquvialent could behave this way if the element ordering of the two documents were similar. So, a new solution to the equivalency problem is as follows.

Order the elements of a document such that they match the order of elements in the reference document as best as possible
Run a the equality algorithm with the reference and transformed documents
- The algorithm terminates upon detection of the first inequivalent node, or if the two documents are equal

What does “as best as possible” mean? It means that we need AreEquvialent to do some additional work in attempting to detect candidate inequivalent elements during the canonicalization step. For instance, if the best match for a given element is an inequivalent one that differs only in number of child elements, we would want to reorder this element so that the equality algorithm reports a specific and relevant error: inequivalency due to mismatch of number of child elements, for the specific elements involved.

Here is the algorithm that performs the canonicalization of element order. It runs in quadratic-time and is invoked recursively prior to calling AreEquvialent.


void NormalizeElementOrder(IEnumerable<XElement> expectedChildren, List<XElement> actualChildren)
{
  int nextUnorderedElementIndex = 0;  
  foreach (XElement expectedChild in expectedChildren)
  {
    int numChildElements = expectedChild.Elements().Count();

    Predicate<XElement> isEquivalentToExpectedChild = new Predicate<XElement>(e =>
 {
   // Similar to AreEquivalent, but without the recursive call to determine child element equivalency.
   try { AreElementsEquivalent(expectedChild, e); }
   catch (Exception) { return false; }
   return true;
 });
    Predicate<XElement> isEquivalentToExpectedChild_Strict =
   e => isEquivalentToExpectedChild(e) &&
           numChildElements == candidateElement.Elements().Count();

    // Look for the element that best resembles expectedChild considering the configuration
    // the assertion, along with the number of child elements for expectedChild.
    int equivalentChildIndex = actualChildren.FindIndex(nextUnorderedElementIndex, isEquivalentToExpectedChild_Strict);

    if (equivalentChildIndex < 0)
    {
      // We couldn't find a matching element.  Relax the search criteria
      // by removing the matching child element count constraint and try again.
      equivalentChildIndex = actualChildren.FindIndex(nextUnorderedElementIndex, isEquivalentToExpectedChild);
      if (equivalentChildIndex < 0) { return; }   // No match found, stop normalizing element order.
    }

    SwapElement(actualChildren, equivalentChildIndex, nextUnorderedElementIndex);
    ++nextUnorderedElementIndex;
  }
}

Is this the best solution? I hope so! Readers, please share your thoughts on this algorithm along with ideas for improving it.

Now that the equivalency algorithm is implemented and tested, I can resume working on the NUnit and VSTS adapters to the core assertion library. I am hoping to have the relevant sources sources committed within the next week or so.

Jolt.NET Development Blog

About Jolt.NET Libraries

Jolt.NET 0.3 and Future Features

Error Reporting for Xml Equivalency Assertion

Search

About Me

Resources

Archive