Fonto deep dive: Optimizing an XPath

A poorly performing application is pretty much always something which upsets users. And they should be; why would they need to wait for an application to finish processing? Writing high performances software is hard: it takes time, skill, discipline, and careful craftsmanship. Whether you’re writing a database application or a microservice that implements some algorithm, making it fast takes effort.

Fonto is no exception to that rule. We spend a lot of time optimizing the internals of our platform but we also help our partners optimize their applications.

Case: Unintentional inefficient XPath

Disclaimer: Although this case is extracted with consent from a partner application, this post is by no means blaming that partner. We have made the same mistakes at Fonto. I choose to use this example because it is easy to relate to.

Today I want to share an simple example of a few lines of codes that seemingly make sense but actually contains some hidden performance pitfalls. Take a look at the following code:

function removeCorrespondingXref (stepData, blueprint, format) {
    var contextNode = blueprint.lookup(stepData.contextNodeId);
    var idAttribute = contextNode.getAttribute('id');
    var query = "//xref[@rid='" + idAttribute + "']";
    var nodes = evaluateXPathToNodes(query, contextNode, blueprint);
    removeNodes(nodes, blueprint, format, true);
    return CustomMutationResult.ok();
}

This code searches for xref elements having a rid attribute value whos value matches the id attribute value of the context node. The found nodes are in turn removed from the document. Why this code is used is not of importance now but what is important is that this code is in a hot code path.

Problem 1: evaluateXpathToNodes

The first problem in this code is that the query is constructed by concatenating the value of the id attribute into a XPath query string. The concatenation in itself is not the problem but rather what will happen in the evaluateXpathToNodes function.

The first step is to parse the query from a string into an Abstract Syntax Tree (AST). The AST is an representation of the query that can be executed by the engine. Explaining all the details is subject for another post but for the purpose of this post it is sufficient to say that the parse step is slow compared to executing the AST. So what Fonto internally does is caching the parsed AST for each query.

The problem is that the query varies by the value of the id attribute. This means that there is a one-to-one correspondence between the id attribute value and the number of parses we have to perform. This effectively renders the cache useless.

The fix is quite straight forward: make the query static and pass in the id value as a variable. As demonstrated in the following code:

var query = "//xref[@rid=$id]";
function removeCorrespondingXref (stepData, blueprint, format) {
    var contextNode = blueprint.lookup(stepData.contextNodeId);
    var idAttribute = contextNode.getAttribute('id');
    var nodes = evaluateXPathToNodes(query, contextNode, blueprint,
        {
            id: idAttribute
        });
    baseFlowPrimitives.removeNodes(nodes, blueprint, format, true);
    return CustomMutationResult.ok();
}

We changed the query to be constant meaning that Fonto’s cache can now do it’s job properly. The query is parsed only once which amortizes the costs across all executions of the function.

There is another problem with this approach as it creates a security vulnerability. Assuming the attacker can control the value of the id attribute, he/she can execute arbitrary XPath code on behave of the unsuspecting user. The impact is low but it is still an attack vector. This vulnerability is similar to an SQL injection. Fortunately our change fixed both problems at the same time.

Our recommendation: Avoid generating queries dynamically, use variables instead.

Problem 2: query performance

On to the second problem; Lets take a closer at the query itself: //xref[@rid=$id]. It matches any xref element which has an rid attribute with a value that is equal to the value of the $id variable we just inserted.

This query performs a full document scan, meaning it will visit each node in the entire document. Making this a O(n) operation where n is the number of nodes in the document. This also means that the bigger the document, the slower this query will perform.

Can we do better? Yes, how does O(1) sound? This would mean that the query would execute in the exactly same time regardless of the number of nodes in the document.

Wouldn’t it be good if Fonto somehow knew which nodes have that specific id value as its rid attribute value? This is where the attribute index facility comes in.

addAttributeIndex(
    null,
    'rid',
    'http://app',
    'rid-lookup'
);var query = "Q{http://app}:rid-lookup($id)[self::xref]";
function removeCorrespondingXref (stepData, blueprint, format) {
    var contextNode = blueprint.lookup(stepData.contextNodeId);
    var idAttribute = contextNode.getAttribute('id');
    var nodes = evaluateXPathToNodes(query, contextNode, blueprint,
        {
            id: idAttribute
        });
    baseFlowPrimitives.removeNodes(nodes, blueprint, format, true);
    return CustomMutationResult.ok();
}

The addAttributeIndex function registers the index with Fonto. It works by making a new XPath function available called rid-lookup in the http://app namespace. This function will, given an input value, return all element with an rid attribute matching the input value. Fonto automatically maintains that index once added.

The second change is in the query itself: Q{http://app}:rid-lookup($id)[self::xref]. Essentially we reversed the query: first find all elements matching the $id value given our newly created rid-lookup function and then filter by xref element name.

Granted, I was hand-waiving earlier and it is technically not O(1) but still O(n). The difference is in the n. In the original code the n was the total of number of nodes in the document while in the optimized implementation n is the number of nodes that have an rid attribute with the specified id value. Which is typically a very small number in this particular application.

Our recommendation: Avoid a full document scan if you’re looking for element(s) with an attribute with an particular value, use an attribute index instead. The (r)id lookup is a poster child for this optimization.

Conclusion

We looked at a small function that was executed on a hot code path which scaled badly. First, we amortized the parsing cost of an XPath query across all executions of the function. Secondly, we made the runtime of the function give or take o(1). I didn’t provide measurements in this post because it was quite big without going into that level of detail but the impact of these changes was significant.

I hope this post is insightful for you. Feel free to reach out to me or the Fonto team if you have any questions.

Bert Willems

Product Architect @ Fonto. Passionate about structured content, web technology, computer sciences and natural language processing