Prefix Trees and Parsers

Prefix Trees and Parsers

My first idea was to actually lex and parse routes just like we do with programming languages. I already had a lexer infrastructure built (Using the Trie based approach), so why not use it?

I wrote some glue code to parse a set of routes into components, then generate a set of lexer tokens and parser rules out of it. I started with a few routes (3) just to see how things went:

$routes = [
“/user/{name}” => “User::findByName”,
“/user/{name}/{id:[0-9]+}” => “User::findByNameAndId”,
“/post/{name}” => “Post::findByName”,
];

I parsed these routes, and produced a series of tokens:

$tokens = [
T_USER => “user”,
T_POST => “post”,
T_NUMBER => “[0-9]+”,
T_SLASH => “/”,
T_ANY =>”[^/]”,
];

I also then generated a series of parser rules:

$parserRules = [
[[T_SLASH, T_USER, T_SLASH, T_ANY], “User::findByName”],
[[T_SLASH, T_USER, T_SLASH, T_ANY, T_SLASH, T_NUMBER], “User::findByNameAndId”],
[[T_SLASH, T_POST, T_SLASH, T_ANY], “Post::findByName”],
];

It should be easy to see how that works. It makes our parser behave exactly like our lexer. Which means we can re-use code! Yay!

$parser = new Trie;
foreach ($parserRules as $rule) {
$node = $parser;
foreach ($rule[0] as $token) {
if (!isset($node->data[$token])) {
$node->data[$token] = new Trie;
}
// Reset the node for the next character
$node = $node->data[$token];
}
$node->value = $rule[1];
}

Now, we can parse easily:

function parse(array $tokens, Trie $root) {
$length = count($tokens);
$i = 0;
$node = $root;
// We want to iterate over the entire string.
while ($i < $length) { // Get the current character $token = $tokens[$i]; if (isset($node->data[$token])) {
// We have a valid next token
$i++;
// Move to the next state
$node = $node->data[$token];
}else {

// We can’t continue parsing this node
// Since the URL must terminate with a single
// parsed result, we return false;
return false;
}
}
return $node->value;
}

So, we generate a lexer as before, and now with our generated parser, our router becomes:

public function route($url) {
$tokens = lex($url, $this->lexerRoot);
return parse($tokens, $this->parserRoot);
}

How do we solve this problem? Well, I decided to optimize the “default” case of [^/]. So I did two things. First, I added a member to the Trie implementation called “default”:

class Trie {
public $data = [];
public $value = false;
public $default = false;
}

If the current character isn’t in the $data array, then check to see if there’s a default value. If there is, treat it like a match.

The class looks very similar:

class Radix {
public $length = 0;
public $data = [];
public $value = false;
public $default = false;
}

But now we have this new “length” property. It stores the length of the longest subkey. Let’s look at a simple example, “apple” and “ape”.

$root = new Radix;
$root->data[“ap”] = new Radix;
$root->data[“ap”]->data[“e”] = new Radix;
$root->data[“ap”]->data[“e”]->value = “T_APE”;
$root->data[“ap”]->data[“ple”] = new Radix;
$root->data[“ap”]->data[“ple”]->value = “T_APPPLE”;
$root->data[“ap”]->length = 3; // strlen(“ple”)
$root->length = 2;

In other words, we now have a tree using the longest common prefix for the children.

About the author

admin administrator

Leave a Reply