Saturday, May 20, 2006
Normalization Is Not Enough!
Some people think that normalization is the ultimate goal of databases. Pure bunkum. Your data should be structured to make typical queries tractable. A typical example is search engines. Normalization isn't sufficient if access time sucks.
Admittedly, search engines are a relatively trivial case. Regardless, query time contributes to user satisfaction and insertion time contributes to search coverage. If your schema handles both with ease then you're done. To achieve this, your schema will be normalized, or at the very least, strategically de-normalized.
This is very good but sometimes the simplest and most reasonable requests leave me stumped:
I want to do a simple query that displays categories and all it's nested subcategories in order. ... Is it possible to do this with one query or must I make a loop to get them all?
A similar request is to count subcategories of one or more categories. This is so it can be displayed in web navigation in the form "Apparel (200 products in 14 subcategories), Music (600 products in 48 subcategories)". Auction websites do it in real-time. Therefore, it is possible but it has to be a speedy operation.
Well, I knew part of the answer but not all. There's a standard technique for storing trees in database tables. Nodes are stored in one table. Each node is stored as one row. Each row has a primary key and a reference to its parent. So, each node points to its parent. So, if you want a branch of the tree then select all rows which point to a specific parent:
SELECT id, name FROM tree WHERE parent=17
The remainder of the problem (or a novel solution) was unknown to me. I thought about it for a while. Indeed, I was puzzled for days. I had a solution that worked for a finite depth but it wasn't fast. Nor did it cover the general case. My conclusion was that there was a trick but I didn't know the trick. This is "trick" in a specific context. Many apps are based on a trick which makes them possible. Wordprocessors use doubly linked lists to hold content. Spreadsheets use recursion to calculate cells. Real-time games are turn-based games where the turns are incredibly fast.
Discover the trick then all else is algorithms and ergonomics. As expected, the trick that I sought was known. A search revealed that Managing Hierarchical Data In MySQL by my colleague Mike Hillyer was the solution to this and many related problems. It also references a standard text on the subject: Trees And Hierarchies In SQL by Joe Celko. It appears that I have much to learn from this book.
Anyhow, the solution is to traverse the tree and label each node with sequential numbers. If numbers are assigned when descending and ascending the tree then each subcategory obtains a nested pair of numbers. So, for a trivial tree, a lone subcategory is numbered 2 and 3 while its parent is numbered 1 and 4. This numbering scheme is in addition to a primary key and a parent ID. It is also possible to explicitly assign depth. Therefore, you could performing the following query:
SELECT b.`id`, b.`depth`, FLOOR((b.`end`-b.`begin`-1)/2) AS subcats, b.`title` FROM `tree` AS a, `tree` AS b WHERE a.`id`=17 AND a.`begin`<=b.`begin` AND a.`end`>=b.`end` ORDER BY b.`begin`;
This obtains an array of data which is ideal for rendering. However, inserting or moving branches is no longer trivial. For example, if multiple users can modify the tree structure then you'll require table locking. However, inserting categories is deemed to occur less frequently than items within categories. So, it is worthwhile.
That was just the trivial case. Now try this case:
It would be much appreciated if any one could help me solve this problem that I've got. ... First of all my problem is that I need to create an online journey planner for able and disabled persons to tourist attractions in London. The user must specify what attraction they wish to visit, their starting station and if they are disabled on not.
We'll pick off a couple of simple tricks. Then we'll form a heinously complicated graph. Then we'll reduce the graph to the previously solved tree.
Alright, the tourist attraction can have a co-ord. The stations can also have co-ords. We can use Pythagoras to select nearest destination station from either a full set or a resticted set. This finds the 10 nearest attractions, sorted by distance squared, from the arbitrary co-ord (500,300):
SELECT (x-500)^2+(y-300)^2 AS `dist^2`, feature, url FROM attraction ORDER BY `dist^2` ASC LIMIT 10;
That alone is quite natty and can make a positive impact on people's lives. However, the next part - finding the route from the initial station to the destination station - is a very hairy problem. There's an implicit rule that we should find the quickest route. However, the definition of fastest is contentious. We can use a weighted graph to find the quickest route deterministically and without hard-coding rules to minimize changes. The time between each station is different. However, for version one, we assume that they are equal at three minutes. There's also delay each time you change train. If you want to get really pedantic this period varies inversely with train frequency. The penalty should also be different for each change at each station. However, for version one, we'll assign a fixed weighting of 10 minutes. Stations also have temporal properties, which we'll defer to version two. Some stations are closed on certain days or late in evenings. For safety, others are exit only a peak periods. This implies that return journeys may be vastly different to outbound journeys.
The graph of stations is a nightmare but it is surmountable. Some routes are circular. Some have unidirectional loops at one end. One route branches, joins itself and then fans-out! This case means that you can go directly from A to B and from B to C but you cannot go from A to C without changing. We solve this case by defining a route as one or more segments. So, it is possible to change from one segment of a route to a different segment of the same route. Having segments also solves problems with the circular route. No-one travels a complete loop of a circle, so this can be broken into two or more overlapping horseshoe segments. Alternatively, you can define multiple circuits. Both options can give duplicate results.
If we define segments as unidirectional then we can solve more cases. In the trivial case, a route has two segments - one for each direction. The unidirectional property is only useful for exception cases. So, the route with one looped end can be defined in one segment rather than two. To summarize, we decompose routes into segments and find the optimal traversal of segments. But how do we traverse the segments?
We've got a weighted, directional, fully connected graph of possible journeys. Do we query it directly? No, we map it into a tree - a decision tree. Effectively, we reduce the problem to a previously solved case. So, the top branches of the tree are the initial stations and the destinations are the leaves of the tree. From our previous trick, we can locate any node within a specified bound - or locate the fastest journey directly. However, creating this tree requires one step of pre-processing. Whenever we update the graph, we have to re-build the tree. This occurs very infrequently during normal operation. Regardless, you need a pre-processing utility which finds the quickest direct and indirect journeys from all stations to all stations.
For end users, a typical query involves finding one branch of the tree. Once you know the ID of the destination, you recurse backwards through the route to find parent IDs and descriptions. Where the maximum number of changes is known (such as this case), you can de-normalize and obtain all details from one row. Furthermore, this row can be indexed with a compound key of initial station and destination station. Furthermore, this can be combined in a single select with pre-processed co-ord matching and can even give details to and from alternative stations.
Does this trick work in the general case? No. For n nodes, it requires n^2 rows in one table. Furthermore, if you choose to de-normalize and the maximum number of interchanges is high then rows can be quite verbose. However, where complexity is bounded and updates are infrequent, mapping a graph to a special tree is an ideal solution.
In all cases (search engine, subcategories, journey planner) the database may be normalized. However, queries are only rapid if the data is suitably structured. Such structuring may not be obvious.
- Mornington Crescent - A work of collective fiction that alludes to make moves on the graph of the London Underground more challenging and studied than chess. The name is derived from the objective of finishing on the most tortuously connected node of the graph.