Monday, February 12, 2007
Don't Forget The WHERE Clause
We've all done it, but possibly not on this scale. Someone may have forgotten a WHERE clause: one bank customer received 75000 copies of a bank statement. At 27 pence per letter, that cost about £20000 (US$40000) just for postage. Unfortunately, it wasn't an isolated incident. In the same fortnight, the UK Government managed to fubar a mail merge to 26000 pensioners. In both cases, quality control seems to have been evaded or absent.
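One defensive habit against exactly this mistake is to wrap risky statements in a transaction and inspect the affected row count before committing. This is a sketch of the pattern only; the table and column names are invented for illustration:

```sql
-- Hypothetical mailing table; the point is the pattern, not the schema.
START TRANSACTION;

UPDATE statement_queue
   SET recipient_id = 42
 WHERE batch_id = 1001;   -- the clause it is so easy to forget

-- Sanity-check how many rows were touched before making it permanent.
SELECT ROW_COUNT();

-- If the count is plausible, COMMIT; otherwise, ROLLBACK.
```

Note that this only helps on transactional tables (InnoDB); on MyISAM the UPDATE takes effect immediately.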
Friday, December 15, 2006
Here's a mystery festive stored procedure. Don't run it on any critical servers.
Monday, November 13, 2006
Moderate A Yahoo Group (Part 2)
As mentioned in a previous post, the poll for moderating http://groups.yahoo.com/group/mysql/ will be closed soon. At present, the role will be awarded to the single volunteer so far, who would be glad to share it.
Wednesday, November 01, 2006
Moderate A Yahoo Group
There is a huge number of support forums for MySQL, such as MySQL's Forums, and hundreds more for regional, language and application specific discussion.
However, the MySQL group on Yahoo seems to have been abandoned. This is quite a shame, because the group has peaked at more than 3000 messages per month and the archive holds thousands of messages. A search for PHP alone returns more than 5000 messages. Anyhow, I asked Yahoo's support what could be done to re-activate this forum. Actually, it's quite simple: it's done democratically:
Thank you for writing to Yahoo! Groups.
Since the group currently has no active moderator, it may be possible for us to appoint a moderator to manage the group.
If you would like a new moderator for the group, please join the group and start a poll listing a few members, who would like to be a moderator, as choices for the poll. Please announce the poll to your group and ask the members to vote on who they would like to be moderator. Once the poll has closed, please email me back and I will appoint the "elected" moderator.
Thank you again for contacting Yahoo! Customer Care.
We can do this in steps:
- Step 1: Create a poll to find people interested in moderating.
- Step 2: Create a poll for group members to choose their favorite moderator.
- Step 3: Get Yahoo to appoint the moderator.
- Step 4: Due to the previous high volume of the group and the increase in spam, it would be prudent if the appointed moderator assigned others to help.
Step 1 is already in progress. This poll will be closed in a week or so, after I return from a consulting engagement. I'll step forward only if no-one else wants the role.
So, how would you like to moderate or help moderate a forum that could exceed 100 messages per day?
Monday, October 23, 2006
Compiling The LAMP Stack From Source
If you like compiling from source then a collection of scripts may save you a considerable amount of effort, especially when it comes to repeating the process when you upgrade.
Tuesday, September 12, 2006
A client informed me of TheDailyWTF and I've subsequently read it quite frequently. It often features bad SQL of all dialects. It also features stupid business rules, management blunders, bad authentication, bad validation, race conditions and dodgy error messages. You'll certainly remember what not to do, and when the forum comments aren't sarcastic and puerile you'll also gain insight into good practice.
If you're completing the survey then please mention your MySQL experience :-)
Friday, June 30, 2006
Importing A Variable Number Of Fields
One of our clients is working on a cool astrophysics project and wants to import large quantities of observations. Fortunately, the data is in a simple percent delimited format:
SEE%5%8
XYZ%6%8
ZZZ%7
The fields represent the time, the instrument, the type of observation (solar flare, brightness measurement), the co-ordinates and the result. There are two types of problem with this format. Firstly, the data is not normalized - it's just a flatfile. Secondly, many of the fields have a context which depends on other fields. Formats with these properties are quite common in the wild, but how do we handle them?
The following proof of concept code allows percent delimited data to be read directly by MySQL Server and splits one type of data to a separate table. This is achieved with a case statement in a MySQL Server 5.0 stored procedure, called from a trigger:
DELIMITER //
CREATE TABLE observation (
  id INT PRIMARY KEY AUTO_INCREMENT,
  instrument CHAR(20) NOT NULL,
  x INT,
  y INT
) //
CREATE TABLE observation_xyz (
  id INT PRIMARY KEY AUTO_INCREMENT,
  x INT,
  y INT
) //
CREATE PROCEDURE classify (IN id INT, IN instrument CHAR(20), IN x INT, IN y INT)
BEGIN
  CASE instrument
    WHEN 'XYZ' THEN
      BEGIN
        INSERT INTO observation_xyz (id,x,y) VALUES (id,x,y);
      END;
    ELSE
      BEGIN
      END;
  END CASE;
END //
CREATE TRIGGER observation_classify_after AFTER INSERT ON observation
FOR EACH ROW CALL classify (NEW.id, NEW.instrument, NEW.x, NEW.y) //
LOAD DATA INFILE 'observations.txt' REPLACE INTO TABLE observation
  FIELDS TERMINATED BY '%'
  (@ins,@x,@y)
  SET instrument=@ins, x=@x, y=@y //
SELECT * FROM observation //
SELECT * FROM observation_xyz //
Import and export files are visible to the server, not necessarily the client. So, a small amount of PHP would allow form upload and generate a query which references the uploaded file on the server.
Finally, for deployment, the import table could use the Blackhole storage engine. This would eliminate tidying tasks.
Summarizing Numerous Rows
It is often the case that web applications display a paginated resultset. The most familiar example is search engine results. However, the usefulness of the results is in providing both detail and distinctness. SELECT DISTINCT isn't sufficient in this case because collapsed rows cannot be distinguished. Furthermore, it would be advantageous to have a summarizing threshold.
For this reason, I've created an example query which summarizes strictly more than 30 rows. This query uses the world database to display continents and countries. The query summarizes continents with more than 30 countries. A flag is provided so that an application can identify and display the summarized rows differently:
(SELECT continent, NULL AS country, 1 AS more
   FROM country
  GROUP BY continent
 HAVING COUNT(continent)>30)
UNION
(SELECT c0.continent, c0.name, 0 AS more
   FROM country AS c0,
        (SELECT continent
           FROM country
          GROUP BY continent
         HAVING COUNT(continent)<=30) AS c1
  WHERE c0.continent=c1.continent)
ORDER BY continent ASC, country ASC;
The query works by making a UNION of abbreviated and detailed results. The detailed half joins the country table against the list of continents that fall below the summarizing threshold.
For web applications, pagination can be performed by appending LIMIT <offset>, <quantity> . However, this requires the entire result set to be obtained and sorted, and ORDER BY on large result sets is relatively processor intensive. The ORDER BY does ensure that output remains deterministic after additional INSERTs; if that doesn't matter to you, it could be omitted.
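As a sketch, assuming twenty rows per page, page three of a plain country listing from the same world database could be fetched like this:

```sql
SELECT continent, name
  FROM country
 ORDER BY continent ASC, name ASC
 LIMIT 40, 20;   -- skip the first 40 rows, return the next 20
```

The offset grows with the page number, which is why the whole sorted result set has to be produced even for a late page.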
If you're trying to generate output similar to search engines then be aware that search engines use a collection of tricks to significantly reduce sort and pagination overhead.
Monday, June 05, 2006
Get In The Middle Of A Chain Reaction! Part One
After reading about recursion and stack frames, I decided that I'd re-implement a computer strategy game. This fun game has various implementations and various names, including Chain Reaction and Atom Smasher. However, few, if any, have implemented a network version with the game logic written in MySQL stored procedures.
This strategy game is addictive, hard to describe and simple to play. The board consists of a number of territories. The objective is to win all of the territory. Players gain territory by placing counters in territories, one per turn. Once a territory is claimed by a player, a competing player cannot claim the territory directly. Instead, when a player's turn causes counters to exceed the number of adjacent territories, the counters are redistributed among the adjacent territories and those territories are gained by the player. (In some implementations, this threshold condition occurs when counters equal rather than exceed.) During a game, the overspill often occurs recursively. Indeed, this is the crux of the game and it causes the state of play to swing rapidly. In this regard, the game is comparable to Othello, also known as Reversi. However, the complexity of Chain Reaction is such that a player can win with the most unlikely move.
Before you begin, you need to add the following to /etc/my.cnf and re-start MySQL Server 5.0 to enable recursive stored procedures:
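A minimal sketch of that configuration, assuming the relevant server option is max_sp_recursion_depth (it defaults to 0, which forbids a stored procedure calling itself; 255 is the documented maximum):

```ini
[mysqld]
max_sp_recursion_depth = 255
```

The procedure below also guards itself with an explicit depth limit, so the server setting merely needs to be at least as generous.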
Next we consider the board and the game engine. In trivial implementations, the board is a square grid. In more advanced versions, the grid may have holes. Furthermore, the board can be *huge*. I've played on boards which were 20×20 cells. It would be typical to implement these boards as a two dimensional array in a desktop application. The MySQL implementation will be a directed graph of arbitrary cells. That means we can form 2D grids and other more exotic arrangements by numbering the cells and describing their relationships. At present, the schema doesn't represent the cells graphically:
DELIMITER //
CREATE SCHEMA chain //
USE chain //
CREATE TABLE cell (
  id INT PRIMARY KEY,
  user INT NOT NULL DEFAULT 0,
  count INT NOT NULL DEFAULT 0,
  max INT NOT NULL DEFAULT 2
) //
CREATE TABLE adj (
  `from` INT NOT NULL,
  `to` INT NOT NULL
) //
CREATE PROCEDURE inc (IN user0 INT, IN node INT, IN depth INT, IN total INT)
BEGIN
  DECLARE sum INT;
  DECLARE count0 INT;
  DECLARE max0 INT;
  DECLARE done INT DEFAULT 0;
  DECLARE buddy INT;
  DECLARE cur1 CURSOR FOR SELECT `to` FROM adj WHERE `from`=node;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done=1;
  #SELECT "1",user,node,depth,total;
  IF depth<64 THEN
    #SELECT "2";
    SELECT COUNT(*) INTO sum FROM cell;
    IF sum<total THEN
      SELECT count, max INTO count0, max0 FROM cell WHERE id=node LIMIT 1;
      #SELECT "3",count0,max0;
      IF count0<max0 THEN
        #SELECT "4";
        UPDATE cell SET user=user0, count=count0+1 WHERE id=node;
      ELSE
        UPDATE cell SET user=user0, count=1 WHERE id=node;
        OPEN cur1;
        REPEAT
          FETCH cur1 INTO buddy;
          IF NOT done THEN
            CALL inc (user0,buddy,depth+1,total);
          END IF;
        UNTIL done END REPEAT;
        CLOSE cur1;
      END IF;
    END IF;
  END IF;
END //
Writing the stored procedure wasn't easy but you can see my debug technique from the commented statements. Anyhow, to test this recursion, we'll create a trivial board with two cells:
INSERT INTO cell (id,user,count,max) VALUES (0,0,0,4) //
INSERT INTO cell (id,user,count,max) VALUES (1,0,0,4) //
INSERT INTO adj (`from`,`to`) VALUES (0,1) //
If you make repeated calls to the stored procedure and select results then you'll confirm that counters spill from cell 0 to cell 1:
CALL inc (1,0,0,256) //
SELECT * FROM cell //
However, the total number of counters in the system doesn't always increase. We'll fix that in a future article.
Saturday, June 03, 2006
Mandelbrot Fractals Using Stored Procedures
DELIMITER //
CREATE SCHEMA fractal //
USE fractal //
CREATE TABLE disp (
  id INT PRIMARY KEY,
  content VARCHAR(1000)
) //
CREATE TABLE shade (
  id INT PRIMARY KEY,
  value INT
) //
INSERT INTO shade (id,value) VALUES
  (0,ORD(' ')),(1,ORD('.')),(2,ORD(',')),(3,ORD('-')),(4,ORD('~')),
  (5,ORD('*')),(6,ORD(':')),(7,ORD(';')),(8,ORD('+')),(9,ORD('o')),
  (10,ORD('O')),(11,ORD('&')),(12,ORD('%')),(13,ORD('@')) //
CREATE PROCEDURE mandelbrot (IN x_max INT, IN y_max INT)
BEGIN
  DECLARE x INT DEFAULT 0;
  DECLARE y INT DEFAULT 0;
  DECLARE xf0 FLOAT;
  DECLARE yf0 FLOAT;
  DECLARE xf1 FLOAT;
  DECLARE yf1 FLOAT;
  DECLARE tf FLOAT;
  DECLARE l INT;
  DECLARE l_max INT;
  DECLARE buffer VARCHAR(1000);
  DECLARE tint INT;
  TRUNCATE disp;
  SELECT MAX(id) INTO l_max FROM shade LIMIT 1;
  WHILE y<y_max DO
    SET buffer='';
    SET x=0;
    WHILE x<x_max DO
      SET xf0=x*4.0/x_max-2.0;
      SET yf0=y*4.0/y_max-2.0;
      SET xf1=xf0;
      SET yf1=yf0;
      SET l=0;
      WHILE l<l_max AND xf1*xf1+yf1*yf1<4.0 DO
        SET tf=xf1*xf1-yf1*yf1+xf0;
        SET yf1=2.0*xf1*yf1+yf0;
        SET xf1=tf;
        SET l=l+1;
      END WHILE;
      SELECT value INTO tint FROM shade WHERE id=l;
      SET buffer=CONCAT(CHAR(tint),buffer);
      SET x=x+1;
    END WHILE;
    INSERT INTO disp (id,content) VALUES (y,buffer);
    SET y=y+1;
  END WHILE;
  SELECT content FROM disp ORDER BY id ASC;
END //
CALL mandelbrot (50,20) //
Saturday, May 20, 2006
Normalization Is Not Enough!
Some people think that normalization is the ultimate goal of databases. Pure bunkum. Your data should be structured to make typical queries tractable. A typical example is search engines. Normalization isn't sufficient if access time sucks.
Admittedly, search engines are a relatively trivial case. Regardless, query time contributes to user satisfaction and insertion time contributes to search coverage. If your schema handles both with ease then you're done. To achieve this, your schema will be normalized, or at the very least, strategically de-normalized.
This is very good but sometimes the simplest and most reasonable requests leave me stumped:
I want to do a simple query that displays categories and all it's nested subcategories in order. ... Is it possible to do this with one query or must I make a loop to get them all?
A similar request is to count subcategories of one or more categories. This is so it can be displayed in web navigation in the form "Apparel (200 products in 14 subcategories), Music (600 products in 48 subcategories)". Auction websites do it in real-time. Therefore, it is possible but it has to be a speedy operation.
Well, I knew part of the answer but not all. There's a standard technique for storing trees in database tables. Nodes are stored in one table. Each node is stored as one row. Each row has a primary key and a reference to its parent. So, each node points to its parent. So, if you want a branch of the tree then select all rows which point to a specific parent:
SELECT id, name FROM tree WHERE parent=17
The remainder of the problem (or a novel solution) was unknown to me. I thought about it for a while. Indeed, I was puzzled for days. I had a solution that worked for a finite depth but it wasn't fast. Nor did it cover the general case. My conclusion was that there was a trick but I didn't know the trick. This is "trick" in a specific context. Many apps are based on a trick which makes them possible. Wordprocessors use doubly linked lists to hold content. Spreadsheets use recursion to calculate cells. Real-time games are turn-based games where the turns are incredibly fast.
Once you discover the trick, all that remains is algorithms and ergonomics. As expected, the trick that I sought was known. A search revealed that Managing Hierarchical Data In MySQL by my colleague Mike Hillyer was the solution to this and many related problems. It also references a standard text on the subject: Trees And Hierarchies In SQL by Joe Celko. It appears that I have much to learn from this book.
Anyhow, the solution is to traverse the tree and label each node with sequential numbers. If numbers are assigned when descending and ascending the tree then each subcategory obtains a nested pair of numbers. So, for a trivial tree, a lone subcategory is numbered 2 and 3 while its parent is numbered 1 and 4. This numbering scheme is in addition to a primary key and a parent ID. It is also possible to explicitly assign depth. Therefore, you could perform the following query:
SELECT b.`id`, b.`depth`, FLOOR((b.`end`-b.`begin`-1)/2) AS subcats, b.`title`
  FROM `tree` AS a, `tree` AS b
 WHERE a.`id`=17
   AND a.`begin`<=b.`begin`
   AND a.`end`>=b.`end`
 ORDER BY b.`begin`;
This obtains an array of data which is ideal for rendering. However, inserting or moving branches is no longer trivial. For example, if multiple users can modify the tree structure then you'll require table locking. However, inserting categories is deemed to occur less frequently than inserting items within categories. So, it is worthwhile.
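To make the numbering concrete, here is the trivial tree described above as hypothetical rows. The column names are inferred from the query above and the titles are invented:

```sql
CREATE TABLE `tree` (
  `id`     INT PRIMARY KEY,
  `parent` INT,
  `depth`  INT,
  `begin`  INT,
  `end`    INT,
  `title`  VARCHAR(100)
);
-- A root category numbered 1 and 4, holding a lone subcategory numbered 2 and 3.
INSERT INTO `tree` VALUES
  (17, NULL, 0, 1, 4, 'Apparel'),
  (18, 17,   1, 2, 3, 'Hats');
```

Running the query with a.`id`=17 against these rows lists both: FLOOR((4-1-1)/2) gives the root one subcategory and FLOOR((3-2-1)/2) gives the leaf none.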
That was just the trivial case. Now try this case:
It would be much appreciated if any one could help me solve this problem that I've got. ... First of all my problem is that I need to create an online journey planner for able and disabled persons to tourist attractions in London. The user must specify what attraction they wish to visit, their starting station and if they are disabled on not.
We'll pick off a couple of simple tricks. Then we'll form a heinously complicated graph. Then we'll reduce the graph to the previously solved tree.
Alright, the tourist attraction can have a co-ord. The stations can also have co-ords. We can use Pythagoras to select the nearest destination station from either a full set or a restricted set. This finds the 10 nearest attractions, sorted by distance squared, from the arbitrary co-ord (500,300):
SELECT POW(x-500,2)+POW(y-300,2) AS `dist^2`, feature, url
  FROM attraction
 ORDER BY `dist^2` ASC
 LIMIT 10;
That alone is quite natty and can make a positive impact on people's lives. However, the next part - finding the route from the initial station to the destination station - is a very hairy problem. There's an implicit rule that we should find the quickest route. However, the definition of quickest is contentious. We can use a weighted graph to find the quickest route deterministically and without hard-coding rules to minimize changes. The time between each pair of stations is different; for version one, however, we assume they are all equal at three minutes. There's also a delay each time you change trains. If you want to get really pedantic, this period varies inversely with train frequency, and the penalty should differ for each change at each station. For version one, we'll assign a fixed weighting of 10 minutes. Stations also have temporal properties, which we'll defer to version two. Some stations are closed on certain days or late in the evening. For safety, others are exit-only at peak periods. This implies that return journeys may be vastly different to outbound journeys.
The graph of stations is a nightmare but it is surmountable. Some routes are circular. Some have unidirectional loops at one end. One route branches, joins itself and then fans out! This case means that you can go directly from A to B and from B to C but you cannot go from A to C without changing. We solve this case by defining a route as one or more segments. So, it is possible to change from one segment of a route to a different segment of the same route. Having segments also solves problems with the circular route. No-one travels a complete loop of a circle, so this can be broken into two or more overlapping horseshoe segments. Alternatively, you can define multiple circuits. Both options can give duplicate results.
If we define segments as unidirectional then we can solve more cases. In the trivial case, a route has two segments - one for each direction. The unidirectional property is only useful for exception cases. So, the route with one looped end can be defined in one segment rather than two. To summarize, we decompose routes into segments and find the optimal traversal of segments. But how do we traverse the segments?
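A sketch of how that decomposition might be stored; every table and column name here is an assumption, not a worked schema:

```sql
-- A route owns one or more unidirectional segments.
CREATE TABLE segment (
  id    INT PRIMARY KEY,
  route INT NOT NULL            -- the route this segment belongs to
);
-- Each segment visits stations in sequence.
CREATE TABLE segment_station (
  segment INT NOT NULL,
  seq     INT NOT NULL,         -- position of the station along the segment
  station INT NOT NULL,
  PRIMARY KEY (segment, seq)
);
```

A change of trains is then a hop between two segments sharing a station; the looped end of a route simply becomes a segment that visits the loop stations once, in order.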
We've got a weighted, directional, fully connected graph of possible journeys. Do we query it directly? No, we map it into a tree - a decision tree. Effectively, we reduce the problem to a previously solved case. So, the top branches of the tree are the initial stations and the destinations are the leaves of the tree. From our previous trick, we can locate any node within a specified bound - or locate the fastest journey directly. However, creating this tree requires one step of pre-processing. Whenever we update the graph, we have to re-build the tree. This occurs very infrequently during normal operation. Regardless, you need a pre-processing utility which finds the quickest direct and indirect journeys from all stations to all stations.
For end users, a typical query involves finding one branch of the tree. Once you know the ID of the destination, you recurse backwards through the route to find parent IDs and descriptions. Where the maximum number of changes is known (such as this case), you can de-normalize and obtain all details from one row. Furthermore, this row can be indexed with a compound key of initial station and destination station. Furthermore, this can be combined in a single select with pre-processed co-ord matching and can even give details to and from alternative stations.
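As a sketch of that de-normalized row, assuming a maximum of two changes per journey (all names are hypothetical):

```sql
-- One pre-processed row per (origin, destination) pair.
CREATE TABLE journey (
  origin      INT NOT NULL,
  destination INT NOT NULL,
  minutes     INT NOT NULL,          -- total weighted journey time
  change1     INT,                   -- first interchange station, if any
  change2     INT,                   -- second interchange station, if any
  PRIMARY KEY (origin, destination)  -- the compound key described above
);

-- An end-user lookup is then a single indexed read.
SELECT minutes, change1, change2
  FROM journey
 WHERE origin=42 AND destination=7;
```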
Does this trick work in the general case? No. For n nodes, it requires n^2 rows in one table. Furthermore, if you choose to de-normalize and the maximum number of interchanges is high then rows can be quite verbose. However, where complexity is bounded and updates are infrequent, mapping a graph to a special tree is an ideal solution.
In all cases (search engine, subcategories, journey planner) the database may be normalized. However, queries are only rapid if the data is suitably structured. Such structuring may not be obvious.
- Mornington Crescent - A work of collective fiction which pretends that making moves on the graph of the London Underground is more challenging and studied than chess. The name is derived from the objective of finishing on the most tortuously connected node of the graph.
Thursday, May 11, 2006
To BLOB Or Not To BLOB
A topical question is whether BLOBs should be served from a database. My colleague, Jim Starkey, invented the term BLOB and, understandably, is enthusiastic about the concept. However, others are concerned about the practical overhead. So, after some discussion, the answer is maybe, depending on your application. Here's a guide for whether to hold your BLOBs in or out of a database:
- If your BLOBs are *huge* (hundreds of megabytes) then use a filing system.
- If your BLOBs are tiny (sector size or smaller) then use a database.
- If you need scale-out then you've got a borderline case. Keep reading.
- If you've got a huge number of BLOBs then you've got another borderline case.
- If you want concurrency then you've got another borderline case.
- If you want relational structure (categories, tags, RSS) then use a database.
- If you want historical copies then use a database.
- If you'll only access by a hierarchical namespace or a trivial flat namespace then use a filing system.
Rule 1: Huge BLOBs
This is a matter of practicality. If you're serving disk images from a database then lumping multiple disk images into one database creates problems for backup which may impinge on unrelated applications. Obviously, the absolute BLOB size threshold will rise with increasing media capacity. However, the practical aspect remains. Don't put multiple disk images in a database table.
Rule 2: Tiny BLOBs
This is a forte of databases. Filing systems are generally inefficient at storing small BLOBs whereas databases are very efficient at packing rows. Ergo, pack your BLOB in a row and use your database as a smarter, improved media manager. Plus, you get nifty features for free, such as indexing and replication. If even the *average* size is smaller than one or two sectors then this makes perfect sense.
Admittedly, some filing systems, such as ReiserFS, are very good at handling small BLOBs by creating inodes dynamically and allowing sectors to be shared. Unfortunately, the presence of such a filing system cannot be guaranteed, whereas you can insist that your application interfaces to a specific implementation of SQL database. So, that's the current practice.
Rule 3: Scale-Out
This is more contentious. Some insist that BLOBs be served from a database for maximum scale-out. Others say that the increased database load creates unnecessary pressure to scale-out. Some say that it is best to handle this data as static content and serve it from the presentation layer. Then others question the static nature of those BLOBs; serving them as static content also precludes access rights.
I've seen the details of this argument while involved in yet another J2EE PetStore development. For those blissfully ignorant of PetStore, it is a reference 100% Pure Java, J2EE, ECommerce application which has been used as the basis of at least 200 ECommerce applications. Most are glorified shopping baskets, like the original app. All are incompatible. However, some have innovative features, such as very tight integration with couriers.
Anyhow, the management interface of the reference implementation allows product images to be uploaded. It places the uploaded images in the filing system. Do you see the problem yet? This is the filing system of the application layer and not the presentation layer. Therefore, it works on one server but it will fail in a scale-out environment, such as a WebLogic cluster. My suggested solution - a static content server mounted by all application servers - was rejected on many counts. Firstly, it was deemed unduly onerous. Secondly, it was deemed to be a single point of failure. Thirdly, it was deemed to be a security problem. Fourthly, it was deemed not to be in the spirit of 100% Pure Java. Fifthly, there were erroneous objections to mounts and paths being platform specific, despite the application already specifying such paths.
The scale-out solution is to write an EJB [Enterprise Java Bean] which serves BLOBs. This cunningly sidesteps special configuration and associated security problems. Unfortunately, it ignores code bloat and associated problems with code quality. If you're lucky, you'll get one EJB to serve BLOBs and one EJB to manage BLOBs, which use their descriptors to locate a database facade via JNDI. If you're unlucky, you'll get an EJB to serve each type of BLOB, plus management EJBs with messaging and XML facades using their own DTDs; furthermore, each EJB creates its own database connection and the descriptors only work in one enterprise environment. I saw the latter.
The underlying code wasn't brilliant either. It used the first version of Oracle's 100% Pure Java JDBC driver. Unfortunately, BLOBs were unimplemented (doh!) and CLOBs were capped at 4KB. So, it was a case of choose any two: Oracle9i, BLOBs in the database for scale-out and/or 100% Pure Java. Well, the latter was dropped on the basis that an improved connector should become available.
Matters are improved nowadays. However, if you want BLOBs in a database for scale-out then it won't solve your problems. It will only give you a different set of problems.
Rule 4: Huge Number Of BLOBs
If you've got a huge number of BLOBs then you'll either require too many inodes from your filing system or you'll have huge database tables. Both are problematic. Trivial cases are obvious candidates for using a filing system. If you're serving multimedia on the web then handle as static content and use DNS hacks to scale-out. Rich cases should consider properties in more detail.
Rule 5: Concurrency
If multiple users will be manipulating BLOBs then database involvement is obvious. However, this doesn't preclude references to stuff in a filing system. If updates come from multiple sources then the most suitable handling of BLOBs remains undecided.
Rule 6: Relational Structure
If you need to do relational stuff then use a database. If you want to tag items in multiple categories or access by criteria, such as dates, then use a database. A typical example is blogging. If you want to implement an RSS feed which sends the titles of the most recent articles then index your datestamps. This makes selecting the most recent articles a trivial database operation.
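A sketch of that arrangement, using a hypothetical article table:

```sql
CREATE TABLE article (
  id     INT PRIMARY KEY AUTO_INCREMENT,
  title  VARCHAR(200) NOT NULL,
  posted DATETIME NOT NULL,
  body   TEXT,
  INDEX (posted)               -- the datestamp index mentioned above
);

-- The ten most recent titles for the feed.
SELECT title, posted
  FROM article
 ORDER BY posted DESC
 LIMIT 10;
```

With the index on posted, the server can read the ten newest rows straight off the index instead of sorting the whole table.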
If you want access rights and logging then this could be provided by either a database or specific filing systems. However, using a database requires access via an application. This allows you to implement your own rights and logging with the database implementation as a fallback. A filing system may or may not provide these facilities.
All of this assumes that your BLOBs aren't huge (Rule 1). If they are, store references instead. A 64KB limit isn't huge.
Rule 7: Historical Copies
Hold historical data inside a database. This eliminates filing system clutter. Again, this assumes that your BLOBs aren't huge (Rule 1).
However, don't index a status column if the majority of rows will eventually hold the same value. The cardinality will initially be varied, making it an ideal index for retrieval. However, as time progresses, the cardinality will fall and the index will eventually be ignored.
Rule 8: Hierarchical Access
If you've got this far then you don't have awkward BLOB sizes or quantities, or scale-out concerns. You don't require access rights, relational structure or historical copies. So use your filing system!