Wednesday, August 29, 2007

Follow up on LOAD XML

A few weeks ago (time flies..) i posted a short note about my LOAD XML contribution. It has actually been available for some time, veven if it is not so easy to find. It is included in MySql 5.2.5:
download
documentation - quite good I think!

It is a bit sad that it is included in 5.2 (still in alfa) and not in 5.1. Since it is a separate command it will not affect existing installations and would not be risky. That faster it gets included the faster we will get feedback on it, which is what I believe is needed right now. Is this something that MySql users need and will use? Or will they use other tools to read XML data into their MySql databases?

LOAD XML reads data into a MySql table in much the same way as LOAD DATA does and in fact the implementation is based on LOAD DATA. This means that it does not parse the entire XML document ( I belive MySql xml function ExtractValue() does that) but instead reads one row at a time from the xml file and inserts it into the database. This makes reading of large xml files possible at quite a decent speed, althoug not as fast as LOAD DATA.

The mapping between XML and the table is based on two assumptions:
- there should be an XML tag that corresponds to a record in the table. You name this tag in the 'ROWS IDENTIFIED BY' clause
- both tags and attributes are matched automatically with column names and values
inserted into the tables. Tags and attributes that do not match columns are simply ignored

When I implemented this the scenario it was designed for is an integration scenario, where you receive data in an XML file periodically and dump it into your database.

There are of course limitations you should be aware of:
- CDATA is not supported and binary fields can not be handled
- it can only load data from a file
- the XML is not verfied, an illegal XML could very well be processed, no validation with xml schema or DTD is made
- there is no namespace support

All in all I think this could be quite a useful function. Arnold Daniels, who worked with me on this contribution initially, is planning some improvements. Among other things to make this work with XML data that is not read from a file.

Personally I will perhaps make some minor fixes, but probably start on something entierly different. Don't know what yet, your suggestions are welcome....

Monday, August 13, 2007

LOAD XML contribution added to MySQL

I got news from the MySQL people that my LOAD XML contribution is added to release 5.2. Check the manual page.

The aim of the contribution is to simplify reading of XML data into a MySQL database. The LOAD XML command supports three different xml formats:
- field values as attributes
- field values as tags
- the format produced by the mysql -x command, with the field name as an attribute and field value as a tag

If the LOAD XML command finds a matching field in the target table, the value is inserted, otherwise it is ignored. If you have used the LOAD DATA command (most of us have) you should recognize much of the functionality, LOAD XML works much the same way.

Monday, August 06, 2007

More on XML output from MySql

In a previous post I described how you could create XML output from SQL using stored functions. But that was a flat structure, with only one table and simple XML structure with a tag for every row in the result set, and column values as attributes or sub tags. Part of the real power of the XML format is that it supports more complicated structures, with hierarchical data. If you need to produce XML for integration with another application you will probably have to produce a XML with data from several tables, and a hierarchical structure. To accomplish this in MySQL we can use the GROUP_CONCAT function, combined with our XML stored functions.

An example:
We take our select statement from the previous post, that gives us a list of tables in xml format:

select xml_tag('table',null, concat(xml_attr('name',t.table_name),
xml_attr('engine',t.engine)),null)
from information_schema.tables t
where t.table_schema = 'INFORMATION_SCHEMA'


We then add a join with the columns table to get column data:
inner join information_schema.columns c on c.table_name = t.table_name and c.table_schema = t.table_schema

and replace the last parameter of the xml_tag function call, the one for the subtags with a new xml_tag function call to produce a column tag:
xml_tag('column',c.column_name,null,null)


This gives us:
select xml_tag('table',null, concat(xml_attr('name',t.table_name),
xml_attr('engine',t.engine)),xml_tag('column',c.column_name,null,null))
from information_schema.tables t
inner join information_schema.columns c on c.table_name = t.table_name and c.table_schema = t.table_schema
where t.table_schema = 'INFORMATION_SCHEMA'


Running this gives us:

<table name="CHARACTER_SETS" engine="MEMORY" >
<column>CHARACTER_SET_NAME</column></table>
<table name="CHARACTER_SETS" engine="MEMORY" >
<column>DEFAULT_COLLATE_NAME</column></table>
<table name="CHARACTER_SETS" engine="MEMORY" >
<column>DESCRIPTION</column></table>
<table name="CHARACTER_SETS" engine="MEMORY" >
<column>MAXLEN</column></table>
<table name="COLLATIONS" engine="MEMORY" >
<column>COLLATION_NAME</column></table>
etc...


OK, this is some kind of structure, column tags inside the table tag. But still we have one table tag for every column, but we want one for every table, with all the column tags inside. To accomplish this we add a group by clause, and use the GROUP_CONCAT function for the column tags. This gives us the following SELECT:
select xml_tag('table',null, concat(xml_attr('name',t.table_name),
xml_attr('engine',t.engine)),GROUP_CONCAT(xml_tag('column',c.column_name,null,null) separator ''))
from information_schema.tables t
inner join information_schema.columns c on c.table_name = t.table_name and c.table_schema = t.table_schema
where t.table_schema = 'INFORMATION_SCHEMA'
group by t.table_name, t.engine


And the following result:
<table name="CHARACTER_SETS" engine="MEMORY" >
<column>CHARACTER_SET_NAME</column>
<column>DEFAULT_COLLATE_NAME</column>
<column>DESCRIPTION</column>
<column>MAXLEN</column>
</table>
<table name="COLLATIONS" engine="MEMORY" >
<column>COLLATION_NAME</column>
<column>CHARACTER_SET_NAME</column>
<column>ID</column>
<column>IS_DEFAULT</column>
<column>IS_COMPILED</column>
<column>SORTLEN</column>
</table>
etc

OK, this is what we wanted! Just one small problem left... When I run this on my laptop I get the error 1260 with the following message:
%d line(s) were cut by GROUP_CONCAT()
And the result is only 7 rows. This is because the group_concat_max_len has a value that is too small. You can reset it with the command:

SET [SESSION | GLOBAL] group_concat_max_len = val

I set it with the command:
SET GLOBAL group_concat_max_len = 2048

I then rerun the command, and there is no error, and 23 records.

Code for the stored functions is available at MySql forge.