Convert from XML to JSON and from JSON to XML. Simple and robust.

It converts any valid XML to a Python dict and back to XML. It also converts any reasonable dict to XML and back to a Python dict.

For the duration of this read, let’s pretend JSON == Python dict. As long as we don’t try to JSON-encode a dict containing some (unserialisable) objects, this will hold true. Also, we shouldn’t care (deeply) about the order of attributes. Now for the fun…
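
To make the “JSON == Python dict” pretence concrete, here is a small sketch using the standard json module. The helper name survives_json is made up for this illustration; it is not part of the converter.

```python
import json

def survives_json(doc):
    """Return True if doc comes back unchanged from a JSON round trip."""
    return json.loads(json.dumps(doc)) == doc

# A dict made only of strings, lists and dicts round-trips cleanly.
plain = {'root': {'elem': ['element text', 'another element text']}}
print(survives_json(plain))

# An arbitrary Python object cannot be JSON-encoded at all.
try:
    json.dumps({'root': {'elem': object()}})
except TypeError as exc:
    print('cannot encode:', exc)
```

As long as we stay with the first kind of dict, we can treat the two as the same thing.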

In XML, an element can have any number of attributes and any number of child-elements. This gives XML its flexibility, and at the same time keeps things in check by defining this simple and strict rule.

A Python dictionary (a.k.a. dict) is somewhat more limited: it consists of key-value pairs only. The “value” part can be anything, e.g. another dict. The key is most often a string; any “hashable” object will do, but that is rarely used.

There is one fundamental difference between these two formats. XML elements (“tags” if you will) can have two types of information: attributes and child-elements. A dict only has key-value pairs, that is, only one type of information. This fundamental difference complicates making a universal converter between the two.

In order to convert from one to the other, we have to come up with a way to add one additional “type of information” to elements in a dict, one additional “dimension”, “degree of freedom”… There are many implementations on the Web that do something like this:

<tag attr="value">text value</tag>

{'tag' : {
    'attributes' : {'attr' : 'value', },
    'text' : 'text value'
}}

An approach like this can work just fine, but it has several shortcomings. Mainly, the structure of the dict has to be quite rigid. That makes this an OK “xml2json” converter, but a lousy “json2xml” one. In this example, we gain the additional type of information by proclaiming the “attributes” and “text” keys to have special meaning. Note that we didn’t consider child-elements, for the sake of simplicity.
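
For illustration, the rigid layout could be produced with something like the sketch below. It uses the standard xml.etree.ElementTree module; the function name element_to_rigid_dict is hypothetical, and child-elements are ignored, just as in the example above.

```python
import xml.etree.ElementTree as ET

def element_to_rigid_dict(elem):
    """Convert one childless element to the 'attributes'/'text' layout."""
    return {elem.tag: {'attributes': dict(elem.attrib), 'text': elem.text}}

elem = ET.fromstring('<tag attr="value">text value</tag>')
print(element_to_rigid_dict(elem))
# {'tag': {'attributes': {'attr': 'value'}, 'text': 'text value'}}
```

Going the other way, every dict has to follow exactly this shape, which is what makes the approach awkward for hand-written JSON.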

The xml2json2xml converter takes a somewhat different approach to gain that additional type of information. As mentioned earlier, a dict key can be any hashable object. An XML attribute name, on the other hand, has to be one nice word. In particular, an XML attribute name can’t begin with a space (“ ”), while a dict key can. We can use this difference to create the additional degree of freedom: all keys starting with a space are translated into XML attributes, other keys are translated into elements. For example:

<tag attr="value"><elem>element text</elem></tag>

{'tag' : {
    ' attr' : 'value',                     # note the space before "attr"
    'elem'  : 'element text',
} }

A careful reader will notice that here we actually use two different rules for specifying an XML element. The two elements (“tag” and “elem”) have very different values. The former has a dict for a value, while the latter has a string! OK, I cheated a little, but stay tuned, it will pay off…
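
Here is a minimal sketch of how the two rules could be applied when going from dict to XML. It is not the converter's actual code: the function name build is made up, and it relies only on the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

def build(tag, value):
    """Build an element from a tag name and its dict-or-string value."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, val in value.items():
            if key.startswith(' '):
                # Leading space: the key names an attribute.
                elem.set(key[1:], str(val))
            else:
                # Any other key names a child-element.
                elem.append(build(key, val))
    else:
        # A plain value becomes the element's text.
        elem.text = str(value)
    return elem

doc = {'tag': {' attr': 'value', 'elem': 'element text'}}
root_tag, root_value = next(iter(doc.items()))
xml = ET.tostring(build(root_tag, root_value), encoding='unicode')
print(xml)
# <tag attr="value"><elem>element text</elem></tag>
```

The string-valued “elem” and the dict-valued “tag” go through the same function; the type of the value decides which rule applies.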

We still have to figure out a way to specify a text value and attributes on one element. For this we can once again use the very same trick: special keys that cannot exist in XML anyway. For specifying the text value, I found the empty string (“”) to be the most elegant solution. With this piece of the puzzle, the converter is almost complete:

<tag attr="value">text value</tag>

{'tag' : {
    ' attr' : 'value',
    ''      : 'text value',
}}

If instead* of a text value our element should have some child-elements, then we would not use the “” key and we would specify the child-elements within the main dict (see “elem” in the previous example).
* XML supports situations where one element has both text values and child-elements; we will address this later on (but will not be able to support it 100%).
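
The opposite direction, from XML to dict, can be sketched with the same conventions. This is an illustration only: the names to_value and xml_to_dict are hypothetical, repeated child tags would overwrite each other here, and any text that follows a child-element is silently dropped.

```python
import xml.etree.ElementTree as ET

def to_value(elem):
    """Return a plain string for text-only elements, otherwise a dict."""
    text = (elem.text or '').strip()
    if not elem.attrib and len(elem) == 0:
        return text
    # Attributes get a leading space in their key.
    value = {' ' + name: val for name, val in elem.attrib.items()}
    for child in elem:
        value[child.tag] = to_value(child)
    if text:
        # The empty-string key holds the element's own text.
        value[''] = text
    return value

def xml_to_dict(xml_text):
    root = ET.fromstring(xml_text)
    return {root.tag: to_value(root)}

print(xml_to_dict('<tag attr="value">text value</tag>'))
# {'tag': {' attr': 'value', '': 'text value'}}
```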

Even with just these definitions, our converter can be useful in a variety of cases. However, there are still some things that we have to address.

Repeating elements – in XML there is no reason why an element shouldn’t have several identical child-elements:

<root>
    <elem>element text</elem>
    <elem>another element text</elem>
    <elem>yet another element text</elem>
</root>

Python dicts, however, must have unique keys. Values can be anything. So let’s make the value a list of elements:

{'root': {
    'elem': [
        'element text',
        'another element text',
        'yet another element text',
        ],
}}

This corresponds to the XML example above. We could exchange any one of the “element text” strings for a dict describing a sub-subelement.
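
Putting the pieces together, a dict-to-XML sketch that also understands lists might look like this. Again, this is an assumption-laden illustration rather than the converter itself: fill and dict_to_xml are made-up names, and a list nested directly inside a list is not handled properly.

```python
import xml.etree.ElementTree as ET

def fill(elem, value):
    """Populate elem from a dict (attributes, text, children) or a plain value."""
    if isinstance(value, dict):
        for key, val in value.items():
            if key == '':
                elem.text = str(val)            # empty key: text value
            elif key.startswith(' '):
                elem.set(key[1:], str(val))     # leading space: attribute
            else:
                # A list means "repeat this tag once per item".
                items = val if isinstance(val, list) else [val]
                for item in items:
                    fill(ET.SubElement(elem, key), item)
    else:
        elem.text = str(value)
    return elem

def dict_to_xml(doc):
    root_tag, root_value = next(iter(doc.items()))
    return ET.tostring(fill(ET.Element(root_tag), root_value), encoding='unicode')

doc = {'root': {'elem': ['element text', 'another element text']}}
print(dict_to_xml(doc))
# <root><elem>element text</elem><elem>another element text</elem></root>
```

Each list item goes through the same fill function, so an item can be a string or a dict with its own attributes and children.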

This is as far as we go with this converter. Let’s summarize what we have not achieved:

  • only a “well behaved” JSON dict can be converted to XML – these are the restrictions:
    • only keys and values that can be converted to strings
    • no list-in-list situations
  • order is not preserved in some cases
    • in cases where a series of same-tag elements is interrupted by a different element, the dict will gather all same-tag elements into one list, thus losing the original ordering
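
The ordering problem is easy to reproduce. The sketch below groups child-elements by tag the way a dict forces us to; group_children is a hypothetical helper written for this example, and it only looks at the text of each child.

```python
import xml.etree.ElementTree as ET

def group_children(elem):
    """Collect child texts by tag, turning repeated tags into lists."""
    grouped = {}
    for child in elem:
        text = child.text or ''
        if child.tag not in grouped:
            grouped[child.tag] = text
        elif isinstance(grouped[child.tag], list):
            grouped[child.tag].append(text)
        else:
            grouped[child.tag] = [grouped[child.tag], text]
    return grouped

root = ET.fromstring('<root><a>1</a><b>2</b><a>3</a></root>')
original_order = [child.tag for child in root]
grouped = group_children(root)

# Order in which the tags would be written back out from the dict.
regenerated_order = [
    tag
    for tag, val in grouped.items()
    for _ in (val if isinstance(val, list) else [val])
]

print(original_order)     # ['a', 'b', 'a']
print(grouped)            # {'a': ['1', '3'], 'b': '2'}
print(regenerated_order)  # ['a', 'a', 'b']
```

The content survives, but the second “a” has moved in front of “b”.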

 

Reference examples:

The simple element:
{ 'tag' : 'value' }

<tag>value</tag>

Attribute:
{ 'tag' : { '_attr' : 'value' } }

<tag attr="value"/>

Text and attribute:
{ 'tag' : { '' : 'text value', '_attr' : 'value' } }

<tag attr="value">text value</tag>

Attributes and children:
{'tag': {
    '_attr': 'value',
    'ch1': 'text1',
    'ch2': 'text2',
}}

<tag attr="value">
    <ch1>text1</ch1>
    <ch2>text2</ch2>
</tag>

Repeating elements:
{'tag': {
    '_attr': 'value',
    'ch1': [
        'text11',
        'text12',
        'text13',
        ],
    'ch2': 'text2',
}}

<tag attr="value">
    <ch1>text11</ch1>
    <ch1>text12</ch1>
    <ch1>text13</ch1>
    <ch2>text2</ch2>
</tag>

Some combined stuff:
{'tag': {
    '_attr': 'value',
    'ch1': [
        {'': 'text11', '_attr11': 'val11'},
        {'': 'text12', '_attr12': 'val12'},
        {'': 'text13', '_attr13': 'val13'},
    ],
    'ch2': 'text2',
}}

<tag attr="value">
    <ch1 attr11="val11">text11</ch1>
    <ch1 attr12="val12">text12</ch1>
    <ch1 attr13="val13">text13</ch1>
    <ch2>text2</ch2>
</tag>

TODO:

  • link to bitbucket