## Tree serialization and deserialization

http://www.sureinterview.com/shwqst/108006

## General idea

The tree serialization is to save its data structure into a flat file, from which the original tree can be recovered (deserialized).

There are a few ways to serialize a tree.

### 1 Serialize the nodes

A common method of serialize a tree or graph is to save the node values and their relationship. For example, given the tree on the right. Each node can be represented by 4 elements, (#, value, #left, #right). By listing all notes, the tree can be serialized as:

0, "a", 1, 2 1, "b", 3, ^ 2, "c", ^, ^ 3, "d", ^, ^

Actually, this is also the general way to serialize a graph. Following code draws this example tree in DOT language. The graph (tree) breaks down into nodes and edges, which can have different properties.

```
```
- digraph{
- node [style=bold, label="\N", shape=record];
- node0 [label="<l> | <v>A | <r>"];
- node1 [label="<l> | <v>B | <r>"];
- node2 [label="<l> | <v>C | <r>"];
- node3 [label="<l> | <v>D | <r>"];
- node0:l -> node1;
- node0:r -> node2;
- node1:l -> node3;
- }

During the serialization process, we need to assign each node a unique id. To efficiently look up that id, the mapping from object to id can be saved into a hash table. If the tree is stored in a list or array, the index of the node can be used as the id instead. Similar hash table mapping from id to object can be used in deserialization.

### 2 Tree traversal

The traversal result of a tree also implicitly determine its topology. For example, the pre-order traversal uniquely describe the example tree above as.

```
```
- a b d ^ ^ ^ c ^ ^

The “^” is for empty node, which is necessary to recover the tree. Because a tree with n nodes has n+1 empty nodes, to save space, multiple empty nodes can be noted as ^#. For example, the above serialized result can be revised to:

```
```
- a b d ^3 c ^2

Check a slide for serializing trees.

A combination of preorder and inorder traversal can also uniquely describe a tree. The example tree has preorder and inorder traversal as follows:

```
```
- a b d c //preorder
- c b a d //inorder
- a (b c) [d] //preorder
- (c b) a [d] //inorder

The first node in preorder list is the root. By searching the root in the inorder list, the traversals are partitioned into left branch and right branch. The branches can be recovered in the same way recursively. This method is not efficient in term of time and space complexity. Saving tree size is better but is still not quite efficient.

### 3 Newick format

Newick format is widely used in bioinformatics. It is a human readable format but needs more effort to serialize and deserialize.

Following code is to deserialize the tree in format of *value(<left branch>,<right branch>)*. The example above is serialized as**0(1(2,),3)**.

```
```
- private BNode makeTreeRec() throws Exception {
- // 1. setup
- BNode root = new BNode();
- char c = treeStr[pos];
- root.value = 0;
- while (c >= '0' && c <= '9') {
- root.value *= 10;
- root.value += c - '0';
- c = treeStr[++pos];
- }
- // 2 left, root, right
- if (c == '(') {
- c = treeStr[++pos];
- if (c != ',') {
- root.left = makeTreeRec();
- }
- if (treeStr[pos] != ',')
- throw new Exception("',' expected at position " + pos);
- c = treeStr[++pos]; // skip ","
- if (c != ')') {
- root.right = makeTreeRec();
- }
- if (treeStr[pos] != ')')
- throw new Exception("')' expected at position " + pos);
- ++pos; // --> )
- }
- return root;
- }

### Code

Code can be found at: http://code.google.com/p/sureinterview/source/browse/src/lib/tree/TreeUtil.java#172