Based on our research into the ZIP file format, we discovered that the first take of the file format was so fat because ZIP is really inefficient at bundling together a ton of tiny entries. It’s sad, but true. It’s particularly sad because having a ton of tiny entries enabled some cool things in terms of memory and processing efficiency.
Well, we decided that the memory and processing efficiencies weren’t worth the overhead. So the new file format suggests only a handful of bigger entries instead of a ton of little entries. In fact, the GEDCOM 5 conversion utility only uses one entry.
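To get a feel for why lots of tiny entries hurt, here is a minimal sketch in Python (not the conversion utility's actual code). It zips the same made-up data two ways: once as thousands of tiny entries and once as a single entry. The record contents, entry names, and the `zip_size` helper are all hypothetical and chosen only for illustration.

```python
import io
import zipfile


def zip_size(entries):
    """Return the size in bytes of an in-memory ZIP holding (name, data) entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return len(buf.getvalue())


# Hypothetical records: small XML-like snippets, one per person.
records = [
    f'<person id="p{i}"><name>Person {i}</name></person>\n'.encode()
    for i in range(2000)
]

# Layout 1: one ZIP entry per record (a ton of tiny entries).
many = zip_size((f"persons/p{i}.xml", rec) for i, rec in enumerate(records))

# Layout 2: every record concatenated into one bigger entry.
single = zip_size([("persons.xml", b"".join(records))])

print(f"many tiny entries: {many} bytes")
print(f"one big entry:     {single} bytes")
```

Every ZIP entry carries its own local header and central directory record (at least 76 bytes of fixed fields, plus the entry name stored twice), and each entry is compressed on its own, so the compressor never sees the repetition across entries. Both effects add up fast when the entries are tiny.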
These changes generally make a GEDCOM X file significantly smaller than the equivalent GEDCOM 5 file. In our example, the original GEDCOM 5 file is 15 MB. The converted GEDCOM X file was 87 MB using the old utility and is 3.1 MB using the new utility. Here’s the zipinfo output from running the GEDCOM 5 file through the old conversion utility (version 0.1.0) and the new conversion utility (version 1.0.0.M1):
GEDCOM X 0.1.0
GEDCOM X 1.0.0.M1
It’s not as loud.
The resulting file from the old conversion utility was really loud. There was a ton of redundant XML boilerplate, and we were overusing XML namespaces. In short, the XML was ugly.
We’ve since cleaned that up. Here’s what the XML looks like using the old conversion utility (version 0.1.0) and the new conversion utility (version 1.0.0.M1):
GEDCOM X 0.1.0
GEDCOM X 1.0.0.M1
So did you notice that we’re sticking with XML? As I mentioned before, we’re very aware that there are more efficient serialization formats out there. That’s cool and all, but we’ve decided to just stick with XML. For those of you who are interested in the whole discussion about serialization formats, you’re welcome to read through the thread we opened.
Persistent Issues
We are well aware that there are still some holes that need to be filled in on the conversion utility. It’s still a Java utility that hogs a lot of memory. It doesn’t yet handle all the GEDCOM 5 tags or all the notes. I’m sure it still crashes on some GEDCOM files. It would be nice to have a GUI for the utility. And I’m sure you’ll find a lot more issues, too.
We’d like to address these issues and any others you might find, so we encourage you to open up an issue so we can track that work. In the meantime, this should give you a pretty good idea of what we’ve got today.