In my job, I occasionally need to poke inside the Microsoft Office Open XML formats, either to extract data from a document or to save XML as a presentation or spreadsheet. I am currently focusing on Excel spreadsheets, though what I say should apply to PowerPoint and Word documents as well.
An Excel .xlsx, PowerPoint .pptx, or Word .docx file is actually a zip archive made up of a set of XML documents. In fact, if you change the file's extension to .zip and then unzip it, you can see the folder structure and the XML documents inside it.
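You don't even need to rename the file to see this. Here is a minimal sketch using Python's standard zipfile module; the function name list_parts and the file name workbook.xlsx are made up for illustration, and the exact set of parts you get back depends on the document.

```python
import zipfile


def list_parts(path):
    """Return the names of the files stored inside an Office Open XML package.

    `path` may be a file name (e.g. a .xlsx, .pptx, or .docx file) or an
    open binary file-like object; zipfile.ZipFile accepts either.
    """
    with zipfile.ZipFile(path) as package:
        return package.namelist()


# Hypothetical usage, assuming a file named workbook.xlsx exists:
#
#     for name in list_parts("workbook.xlsx"):
#         print(name)
```

For a spreadsheet, the list typically includes entries such as [Content_Types].xml and xl/workbook.xml, which matches the folder structure you see after unzipping by hand.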
Some text editors let you open an Office Open XML file and view or edit the individual XML documents inside it. Several commercial XML editors provide full editing support. However, I discovered that the free TextWrangler editor (Mac) lets you view Office Open XML documents in read-only mode.
Finding an actual XML markup specification for SpreadsheetML and the other Office Open XML formats is not exactly straightforward. The specifications exist, but it takes a lot of digging to find useful documentation. The best source I found was the Open XML SDK 2.0 for Microsoft Office, whose SDK docs cover the highly cryptic-looking XML elements and attributes contained in these files. It also has a semi-useful “Productivity Tool” that you can use to compare two Office Open XML documents and browse the actual XML documents inside them. The strange thing is that, while you can view a document tree of the XML docs, you can’t seem to view the actual raw XML outside of the file comparison mode.
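If all you want is a look at the raw XML of one part, a few lines of Python can stand in for an editor or a comparison tool. This is only a sketch built on the standard zipfile and xml.dom.minidom modules; the helper name read_part and the file name workbook.xlsx are hypothetical, and xl/workbook.xml is used as an example part name because it is the usual location of the workbook part in a spreadsheet.

```python
import zipfile
from xml.dom import minidom


def read_part(path, part_name):
    """Return one XML part of an Office Open XML package as indented text.

    `path` is a file name or binary file-like object; `part_name` is the
    path of the part inside the zip archive, e.g. "xl/workbook.xml".
    """
    with zipfile.ZipFile(path) as package:
        raw = package.read(part_name)  # bytes of the stored XML document
    # The XML is usually written on a single line; re-indent it so that
    # the element structure is readable.
    return minidom.parseString(raw).toprettyxml(indent="  ")


# Hypothetical usage, assuming a file named workbook.xlsx exists:
#
#     print(read_part("workbook.xlsx", "xl/workbook.xml"))
```

This only reads the package. Nothing is written back, so the original file is left untouched.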
