Monday, January 08, 2007

WARNING: THIS BLOG IS TOO LONG AND TOO TECHNICAL :-)


How to improve interoperability: A case study with sections


Q: So --- OpenOffice.org Writer has section support. Microsoft Word has section support. Where's the problem?
A: In the details.

OpenOffice.org section


OpenOffice.org sections are very similar to the HTML MULTICOL element and to the concepts of the CSS3 Multi-column layout module: a sequence of paragraph-level content such as paragraphs and tables can be grouped together in a <text:section>, and a multi-column layout can be requested for that group.

Sections in a Writer document have the following form:

WRITERDOC ::= (PARAGRAPH | TABLE | WRITERSECTION)+
WRITERSECTION::= <text:section> (PARAGRAPH | TABLE | WRITERSECTION)+ </text:section>
TABLE ::= TABLEROW+
TABLEROW ::= TABLECELL+
TABLECELL ::= (PARAGRAPH | TABLE | WRITERSECTION)+

So WRITERSECTIONs in OpenOffice.org can start and end anywhere and can be nested.

Microsoft Word sections



Microsoft Word sections are different. Conceptually every Microsoft Word document consists of at least one section. So Microsoft Word documents have the form

WORDDOC ::= WORDSECTION+
WORDSECTION ::= (PARAGRAPH | TABLE)* PARAGRAPH[ with section properties attached to it]

Please note that Microsoft Word sections are always “top-level” and that only a paragraph can trigger a new Microsoft Word section, which then starts after the paragraph itself.

So what does this mean for conversion?



a) Every WORDDOC can be mapped to a WRITERDOC [not quite true for other reasons, but let's forget about this detail :-)];
b) *NOT* every WRITERDOC can be mapped to a WORDDOC.

Consider the following WRITERDOC:

PARAGRAPH
<text:section>
PARAGRAPH
<text:section>
PARAGRAPH
PARAGRAPH
</text:section>
PARAGRAPH
</text:section>
PARAGRAPH

The above WRITERDOC cannot be mapped to a WORDDOC. OpenOffice.org will change the structure on export and write a WORDDOC like

PARAGRAPH + section props
PARAGRAPH + section props
PARAGRAPH
PARAGRAPH + section props
PARAGRAPH + section props
PARAGRAPH

When importing the WORDDOC back into OpenOffice.org Writer the WRITERDOC will look like

<text:section>
PARAGRAPH
</text:section><text:section>
PARAGRAPH
</text:section><text:section>
PARAGRAPH
PARAGRAPH
</text:section><text:section>
PARAGRAPH
</text:section><text:section>
PARAGRAPH
</text:section>

So clearly the structure has changed and the roundtrip is broken. You can generate an infinite number of roundtrip problems based on this.
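The export and re-import steps above can be sketched in a few lines of Python. This is only a toy model, not the actual OpenOffice.org filter code: a paragraph is a string, a <text:section> is a nested list, and the function names are made up for illustration.

```python
def split_into_runs(nodes):
    """Cut a nested Writer document into flat runs of paragraphs.

    Every start and end of a (hypothetical) nested section acts as a cut point.
    """
    runs = [[]]

    def walk(items):
        for item in items:
            if isinstance(item, list):   # a nested <text:section>
                runs.append([])          # cut at the section start
                walk(item)
                runs.append([])          # cut at the section end
            else:
                runs[-1].append(item)    # a plain paragraph

    walk(nodes)
    return [run for run in runs if run]  # drop empty runs


def export_to_word(writer_doc):
    """Return (paragraph, has_section_props) pairs, as in the WORDDOC listing."""
    runs = split_into_runs(writer_doc)
    word_doc = []
    for index, run in enumerate(runs):
        is_last_run = index == len(runs) - 1
        for position, paragraph in enumerate(run):
            ends_section = position == len(run) - 1 and not is_last_run
            word_doc.append((paragraph, ends_section))
    return word_doc


def import_from_word(word_doc):
    """Wrap every Word section in its own top-level Writer section."""
    writer_doc, current = [], []
    for paragraph, has_section_props in word_doc:
        current.append(paragraph)
        if has_section_props:
            writer_doc.append(current)
            current = []
    if current:                          # the final Word section
        writer_doc.append(current)
    return writer_doc


original = ["P1", ["P2", ["P3", "P4"], "P5"], "P6"]
exported = export_to_word(original)
reimported = import_from_word(exported)
print(exported)
print(reimported)
print(reimported == original)
```

The nesting is gone after one trip through the Word model, because the flat list of section breaks has no way to record which section contained which.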

It can get even worse. Consider the following WRITERDOC:

<text:section>
PARAGRAPH
<TABLE>
...
</TABLE>
</text:section>
PARAGRAPH

In order to export this to a WORDDOC you need to add a new paragraph:

PARAGRAPH
<TABLE>
...
</TABLE>
PARAGRAPH + section break properties
PARAGRAPH

since you can only tell a PARAGRAPH to start a new section.
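As a rough Python sketch of that rule (again a toy model with made-up names, not real filter code): when the last block of a section is not a paragraph, the exporter has to append an empty paragraph just to have something that can carry the section break.

```python
def close_word_section(blocks):
    """Return a copy of `blocks` whose last element is a paragraph with section props.

    Each block is a dict with a "kind" key ("paragraph" or "table"); this
    representation is an assumption made for the sketch.
    """
    closed = [dict(block) for block in blocks]
    if closed and closed[-1]["kind"] == "paragraph":
        # A paragraph is already last, so it can carry the section break itself.
        closed[-1]["sect_props"] = True
    else:
        # A table (or nothing) is last: add an empty carrier paragraph.
        closed.append({"kind": "paragraph", "text": "", "sect_props": True})
    return closed


section = [
    {"kind": "paragraph", "text": "intro"},
    {"kind": "table", "rows": 2},
]
result = close_word_section(section)
print(len(section), len(result))
```

The extra paragraph is visible in the document, so the export changes the content and not only the structure.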

BUT... why are we trying to map WORDSECTIONs to WRITERSECTIONS?



I believe that it's better to map between WORDSECTIONs and WRITERMASTERPAGESECTIONs.
Let me try to explain.

In OpenOffice.org Writer you have the concept of “page styles”. When rewriting the above grammar for WRITERDOCs including page styles we get

WRITERDOC ::= WRITERMASTERPAGESECTION+
WRITERMASTERPAGESECTION ::= (PARAGRAPH+master page break before | TABLE + master page break before) (PARAGRAPH | TABLE )*

which is quite similar to

WORDDOC ::= WORDSECTION+
WORDSECTION ::= (PARAGRAPH | TABLE)* PARAGRAPH[ with section properties attached to it]

right? So a “master page break before” attribute can be put on a paragraph or a table and causes the new WRITERMASTERPAGESECTION to start before that paragraph or table, whereas a “section property” on a paragraph causes a new WORDSECTION to start after that paragraph in a WORDDOC. Such a WORDSECTION can start on a new page with a new header/footer, or it can be continuous and simply change the column settings for the following content.
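The only real difference between the two grammars is where the properties sit: on the last paragraph of a section in Word, on the first paragraph in Writer. A small Python sketch of moving them back and forth, under some simplifying assumptions: paragraphs are (text, props) pairs, tables are left out, the document is not empty, the first Writer paragraph always carries properties, and the last Word paragraph does not carry any (the final section's properties are stored separately). All names are invented for the sketch.

```python
def word_to_writer(paragraphs, final_props):
    """Move section props from the last paragraph of each section to the first."""
    sections, current = [], []
    for text, props in paragraphs:
        current.append(text)
        if props is not None:            # this paragraph ends a Word section
            sections.append((current, props))
            current = []
    if current:                          # the final section of the document
        sections.append((current, final_props))
    writer = []
    for texts, props in sections:
        for index, text in enumerate(texts):
            writer.append((text, props if index == 0 else None))
    return writer


def writer_to_word(paragraphs):
    """Move page-style props from the first paragraph of each section to the last."""
    sections = []
    for text, override in paragraphs:
        if override is not None or not sections:
            sections.append(([], override))
        sections[-1][0].append(text)
    word = []
    for texts, props in sections[:-1]:
        word += [(t, None) for t in texts[:-1]] + [(texts[-1], props)]
    last_texts, final_props = sections[-1]
    word += [(t, None) for t in last_texts]
    return word, final_props


word_doc = [
    ("A1", None), ("A2", None), ("A3", {"cols": 3}),
    ("B1", None), ("B2", {"cols": 2, "type": "continuous"}),
    ("C1", None), ("C2", None),
]
writer_doc = word_to_writer(word_doc, {"cols": 1})
round_trip = writer_to_word(writer_doc)
print(writer_doc)
print(round_trip == (word_doc, {"cols": 1}))
```

In this flat model nothing is lost in either direction, which is exactly what the nested WRITERSECTION mapping could not offer.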

So my favorite idea is to allow a “master page override” property on a paragraph in a WRITERDOC and use this new property to handle WORDSECTIONs.

More concretely, I would like to add the following attributes to a WRITERDOC paragraph or table style:

<define name="style-style-attlist" combine="interleave">
<optional>
<attribute name="style:master-page-override">
<ref name="styleNameRef"/>
</attribute>
<attribute name="style:master-page-break">
<choice>
<value>auto</value>
<value>column</value>
<value>page</value>
</choice>
</attribute>
</optional>
</define>

We could then use these attributes to handle WORDSECTIONs 100% (and discourage the use of WRITERSECTIONs for .DOC interoperability :-))

For backward compatibility, the style:master-page-name="N" attribute could also be emitted whenever style:master-page-override="N" and style:master-page-break="page" are set.
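A small Python sketch of that backward-compatibility rule. Keep in mind that style:master-page-override and style:master-page-break are only the attributes proposed in this post, not part of OpenDocument today, and that the helper name is made up.

```python
ALLOWED_BREAKS = ("auto", "column", "page")


def page_style_attributes(master_page, break_kind):
    """Build the attribute set for a paragraph or table style.

    `master_page` is the name of a master page, `break_kind` one of the
    values from the proposed schema fragment above.
    """
    if break_kind not in ALLOWED_BREAKS:
        raise ValueError("unknown break kind: %r" % (break_kind,))
    attributes = {
        "style:master-page-override": master_page,   # proposed attribute
        "style:master-page-break": break_kind,       # proposed attribute
    }
    if break_kind == "page":
        # Older consumers only know the existing attribute, which always
        # implies a page break, so it is emitted only in this case.
        attributes["style:master-page-name"] = master_page
    return attributes


print(page_style_attributes("P1", "page"))
print(page_style_attributes("P2", "auto"))
```

An application that ignores the new attributes would then still get the page break and the page style right, and only miss the continuous and column cases.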

E.g. the OfficeOpenXML fragment

<w:body>
<w:p>A1</w:p>
<w:p>A2</w:p>
<w:p>A3<w:pPr><w:sectPr><w:cols w:num="3"/></w:sectPr></w:pPr></w:p>
<w:p>B1</w:p>
<w:p>B2<w:pPr><w:sectPr><w:cols w:num="2"/><w:type w:val="continuous" /></w:sectPr></w:pPr></w:p>
<w:p>C1</w:p>
<w:p>C2</w:p>
<w:sectPr><w:cols w:num="1"/></w:sectPr>
</w:body>

could be translated to the OpenDocument fragment

<office:automatic-styles>
<style:style style:name="S1" style:family="paragraph" style:master-page-override="P1" style:master-page-break="page"/>
<style:style style:name="S2" style:family="paragraph" style:master-page-override="P2" style:master-page-break="auto"/>
<style:style style:name="S3" style:family="paragraph" style:master-page-override="P3" style:master-page-break="page"/>
..
<style:page-layout style:name="PL1">
<style:page-layout-properties>
<style:columns fo:column-count="3"/>
</style:page-layout-properties>
</style:page-layout>
<style:page-layout style:name="PL2">
<style:page-layout-properties>
<style:columns fo:column-count="2"/>
</style:page-layout-properties>
</style:page-layout>
<style:page-layout style:name="PL3">
<style:page-layout-properties>
<style:columns fo:column-count="1"/>
</style:page-layout-properties>
</style:page-layout>
</office:automatic-styles>
..
<office:master-styles>
<style:master-page style:name="P1" style:page-layout-name="PL1"/>
<style:master-page style:name="P2" style:page-layout-name="PL2"/>
<style:master-page style:name="P3" style:page-layout-name="PL3"/>
</office:master-styles>

..
<office:body>
<text:p text:style-name="S1">A1</text:p>
<text:p>A2</text:p>
<text:p>A3</text:p>
<text:p text:style-name="S2">B1</text:p>
<text:p>B2</text:p>
<text:p text:style-name="S3">C1</text:p>
<text:p>C2</text:p>
</office:body>

and back!

I guess this blog entry is far too long and technical by now. Maybe I should move to a WIKI in the future.

~Florian