Florian Reuter's Weblog

Monday, February 07, 2011

Just came back from FOSDEM. Felt really good to meet the “usual suspects” again. Thanks for the great weekend!

I also had a chance to talk with Jos about ODF Web and ODF Collaboration. Jos gave a great talk about his ODF Web JavaScript framework, which emerged from his ODFKit efforts.
Jos had a very important slide in his talk which echoed my own belief: NO CONVERSION! This principle guided the design of his ODF Web Framework. NO CONVERSION simply means that Jos does not try to heuristically (aka lossily) map ODF to HTML and then map HTML heuristically (aka lossily) back to ODF. Instead Jos decided to have a clean 2-tier architecture which separates the content layer from the view layer: ODF is the content and HTML is the view. I think that’s the right approach. Even more: I think if you start adding “smart conversions”, “heuristics” and other “intelligent mappings”, things will get ugly sooner or later. [And from my experience with OpenOffice.org filter hacking, things will get messy sooner than you like. Always keep Murphy’s law in mind: What can go wrong will go wrong!]
We also had a chance to talk about Operational Transformation (OT) in the context of ODF. I tried to argue that what is really missing in ODF is a list of “atomic changes” a user can make to an ODF document. If we had this list of “atomic changes” we could build a transformation on top of it. For OT it is very important that you have “atomic” operations, since you need an operation transformation for every pair of operations. E.g. if you have |OPS| operations you need |OPS x OPS| = |OPS|^2 transformations. So keeping |OPS| small is quite important!
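To make the pair counting concrete, here is a minimal sketch with only two atomic operations on a plain string, so exactly 2 x 2 = 4 transformation cases are needed. The `Insert` and `Delete` types, the `transform` function and the tie-breaking flag are illustrative assumptions; they are not taken from ODF or from any of the projects mentioned here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    pos: int
    char: str


@dataclass(frozen=True)
class Delete:
    pos: int


Op = Union[Insert, Delete]


def apply_op(doc: str, op: Optional[Op]) -> str:
    """Apply one atomic operation; None means the operation was cancelled."""
    if op is None:
        return doc
    if isinstance(op, Insert):
        return doc[:op.pos] + op.char + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + 1:]


def transform(op: Op, against: Op, op_wins_ties: bool) -> Optional[Op]:
    """Rewrite `op` so it can be applied after the concurrent `against`."""
    if isinstance(op, Insert) and isinstance(against, Insert):
        if op.pos < against.pos or (op.pos == against.pos and op_wins_ties):
            return op
        return Insert(op.pos + 1, op.char)
    if isinstance(op, Insert) and isinstance(against, Delete):
        return op if op.pos <= against.pos else Insert(op.pos - 1, op.char)
    if isinstance(op, Delete) and isinstance(against, Insert):
        return op if op.pos < against.pos else Delete(op.pos + 1)
    # Delete against Delete: the same character was removed twice.
    if op.pos == against.pos:
        return None
    return op if op.pos < against.pos else Delete(op.pos - 1)


def both_orders(doc: str, a: Op, b: Op):
    """Apply a and b in both orders; the two results should be equal."""
    via_a = apply_op(apply_op(doc, a), transform(b, a, op_wins_ties=False))
    via_b = apply_op(apply_op(doc, b), transform(a, b, op_wins_ties=True))
    return via_a, via_b
```

Every additional atomic operation adds a full row and column of cases to `transform`, which is why a small operation set matters.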
Assembling the list of atomic operations is a lot of work --- admittedly. However it is work that every designer of an API needs to do anyway. I really believe that some input from the ODF API projects like Oracle’s ODFDOM, IBM’s Simple API for ODF, ANR’s LPOD and Jos’ ODFKit could really help.
Let me finish my post with a classification of changes to an ODF document:

I believe that for change tracking we only need “atomic operations” and a way to combine them to “compound operations”. I don’t think we need to be able to track changes to the XML tree or the XML text. In fact I think it does more harm than good.
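As a rough illustration of “atomic operations” combined into “compound operations”, the sketch below models a compound as a labelled, ordered sequence of atomic steps on a plain string. The names `InsertText`, `DeleteText`, `Compound` and `replace_text` are hypothetical; a real ODF operation list would look different.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class InsertText:
    pos: int
    text: str


@dataclass(frozen=True)
class DeleteText:
    pos: int
    length: int


Atomic = Union[InsertText, DeleteText]


@dataclass(frozen=True)
class Compound:
    label: str                 # what the user did, e.g. "replace"
    steps: Tuple[Atomic, ...]  # the atomic operations, in order


def apply_atomic(doc: str, op: Atomic) -> str:
    if isinstance(op, InsertText):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


def apply_compound(doc: str, change: Compound) -> str:
    """A compound operation is just its atomic steps applied in order."""
    for step in change.steps:
        doc = apply_atomic(doc, step)
    return doc


def replace_text(pos: int, length: int, new_text: str) -> Compound:
    """'Replace' is not atomic: it is a delete followed by an insert."""
    return Compound("replace", (DeleteText(pos, length), InsertText(pos, new_text)))
```

A tracked change would then record the `Compound`, not a diff of the XML tree.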

Tuesday, January 11, 2011

A lot has happened since my last blogpost in June 2009.

It's 2011 and I have been working for more than a year on a new project called “Native OfficeOpenXML” (NOOXML). The story is quite simple: I was very disappointed with the quality of the “docx” support in OpenOffice.org. Even more --- I'm very disappointed with the code quality and the design(!) of the OpenOffice.org Writer core and layout. There are people who believe this can be solved by “code refactoring”, fixing “low-hanging fruits”, “quick wins” and other magic silver-bullet phrases. But one thing was for certain: There is no way to (re-)implement a core and a layout engine. Can't be done. Impossible. No way.

OpenOffice.org took the refactoring route. I took the rewrite route.

After one year here is where we are.

What has happened:
I started designing and implementing the NOOXML core in Jan 2010. The magic is the data structure, which allows a compact representation of the documents and a fast implementation of insert/delete operations etc. I also wanted to be able to do real-time collaboration, which influenced the design of the core a lot. In March 2010 I was able to load the ECMA Spec Part I (very big document) into the core. Not only on a desktop machine, but also on my “iPod” (not “iPad”!!).
Once I had the basic core design and implementation done I started working on the layout engine. The primary goal was to build a fast and reliable layout engine. In my implementation I focused on OfficeOpenXML fidelity. In August I had the basic layout features like text, headers, footers, tables, footnotes etc. done. I was able to render the ECMA Spec Part I (again: very big document; >5000 pages) to PDF. I then added section and multiple column support.
Yesterday I was able to render the ECMA Spec Part I document on the iPod (real device) AND in the Android emulator (since I don't have an Android device) and without a user interface:

(I know: I took a really long time. But there is sooooo much room for improvements. And hey: OOo can't even load it on a desktop-machine.)

And here is the UI-less port for Android 2.3:

Happy new year!

Tuesday, June 30, 2009

Bulk conversion

Before continuing the “API saga” I needed an infrastructure to load a bulk of documents and save them using a certain filter. For me the reason was mainly testing, however it's very convenient for “bulk conversion” too.
The syntax is:


./soffice.bin -bulk [targetDir]/[filterName].[targetExt] [dir] ... [dir]

E.g. the following call will convert all *.odt documents from /home/freuter/tmp/ to “/home/freuter/tmp/out/*.doc” documents using the “MS Word 97” filter:


./soffice.bin -bulk "/home/freuter/tmp/out/MS Word 97.doc" /home/freuter/tmp/*.odt

This command will convert all ~/tmp/*.doc documents to ~/tmp/out/*.odt using the ODF converter:


./soffice.bin -bulk ~/tmp/out/writer8.odt ~/tmp/*.doc

And finally this call will convert all ~/tmp/*.doc documents to ~/tmp/out/*.pdf PDF documents using the “writer_pdf_Export” filter:


./soffice.bin -bulk ~/tmp/out/writer_pdf_Export.pdf ~/tmp/*.doc
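To spell out how the first argument decomposes, here is a small sketch that splits a `[targetDir]/[filterName].[targetExt]` spec into its three parts. It is not the code from the patch; `parse_bulk_target` and `target_path` are hypothetical helpers, and the rule that the output keeps the source file's base name is an assumption read off the examples above.

```python
import posixpath


def parse_bulk_target(spec: str):
    """Split '[targetDir]/[filterName].[targetExt]' into its three parts."""
    target_dir, leaf = posixpath.split(spec)
    # The filter name may itself contain spaces or dots ("MS Word 97"),
    # so only the last dot separates it from the extension.
    filter_name, dot, ext = leaf.rpartition(".")
    if not (dot and filter_name and ext):
        raise ValueError(f"expected [targetDir]/[filterName].[targetExt]: {spec!r}")
    return target_dir, filter_name, ext


def target_path(spec: str, source: str) -> str:
    """Assumed naming rule: same base name, target directory and extension."""
    target_dir, _filter_name, ext = parse_bulk_target(spec)
    stem = posixpath.splitext(posixpath.basename(source))[0]
    return posixpath.join(target_dir, stem + "." + ext)
```

For example, `parse_bulk_target("/home/freuter/tmp/out/MS Word 97.doc")` yields the directory `/home/freuter/tmp/out`, the filter name `MS Word 97` and the extension `doc`.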

The patch is here. I additionally fixed a bug in the m_nRequestCount logic and I enabled it in the [Experimental] section.

Friday, June 26, 2009

API Design Matters

I was reading a very interesting article called "API Design Matters" with the subtitle "Bad application programming interfaces plague software engineering. How can we get things right?". Very cool stuff.

In OpenOffice.org we have an API "plague" too: The ODF import/export is based on the "UNO-API" and so is the OOXML import for Writer. And developers hate these APIs.

So the question is: why do developers hate the "UNO-API"? And the obvious --- but wrong --- answer is: "I hate the UNO-API because of UNO". Don't get me wrong here: This is neither pro nor contra UNO. But the statement that "UNO is the cause of the ODF import/export and OOXML import problems" is wrong. It's not UNO per se, it's the design of the API.
[In case you're wondering what "UNO" is: UNO=COM ;-) So UNO is OpenOffice.org's way of COM.]

And just to be sure I do not offend the wrong people: The UNO-API was not designed to be used in the import/export filters. It was designed to be the API for "OpenOffice.org BASIC" developers, i.e. it was designed to provide a similar API to what VBA developers have in Microsoft Office. It was never designed to be used for import/export filters.

The problem was the decision to base the import/export code on such a high-level API! And we suffer from this decision to this day!

Anyway. How can we fix this?
a) We claim the current API is the best mankind can do and print T-Shirts with 1000 years of OOo experience.
b) We claim UNO and abstraction is the problem and use the internal legacy APIs, so that we never get a chance to refactor the internal legacy stuff since we're creating even more dependencies.
c) We come up with a better API.

Option a) was demonstrated at the OpenOffice.org conference in Beijing. [Does anybody have a picture of the T-Shirt?]

Option b) is the straightforward approach. E.g. in Writer the “.DOC”, “.RTF” and “.HTML” filters are based on the internal “Core” APIs. So let's use these APIs instead of the UNO-APIs.
What's wrong with this approach? The problem is that these internal APIs do not abstract from the underlying implementation at all. Repeat: The internal APIs do not abstract from the underlying implementation at all.
Does this answer the question why using the internal APIs is the wrong approach? Obviously *not* having an abstraction between your core implementation details and your import/export filters is ... [offensive language detected ;-)].

Option c) only has one problem: What should the API look like?

I have some ideas here, but before posting them maybe there are some strong beliefs out there?

Tuesday, April 15, 2008

Finally we had a developer conference! The good thing is that it was real fun. The bad thing was that I learned and drank toooooo much....

There are some discussions I'd love to share with you:

Bug handling. Had some interesting chats about bug handling, responsiveness etc. from a developer's point of view, especially from a filter developer's point of view. My belief is that we need a better clustering of bugs into problematic areas. This will definitely help to manage expectations as well as quality.
Mail merge. Learned that mail merge is not only broken IMHO but also in the opinion of others. Good (or bad ?:-)). However great things will happen here.
UI. Very good ideas about how to change the UI. Thanks Ricardo that was a great session.
Interop brokenness. Discussed my ideas about how to change ODF and OOo for better interop. Always good to get your ideas “blessed” by the master himself. Thanks Caolan...
Some chats about what to do with http://www.go-oo.org and how to attract more developers. Wait until my VM will appear... ;-)

Besides the above, some interesting news regarding OOXML/ODF/ISO arose. The report from the ISO meeting in Oslo sounds very promising IMHO:


<quote>
SC 34 envisages the creation of three distinct working groups that meet the needs of:
 1. ISO/IEC 29500
 2. ISO/IEC 26300
 3. Work on interoperability/harmonization between document format standards
    and wishes to incorporate existing expertise on these standards.
</quote>

Only trouble here is that the ODF people do *not* seem to be happy about that --- but I have no idea why?

Overall it was a great week:

~Florian

Tuesday, February 05, 2008

"XML Namespaces are designed to support exactly this kind of thing." (Tim Bray)

We are making really good progress on our interoperability work. In our current focus area of fields we extended the OpenOffice.org Writer core for better support of MS Word-like fields. The first feature which benefits from this work is “Input fields”, which now support the long-wanted "tabbing" feature.

However we want all fields to benefit from the new enhanced field core --- not only "Input fields". Other areas are e.g. "Mail merge fields". Since all of these fields share the same generic mechanism we decided to add support for these generic MS Word-like fields to OpenOffice.org Writer. But by doing so we faced the problem that ODF does not support this kind of field.

Interestingly Tim Bray (Director of Web Technologies at Sun Microsystems) suggested a solution already in November 2005: http://www.tbray.org/ongoing/When/200x/2005/11/27/Office-XML. Unsurprisingly he suggested XML namespaces to solve this problem.

That's what we did. MS Word-like fields are now stored in the namespace

xmlns:field="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:field:1.0"

which clearly indicates the purpose: OOXML<->ODF interoperability.

The following RelaxNG fragment extends the current ODF specification with the new fields:

<define name="paragraph-content" combine="choice">
 <choice>
  <element name="field:fieldmark">
   <attribute name="text:name">
    <ref name="string"/>
   </attribute>
   <attribute name="field:type">
    <ref name="namespacedToken"/>
   </attribute>
   <attribute name="field:locked">
    <ref name="boolean"/>
   </attribute>
   <group>
    <ref name="fieldmark-parameter"/>
    <zeroOrMore>
     <ref name="paragraph-content"/>
    </zeroOrMore>
   </group>
  </element>
  <element name="field:fieldmark-start">
   <attribute name="text:name">
    <ref name="string"/>
   </attribute>
   <attribute name="field:type">
    <ref name="namespacedToken"/>
   </attribute>
   <attribute name="field:locked">
    <ref name="boolean"/>
   </attribute>
      <ref name="fieldmark-parameter"/>
  </element>
  <element name="field:fieldmark-end">
   <empty/>
  </element>
 </choice>
</define>

In general fieldmarks are very similar to bookmarks, except that they need to be properly nested. This is achieved by the fact that a field:fieldmark-end does not have a "name" attribute, but instead closes the last opened field:fieldmark-start element.
The field:fieldmark element is a short form of field:fieldmark-start and field:fieldmark-end. It SHOULD preferably be written instead of start-/end marks.
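The nesting rule amounts to a stack: each field:fieldmark-end closes whatever field:fieldmark-start was opened last, even across paragraph boundaries. A minimal sketch of how a reader might pair them up follows; the `pair_fieldmarks` function, the un-namespaced `body` wrapper and the `example.*` field types are assumptions for illustration only, and the text namespace is the standard ODF one.

```python
import io
import xml.etree.ElementTree as ET

FIELD_NS = "urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:field:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def pair_fieldmarks(xml_bytes: bytes):
    """Return fieldmark names in the order their end marks close them."""
    open_marks = []  # stack of names of currently open fieldmark-start elements
    closed = []
    # "start" events arrive in document order, which is what nesting needs.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        if elem.tag == f"{{{FIELD_NS}}}fieldmark-start":
            open_marks.append(elem.get(f"{{{TEXT_NS}}}name"))
        elif elem.tag == f"{{{FIELD_NS}}}fieldmark-end":
            if not open_marks:
                raise ValueError("fieldmark-end without an open fieldmark-start")
            closed.append(open_marks.pop())
    if open_marks:
        raise ValueError(f"unclosed fieldmarks: {open_marks}")
    return closed


# Hypothetical input: two nested fieldmarks, the outer one spanning paragraphs.
SAMPLE = (
    f'<body xmlns:text="{TEXT_NS}" xmlns:field="{FIELD_NS}">'
    '<text:p>'
    '<field:fieldmark-start text:name="Outer" field:type="example.outer"/>a'
    '<field:fieldmark-start text:name="Inner" field:type="example.inner"/>b'
    '<field:fieldmark-end/>'
    '</text:p>'
    '<text:p>c<field:fieldmark-end/></text:p>'
    '</body>'
).encode("utf-8")
```

Because the end mark carries no name, overlapping (improperly nested) fieldmarks simply cannot be expressed, unlike with bookmarks.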

Every fieldmark can have

a name (text:name), similar to the name of text:bookmark elements. Names SHOULD be unique (preferably also with respect to the bookmark names).
a type (field:type) which allows applications to define the type of the fieldmark.
a sequence of associated (name, value) pairs, each represented by a <field:param field:name="string" field:value="string"/> element.
a locked attribute (field:locked) which specifies whether the user can edit the content or not.

A sample. Let's take a look at the following sample document:

The OOXML representation is:

  <w:p>
   <w:r><w:t xml:space="preserve">Title: </w:t></w:r>
   <w:bookmarkStart w:id="0" w:name="Text1"/>
   <w:r>
     <w:fldChar w:fldCharType="begin">
       <w:ffData>
         <w:name w:val="Text1"/>
         <w:statusText w:type="text" w:val="Just a sample field."/>
         <w:textInput/>
       </w:ffData>
     </w:fldChar>
     <w:instrText xml:space="preserve"> FORMTEXT </w:instrText>
     <w:fldChar w:fldCharType="separate"/>
     <w:t xml:space="preserve">A sample input.</w:t>
     <w:fldChar w:fldCharType="end"/>
   </w:r>
   <w:bookmarkEnd w:id="0"/>
 </w:p>
 <w:p>
   <w:r><w:t xml:space="preserve">Description: </w:t></w:r>
   <w:bookmarkStart w:id="1" w:name="Text2"/>
   <w:r w:rsidR="00FA39C2">
     <w:fldChar w:fldCharType="begin">
       <w:ffData>
         <w:name w:val="Text2"/>
         <w:statusText w:type="text" w:val="Yet another sample field..."/>
         <w:textInput/>
       </w:ffData>
     </w:fldChar>
     <w:instrText xml:space="preserve"> FORMTEXT </w:instrText>
     <w:fldChar w:fldCharType="separate"/>
     <w:t>A sample input.</w:t>
   </w:r>
 </w:p>
 <w:p>
   <w:r><w:t>Second sample input paragraph.</w:t></w:r>
   <w:r><w:fldChar w:fldCharType="end"/></w:r>
   <w:bookmarkEnd w:id="1"/>
 </w:p>
 <w:bookmarkStart w:id="2" w:name="Check1"/>
 <w:p>
   <w:r>
     <w:fldChar w:fldCharType="begin">
       <w:ffData>
         <w:name w:val="Check1"/>
         <w:statusText w:type="text" w:val="A sample checkbox..."/>
         <w:checkBox>
           <w:checked/>
         </w:checkBox>
       </w:ffData>
     </w:fldChar>
     <w:instrText xml:space="preserve"> FORMCHECKBOX </w:instrText>
     <w:fldChar w:fldCharType="end"/>
   </w:r>
   <w:bookmarkEnd w:id="2"/>
   <w:r><w:t xml:space="preserve"> Make sense?</w:t></w:r>
 </w:p>

The ODF+Enhancement representation is:

 
   <text:p>Title: <field:fieldmark-start text:name="Text1" field:type="ecma.office-open-xml.field.FORMTEXT"><field:param field:name="Description" field:value="Just a sample field."/></field:fieldmark-start>A sample input.<field:fieldmark-end/></text:p>
   <text:p>Description: <field:fieldmark-start text:name="Text2" field:type="ecma.office-open-xml.field.FORMTEXT"><field:param field:name="Description" field:value="Yet another sample field..."/></field:fieldmark-start>A sample input.</text:p>
   <text:p>Second sample input paragraph.<field:fieldmark-end/></text:p>
   <text:p><field:fieldmark text:name="Check1" field:type="ecma.office-open-xml.field.FORMCHECKBOX"><field:param field:name="Description" field:value="A sample checkbox..."/><field:param field:name="Result" field:value="1"/></field:fieldmark><text:s/>Make sense?</text:p>

Cool, isn't it? Or in Tim's words: "Who could possibly be against it?"

Wednesday, January 23, 2008

Never try to catch a train last minute...

Yesterday I tried to catch a train last minute. While running towards it I fell down. I got up again and managed to get it.

While sitting in the train I realized that my arm hurt, and at my destination I went to a hospital. The X-rays revealed that my elbow was broken ;-) Nothing serious --- it'll hopefully heal within two weeks...

So my advice clearly is: Never try to catch a train last minute --- let it pass!

And the moral is: The next train would have departed in only 30 minutes...

Damn!

P.S. In the next two weeks you'll only get short messages from me since I can only type with one hand :-(

Tuesday, December 18, 2007

Back to the binaries! Yeah!

After all this XML work the binary file formats are a different world. For the fields work I needed to analyze the “form field” structure of the binary .DOC format:

The header: Actually a misused PICT structure:

b10	b16	field	Type	size	bitfield	comments
0	0	lcb	U32			Count of bytes of the whole block.
4	4	cbHeader	U16			Always 0x44
6	6		U8[62]			Contains zero. In fact this is the PICT struct, but since it's not needed we can fill it with zeros.
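As a sketch of how a reader could consume this header, the snippet below unpacks lcb and cbHeader and skips the zero-filled PICT area. It assumes little-endian byte order, that lcb counts the header bytes as well, and that the payload starts directly after the 0x44-byte header; `parse_formfield_header` is a hypothetical helper name.

```python
import struct

HEADER_SIZE = 0x44  # 4 (lcb) + 2 (cbHeader) + 62 (unused PICT bytes)


def parse_formfield_header(block: bytes):
    """Return (lcb, payload) for one form field data block."""
    lcb, cb_header = struct.unpack_from("<IH", block, 0)
    if cb_header != HEADER_SIZE:
        raise ValueError(f"unexpected cbHeader: {cb_header:#x}")
    # Assumption: lcb covers the whole block, header included.
    return lcb, block[HEADER_SIZE:lcb]
```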

The formfield payload (Unicode Variant)

b10	b16	Field	Type	size	bitfield	comments
0	0	cUnicodeMarker	U8[4]			Contains {0xFF,0xFF,0xFF,0xFF}
4	4	fftype	U8	:2	03	Type: 0 = Text 1 = Check Box 2 = List
		ffres	U8	:5	7C	Result field for a form field. Values from 0 to N-1, where N is the number of \ffl entries. In case of check boxes: 0==unchecked; 1==checked.
		ffownhelp	U8	:1	80	1 if there is associated Help text, 0 otherwise.
5	5	ffownstat	U8	:1	01	1 if there is associated status line text, 0 otherwise.
		ffprot	U8	:1	02	1 if this field is protected, 0 otherwise.
		ffsize	U8	:1	04	Type of size selected for check box field: 0 = Auto 1 = Exact
		fftypetxt	U8	:3	38	Type of text field: 0 = Regular text 1 = Number 2 = Date 3 = Current date 4 = Current time 5 = Calculation
		ffrecalc	U8	:1	40	1 if the field should be calculated on exit, 0 otherwise.
		ffhaslistbox	U8	:1	80	1 if this field has list box attached to it, 0 otherwise.
6	6	ffmaxlen	U16	:15	7FFF	Number of characters for text field. Zero means unlimited.
			U16	:1	8000	Unknown. Set to zero.
8	8	ffhps	U16			Check box size (half-point sizes).
10	A	xstz_ffname	Xstz_UString0			Form field name
		xstz_ffddeftext	Xstz_UString0			Default text for field. Only if type==0.
		ffdefres	U16			Default resource for list field. Default value for check box (0=default unchecked; 1=default checked). Only if type!=0.
		xstz_ffformat	Xstz_UString0			Format for text field
		xstz_ffhelptext	Xstz_UString0			Help text
		xstz_ffstattext	Xstz_UString0			Status line text
		xstz_ffentrymcr	Xstz_UString0			Macro to execute upon entry into this form field
		xstz_ffexitmcr	Xstz_UString0			Macro to execute upon exit from this form field
		cUnicodeMarker2	U8[2]			Contains {0xFF, 0xFF}; Padding and/or indicator for Unicode?
		fflLen	U32			Num of ffls
		ffl	Xstz_UString[fflLen]			Resource string for lists.
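The fixed-size part of the payload (bytes 0 to 9) can be decoded with plain masks and shifts taken from the bitfield column. The sketch below assumes little-endian byte order and the Unicode variant; `decode_formfield_flags` is a hypothetical name, and the variable-length strings that follow at offset 10 are not handled here.

```python
import struct

UNICODE_MARKER = b"\xff\xff\xff\xff"


def decode_formfield_flags(payload: bytes) -> dict:
    """Decode the fixed-size fields at offsets 4..9 of the payload."""
    if payload[:4] != UNICODE_MARKER:
        raise ValueError("not the Unicode variant of the payload")
    b4, b5, maxlen_raw, ffhps = struct.unpack_from("<BBHH", payload, 4)
    return {
        "fftype": b4 & 0x03,              # 0 = text, 1 = check box, 2 = list
        "ffres": (b4 & 0x7C) >> 2,
        "ffownhelp": (b4 & 0x80) >> 7,
        "ffownstat": b5 & 0x01,
        "ffprot": (b5 & 0x02) >> 1,
        "ffsize": (b5 & 0x04) >> 2,
        "fftypetxt": (b5 & 0x38) >> 3,
        "ffrecalc": (b5 & 0x40) >> 6,
        "ffhaslistbox": (b5 & 0x80) >> 7,
        "ffmaxlen": maxlen_raw & 0x7FFF,  # top bit is unknown, ignored
        "ffhps": ffhps,
    }
```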

An Xstz_UString has the following form:

b10	B16	Field	type	size	bitfield	Comments
0	0	Len	U16			Len of the String.
2	2	Unicode char	U16[len]			Unicode chars

An Xstz_UString0 has the following form:

b10	B16	Field	type	size	bitfield	Comments
0	0	len	U16			Len of the String.
2	2	Unicode char	U16[len]			Unicode chars
2+2*len		Zero	U16			Trailing “0”
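Reading an Xstz_UString0 then comes down to a length prefix, that many 16-bit characters, and a zero terminator. A small sketch, assuming little-endian UTF-16 code units; `read_xstz_ustring0` is a hypothetical helper that returns the decoded string together with the offset of the next field.

```python
import struct


def read_xstz_ustring0(data: bytes, offset: int = 0):
    """Read one Xstz_UString0 and return (text, offset_after_string)."""
    (length,) = struct.unpack_from("<H", data, offset)
    start = offset + 2
    end = start + 2 * length  # each character is one U16
    text = data[start:end].decode("utf-16-le")
    (terminator,) = struct.unpack_from("<H", data, end)
    if terminator != 0:
        raise ValueError("missing trailing zero after string")
    return text, end + 2
```

Chaining calls with the returned offset walks through xstz_ffname and the strings after it in the payload.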

In the case of a non-Unicode encoding the Unicode markers disappear and the string chars have U8 size.

You might also want to take a look at the ffData element in OOXML ;-)