Florian Reuter's Weblog

Friday, June 26, 2009

API Design Matters

I was reading a very interesting article called "API Design Matters" with the subtitle "Bad application programming interfaces plague software engineering. How can we get things right?". Very cool stuff.

In OpenOffice.org we have an API "plague" too: The ODF import/export is based on the "UNO-API" and so is the OOXML import for Writer. And developers hate these APIs.

So the question is why do developers hate the "UNO-API"? And the obvious --- but wrong --- answer is: "I hate the UNO-API because of UNO". Don't get me wrong here: This is neither pro nor contra UNO. But the statement that "UNO is the cause of the ODF import/export and OOXML import problems" is wrong. It's not UNO per se, it's the design of the API.
[In case you're wondering what "UNO" is: UNO=COM ;-) So UNO is OpenOffice.org's way of COM.]

And just to be sure I do not offend the wrong people: The UNO-API was not designed to be used in the import/export filters. It was designed to be the API for "OpenOffice.org BASIC" developers, i.e. it was designed to provide a similar API to what VBA developers have in Microsoft Office. It was never designed to be used for import/export filters.

The problem was the decision to base the import/export code on such a high-level API! And we still suffer from this decision today!

Anyway. How can we fix this?
a) We claim the current API is the best mankind can do and print T-Shirts with 1000 years of OOo experience.
b) We claim UNO and abstraction is the problem and use the internal legacy APIs, so that we never get a chance to refactor the internal legacy stuff since we're creating even more dependencies.
c) We come up with a better API.

Option a) was demonstrated at the OpenOffice.org conference in Beijing. [Does anybody have a picture of the T-Shirt?]

Option b) is the straightforward approach. E.g. in Writer the “.DOC”, “.RTF” and “.HTML” filters are based on the internal “Core” APIs. So let's use these APIs instead of the UNO-APIs.
What's wrong with this approach? The problem is that these internal APIs do not abstract from the underlying implementation at all. Repeat: The internal APIs do not abstract from the underlying implementation at all.
Does this answer the question why using the internal APIs is the wrong approach? Obviously *not* having an abstraction between your core implementation details and your import/export filters is ... [offensive language detected ;-)].

Option c) only has one problem: What should the API look like?

I have some ideas here, but before posting them: maybe there are some strong beliefs out there?

Tuesday, April 15, 2008

Finally we had a developer conference! The good thing is that it was real fun. The bad thing was that I learned and drank toooooo much....

There are some discussions I'd love to share with you:

Bug handling. Had some interesting chats about bug handling, responsiveness etc. from a developer's point of view, especially a filter developer's point of view. My belief is that we need a better clustering of bugs into problem areas. This will definitely help to manage expectations as well as quality.
Mail merge. Learned that mail merge is not only broken IMHO but also in the opinion of others. Good (or bad ?:-)). However great things will happen here.
UI. Very good ideas about how to change the UI. Thanks Ricardo that was a great session.
Interop brokenness. Discussed my ideas about how to change ODF and OOo for better interop. Always good to get your ideas “blessed” by the master himself. Thanks Caolan...
Some chats about what to do with http://www.go-oo.org and how to attract more developers. Wait until my VM appears... ;-)

Besides the above, some interesting news regarding OOXML/ODF/ISO arose. The report from the ISO meeting in Oslo sounds very promising IMHO:


<quote>
SC 34 envisages the creation of three distinct working groups that meet the needs of:
 1. ISO/IEC 29500
 2. ISO/IEC 26300
 3. Work on interoperability/harmonization between document format standards
    and wishes to incorporate existing expertise on these standards.
</quote>

Only trouble here is that the ODF people do *not* seem to be happy about that --- but I have no idea why?

Overall it was a great week.

~Florian

Tuesday, February 05, 2008

"XML Namespaces are designed to support exactly this kind of thing." (Tim Bray)

We are making really good progress on our interoperability work. In our current focus area of fields we extended the OpenOffice.org Writer core for better support of MS Word-like fields. The first feature to benefit from this work is “Input fields”, which now support the long-wanted "tabbing" feature.

However, we want all fields to benefit from the new enhanced field core --- not only "Input fields". Other areas are e.g. "Mail merge fields". Since all of these fields share the same generic mechanism, we decided to add support for these generic MS Word-like fields in OpenOffice.org Writer. But by doing so we faced the problem that ODF does not support this kind of field.

Interestingly Tim Bray (Director of Web Technologies at Sun Microsystems) suggested a solution already in November 2005: http://www.tbray.org/ongoing/When/200x/2005/11/27/Office-XML. Unsurprisingly he suggested XML namespaces to solve this problem.

That's what we did. MS Word-like fields are now stored in the namespace

xmlns:field="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:field:1.0"

which clearly indicates the purpose: OOXML<->ODF interoperability.

The following RelaxNG fragment extends the current ODF specification with the new fields:

<define name="paragraph-content" combine="choice">
 <choice>
  <element name="field:fieldmark">
   <attribute name="text:name">
    <ref name="string"/>
   </attribute>
   <attribute name="field:type">
    <ref name="namespacedToken"/>
   </attribute>
   <attribute name="field:locked">
    <ref name="boolean"/>
   </attribute>
   <group>
    <ref name="fieldmark-parameter"/>
    <zeroOrMore>
     <ref name="paragraph-content"/>
    </zeroOrMore>
   </group>
  </element>
  <element name="field:fieldmark-start">
   <attribute name="text:name">
    <ref name="string"/>
   </attribute>
   <attribute name="field:type">
    <ref name="namespacedToken"/>
   </attribute>
   <attribute name="field:locked">
    <ref name="boolean"/>
   </attribute>
   <ref name="fieldmark-parameter"/>
  </element>
  <element name="field:fieldmark-end">
   <empty/>
  </element>
 </choice>
</define>

In general, fieldmarks are very similar to bookmarks, except that they need to be properly nested. This is achieved by the fact that a field:fieldmark-end does not have a "name" attribute, but instead closes the most recently opened field:fieldmark-start element.
The field:fieldmark element is a short form of a field:fieldmark-start/field:fieldmark-end pair. It SHOULD be written in preference to separate start and end marks.

Every fieldmark can have

a name (text:name), similar to the name of text:bookmark elements. Names SHOULD be unique (preferably also with respect to bookmark names).
a type (field:type), which allows applications to define the type of the fieldmark.
a sequence of associated (name, value) pairs, each represented by a <field:param field:name="string" field:value="string"/> element.
a locked attribute (field:locked), which specifies whether the user can edit the content or not.
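
The nesting rule described above (an end mark carries no name and closes the most recently opened start mark) is essentially a stack discipline. Here is a minimal sketch in Python of how a reader might pair the marks, assuming the standard ODF text namespace for text:name. The function name close_order and the sample document are made up for illustration, and the short-form field:fieldmark element is deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Field namespace as given in the post; the text namespace is assumed to be
# the standard ODF one.
FIELD_NS = "urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:field:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

START = f"{{{FIELD_NS}}}fieldmark-start"
END = f"{{{FIELD_NS}}}fieldmark-end"
NAME = f"{{{TEXT_NS}}}name"


def close_order(root):
    """Return the field names in the order in which end marks close them."""
    open_names = []  # stack of names of currently open fieldmark-start elements
    closed = []
    for elem in root.iter():  # iter() walks the tree in document order
        if elem.tag == START:
            open_names.append(elem.get(NAME))
        elif elem.tag == END:
            if not open_names:
                raise ValueError("fieldmark-end without an open fieldmark-start")
            closed.append(open_names.pop())
    if open_names:
        raise ValueError(f"unclosed fieldmarks: {open_names}")
    return closed


# Hypothetical document: "Inner" is nested in "Outer", and both fields
# span a paragraph boundary.
SAMPLE = f"""<root xmlns:text="{TEXT_NS}" xmlns:field="{FIELD_NS}">
<text:p>a<field:fieldmark-start text:name="Outer"/>b<field:fieldmark-start text:name="Inner"/>c</text:p>
<text:p>d<field:fieldmark-end/>e<field:fieldmark-end/></text:p>
</root>"""

print(close_order(ET.fromstring(SAMPLE)))
```

For the sample, the first end mark closes "Inner" and the second closes "Outer", which is exactly the proper nesting that bookmarks do not guarantee.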

A sample. Let's take a look at the following sample document:

The OOXML representation is:

 <w:p>
   <w:r><w:t xml:space="preserve">Title: </w:t></w:r>
   <w:bookmarkStart w:id="0" w:name="Text1"/>
   <w:r>
     <w:fldChar w:fldCharType="begin">
       <w:ffData>
         <w:name w:val="Text1"/>
         <w:statusText w:type="text" w:val="Just a sample field."/>
         <w:textInput/>
       </w:ffData>
     </w:fldChar>
     <w:instrText xml:space="preserve"> FORMTEXT </w:instrText>
     <w:fldChar w:fldCharType="separate"/>
     <w:t xml:space="preserve">A sample input.</w:t>
     <w:fldChar w:fldCharType="end"/>
   </w:r>
   <w:bookmarkEnd w:id="0"/>
 </w:p>
 <w:p>
   <w:r><w:t xml:space="preserve">Description: </w:t></w:r>
   <w:bookmarkStart w:id="1" w:name="Text2"/>
   <w:r w:rsidR="00FA39C2">
     <w:fldChar w:fldCharType="begin">
       <w:ffData>
         <w:name w:val="Text2"/>
         <w:statusText w:type="text" w:val="Yet another sample field..."/>
         <w:textInput/>
       </w:ffData>
     </w:fldChar>
     <w:instrText xml:space="preserve"> FORMTEXT </w:instrText>
     <w:fldChar w:fldCharType="separate"/>
     <w:t>A sample input.</w:t>
   </w:r>
 </w:p>
 <w:p>
   <w:r><w:t>Second sample input paragraph.</w:t></w:r>
   <w:r><w:fldChar w:fldCharType="end"/></w:r>
   <w:bookmarkEnd w:id="1"/>
 </w:p>
 <w:bookmarkStart w:id="2" w:name="Check1"/>
 <w:p>
   <w:r>
     <w:fldChar w:fldCharType="begin">
       <w:ffData>
         <w:name w:val="Check1"/>
         <w:statusText w:type="text" w:val="A sample checkbox..."/>
         <w:checkBox>
           <w:checked/>
         </w:checkBox>
       </w:ffData>
     </w:fldChar>
     <w:instrText xml:space="preserve"> FORMCHECKBOX </w:instrText>
     <w:fldChar w:fldCharType="end"/>
   </w:r>
   <w:bookmarkEnd w:id="2"/>
   <w:r><w:t xml:space="preserve"> Make sense?</w:t></w:r>
 </w:p>

The ODF+Enhancement representation is:

 
   <text:p>Title: <field:fieldmark-start text:name="Text1" field:type="ecma.office-open-xml.field.FORMTEXT"><field:param field:name="Description" field:value="Just a sample field."/></field:fieldmark-start>A sample input.<field:fieldmark-end/></text:p>
   <text:p>Description: <field:fieldmark-start text:name="Text2" field:type="ecma.office-open-xml.field.FORMTEXT"><field:param field:name="Description" field:value="Yet another sample field..."/></field:fieldmark-start>A sample input.</text:p>
   <text:p>Second sample input paragraph.<field:fieldmark-end/></text:p>
   <text:p><field:fieldmark text:name="Check1" field:type="ecma.office-open-xml.field.FORMCHECKBOX"><field:param field:name="Description" field:value="A sample checkbox..."/><field:param field:name="Result" field:value="1"/></field:fieldmark><text:s/>Make sense?</text:p>
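
Comparing the two representations, the field:type values in this sample look like the instruction keyword from w:instrText prefixed with "ecma.office-open-xml.field.". The following Python sketch derives the type that way. This is a reading of the sample only, not a documented rule. The function name field_types is hypothetical, and real documents may split one instruction across several w:instrText runs, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (bound to the "w" prefix in the sample).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Prefix observed in the sample's field:type values (assumed convention).
TYPE_PREFIX = "ecma.office-open-xml.field."


def field_types(root):
    """Derive a field:type string from the first word of each w:instrText."""
    types = []
    for instr in root.iter(f"{{{W_NS}}}instrText"):
        words = (instr.text or "").split()
        if words:
            types.append(TYPE_PREFIX + words[0])
    return types


# Trimmed-down version of the OOXML sample: only the instruction runs.
SAMPLE = f"""<w:body xmlns:w="{W_NS}">
<w:p><w:r><w:instrText xml:space="preserve"> FORMTEXT </w:instrText></w:r></w:p>
<w:p><w:r><w:instrText xml:space="preserve"> FORMCHECKBOX </w:instrText></w:r></w:p>
</w:body>"""

for field_type in field_types(ET.fromstring(SAMPLE)):
    print(field_type)
```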

Cool, isn't it? Or, in Tim's words: "Who could possibly be against it?"

Wednesday, January 23, 2008

Never try to catch a train last minute...

Yesterday I tried to catch a train last minute. While running towards it I fell down. I got up again and managed to get it.

While sitting in the train I realized that my arm hurt, and at my destination I went to a hospital. The X-rays revealed that my elbow was broken ;-) Nothing serious --- it'll hopefully heal within two weeks...

So my advice clearly is: Never try to catch a train last minute --- let it pass!

And the moral is: The next train would have departed in only 30 minutes...

Damn!

P.S. In the next two weeks you'll only get short messages from me since I can only type with one hand :-(

Tuesday, December 18, 2007

Back to the binaries! Yeah!

After all this XML work the binary file formats are a different world. For the fields work I needed to analyze the “form field” structure of the binary .DOC format:

The header: Actually a misused PICT structure:

b10	b16	field	Type	size	bitfield	comments
0	0	lcb	U32			Count of bytes of the whole block.
4	4	cbHeader	U16			Always 0x44
6	6		U8[62]			Contains zeros. In fact this is the PICT struct, but since it's not needed we can fill it with zeros.

The formfield payload (Unicode Variant)

b10	b16	Field	Type	size	bitfield	comments
0	0	cUnicodeMarker	U8[4]			Contains {0xFF,0xFF,0xFF,0xFF}
4	4	fftype	U8	:2	03	Type: 0 = Text 1 = Check Box 2 = List
		ffres	U8	:5	7C	Result field for a form field. Values from 0 to N-1, where N is the number of \ffl entries. In case of check boxes: 0==unchecked; 1==checked.
		ffownhelp	U8	:1	80	1 if there is associated Help text, 0 otherwise.
5	5	ffownstat	U8	:1	01	1 if there is associated status line text, 0 otherwise.
		ffprot	U8	:1	02	1 if this field is protected, 0 otherwise.
		ffsize	U8	:1	04	Type of size selected for check box field: 0 = Auto 1 = Exact
		fftypetxt	U8	:3	38	Type of text field: 0 = Regular text 1 = Number 2 = Date 3 = Current date 4 = Current time 5 = Calculation
		ffrecalc	U8	:1	40	1 if the field should be calculated on exit, 0 otherwise.
		ffhaslistbox	U8	:1	80	1 if this field has list box attached to it, 0 otherwise.
6	6	ffmaxlen	U16	:15	7FFF	Number of characters for text field. Zero means unlimited.
			U16	:1	8000	Unknown. Set to zero.
8	8	ffhps	U16			Check box size (half-point sizes).
10	A	xstz_ffname	Xstz_UString0			Form field name
		xstz_ffddeftext	Xstz_UString0			Default text for field. Only if type==0.
		ffdefres	U16			Default resource for list field. Default value for check box (0=default unchecked; 1=default checked). Only if type!=0.
		xstz_ffformat	Xstz_UString0			Format for text field
		xstz_ffhelptext	Xstz_UString0			Help text
		xstz_ffstattext	Xstz_UString0			Status line text
		xstz_ffentrymcr	Xstz_UString0			Macro to execute upon entry into this form field
		xstz_ffexitmcr	Xstz_UString0			Macro to execute upon exit from this form field
		cUnicodeMarker2	U8[2]			Contains {0xFF, 0xFF}; Padding and/or indicator for Unicode?
		fflLen	U32			Number of ffl entries
		ffl	Xstz_UString[fflLen]			Resource string for lists.
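
To illustrate how the bitfield column is meant to be read, here is a small Python sketch that unpacks the flags packed into bytes 4 and 5 of the payload using the masks from the table. The function name decode_ff_flags and the sample bytes are made up for illustration, and the sketch assumes the payload starts with the four-byte Unicode marker so that the flag bytes sit at offsets 4 and 5.

```python
# (name, byte offset within the payload, bit mask) as listed in the table.
FF_BITFIELDS = [
    ("fftype", 4, 0x03),
    ("ffres", 4, 0x7C),
    ("ffownhelp", 4, 0x80),
    ("ffownstat", 5, 0x01),
    ("ffprot", 5, 0x02),
    ("ffsize", 5, 0x04),
    ("fftypetxt", 5, 0x38),
    ("ffrecalc", 5, 0x40),
    ("ffhaslistbox", 5, 0x80),
]


def decode_ff_flags(payload):
    """Decode the packed flags in bytes 4 and 5 of a form field payload."""
    flags = {}
    for name, offset, mask in FF_BITFIELDS:
        # mask & -mask isolates the lowest set bit; its position is the shift.
        shift = (mask & -mask).bit_length() - 1
        flags[name] = (payload[offset] & mask) >> shift
    return flags


# Hypothetical payload prefix: Unicode marker, then a checked check box
# (fftype == 1, ffres == 1) that has status line text (ffownstat == 1).
payload = bytes([0xFF, 0xFF, 0xFF, 0xFF, 0x05, 0x01])
flags = decode_ff_flags(payload)
print(flags["fftype"], flags["ffres"], flags["ffownstat"])
```

Keeping the masks in a table mirrors the layout above, so a correction to the table only needs a one-line change in the code.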

An Xstz_UString has the following form:

b10	B16	Field	type	size	bitfield	Comments
0	0	Len	U16			Length of the string, in characters.
2	2	Unicode char	U16[len]			Unicode chars

An Xstz_UString0 has the following form:

b10	B16	Field	type	size	bitfield	Comments
0	0	len	U16			Length of the string, in characters.
2	2	Unicode char	U16[len]			Unicode chars
2+2*len		Zero	U16			Trailing “0”

In the case of a non-Unicode encoding, the Unicode markers disappear and the string characters are U8-sized.
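
For the Unicode variant, reading an Xstz_UString0 comes down to: read the U16 length, read that many U16 characters, then check the trailing zero. A minimal Python sketch follows, assuming little-endian byte order and UTF-16 code units for the "Unicode chars". Both are assumptions not stated in the tables, and the function name read_xstz_ustring0 is made up.

```python
import struct


def read_xstz_ustring0(buf, offset=0):
    """Read one Xstz_UString0; return (text, offset just past the string)."""
    # Assumption: all U16 values are little-endian.
    (length,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    end = start + 2 * length  # each character occupies one U16
    if end + 2 > len(buf):
        raise ValueError("buffer too short for Xstz_UString0")
    text = buf[start:end].decode("utf-16-le")
    (terminator,) = struct.unpack_from("<H", buf, end)
    if terminator != 0:
        raise ValueError("missing trailing zero")
    return text, end + 2


# Hypothetical buffer: the field name "Text1" followed by unrelated bytes.
buf = struct.pack("<H", 5) + "Text1".encode("utf-16-le") + b"\x00\x00" + b"rest"
name, next_offset = read_xstz_ustring0(buf)
print(name, next_offset)
```

Returning the next offset makes it easy to read the consecutive strings in the payload (name, default text, format, and so on) one after another.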

You might also want to take a look at the ffData element in OOXML ;-)

Tuesday, October 30, 2007

Business applications of unstructured text

Interesting article in Communications of the ACM.

A widely touted IT factoid states that 80% of the information produced by and contained in most organizations is stored in the form of unstructured data. Most of it is text (such as memoranda, internal documents, email, organizational Web pages, and comments from customers and from internal service personnel), and most of the applications that reflect the value of unstructured data are able to process it. Although unstructured data takes other forms, including images and audio, here I focus on the applications, technologies, and architectures for unstructured text acquisition and analysis (UTAA).

Monday, October 29, 2007

New OpenOffice.org target.

Many of you probably know the “WONT FIX” target in the OpenOffice.org issue tracker.

What about introducing a new target: “HELPS MICROSOFT”?

But why do we need this? These days many people --- especially from the file format camps --- are extremely sensitive about anything related to compatibility because they believe it helps Microsoft.

So let's give the ODF warriors an opportunity to clearly communicate with the users. Give them the “HELPS MICROSOFT” target to publicly expose the submitter of the bug and the people working on it.

Thursday, October 25, 2007

Field update --- preview for Windows.

I now have a preview for Windows available at http://download.go-oo.org/preview/oodemo.zip.

Simply download it and unzip it. To start it, run soffice.exe in ooo2.3/program/.

Same features as the Linux Version. So no saving at this point.

And don't forget to give feedback :-)

Thanks,

~Florian