Tuesday, June 30, 2009

Bulk conversion

Before continuing the “API saga” I needed to have an infrastructure to be able to load a bulk of documents and save them using a certain filter. For me the reason was mainly for testing purposes, however its very convenient for “bulk conversion” too.
The syntax is:

./soffice.bin -bulk [targetDir]/[filterName].[targetExt] [dir] ... [dir]


E.g. the following call will convert all *.odt documents from /home/freuter/tmp/ to “/home/freuter/tmp/out/*.doc” documents using the “MS Word 97” filter:

./soffice.bin -bulk "/home/freuter/tmp/out/MS Word 97.doc" /home/freuter/tmp/*.odt


This command will convert all ~/tmp/*.doc documents to ~/tmp/out/*.odt using the ODF converter:

./soffice.bin -bulk ~/tmp/out/writer8.odt ~/tmp/*.doc


And finally this call will convert all ~/tmp/*.doc document to ~/tmp/out/*.pdf PDF document using the “ writer_pdf_Export” filter:

./soffice.bin -bulk ~/tmp/out/writer_pdf_Export.pdf ~/tmp/*.doc


The patch is here. I additionally fixed a bug in the m_nRequestCount logic and I enabled it in the [Experimental] section.

Friday, June 26, 2009

API Design Matters

I was reading a very interesting article called "API Design Matters" with the subtitle "Bad application programming interfaces plague software engineering. How can we get things right?". Very cool stuff.

In OpenOffice.org we have an API "plague" too: The ODF import/export is based on the "UNO-API" and so is the OOXML import for Writer. And developers hate these APIs.

So the question is why do developers hate the "UNO-API"? And the obvious --- but wrong answer --- is: "I hate the UNO-API because of UNO". Don't get me wrong here: This is neither about pro or contra UNO. But the statement that "UNO is the problem of the ODF import/export and OOXML import problems" is wrong. It's not UNO per se, but its the design of the API.
[In case you're wondering what "UNO" is: UNO=COM ;-) So UNO is OpenOffice.org's way of COM.]

And just to be sure I do not offend the wrong people: The UNO-API was not designed to be used in the import/export filters. It was designed to be the API for "OpenOffice.org BASIC" developers, i.e. it was designed to provide a similar API to what VBA developers have in Microsoft Office. It was never designed to be used for import/export filters.

The problem was the decision to base the import/export code on such a high-level API! And we suffer from this decision until now!

Anyway. How can we fix this?
a) We claim the current API is the best mankind can do and print T-Shirts with 1000 years of OOo experience.
b) We claim UNO and abstraction is the problem and use the internal legacy APIs, so that we never get a chance to refactor the internal legacy stuff since we're creating even more dependencies.
c) We come up with a better API.

Option a) was demonstrated at the OpenOffice.org conference in Beijing. [Does anybody have a picture of the T-Shirt?]

Option b) is the straightforward approach. E.g. in Writer the “.DOC”, “.RTF”, .”HTML” filters are based on the internal “Core” APIs. So lets use these APIs instead of the UNO-APIs.
Whats wrong with the approach? The problem is that these internal APIs do not abstract from the underlying implementation at all. Repeat: The internal APIs do not abstract from the underlying implementation at all.
Does this answer the question why using the internal APIs is the wrong approach? Obviously *not* having an abstraction between your core implementation details and your import/exports filters is ... [offensive language detected ;-)].

Option c) only has one problem: How should the API look like?

I have some ideas here, but before posting them maybe there are some strong believes out there?