OpenOffice.org 2: Built for Comfort, Not for Speed?

Controversy swirls around a blog report from a respected columnist that compares OpenOffice.org 2.0 with Microsoft Office — in particular, OpenOffice.org Calc vs. Microsoft Excel — and declares OpenOffice.org to be a pig: slower and fatter in its memory footprint. The comparison is relevant only to Windows XP users — Mac and Linux users would not only have completely different operating environments and different results but also different motivations for using OpenOffice.org.

There’s nothing like a misguided “apples vs. oranges” comparison to get the blog juices flowing. The blog post “OpenOffice.org 2.0 is here, but is it a pig?” by ZDNet’s George Ou starts out with the question, “OpenOffice.org 2.0 is finally out with much fanfare, but is it a memory and resource hog?” He concludes that it probably is, based on comparison tests he ran with a large spreadsheet file. Howls of protest ensued, mostly in reaction to the headline, with many arguing that a simple test of this nature was no real comparison at all.

Typical of the many negative comments generated by this blog is this one, posted by “georgep”:

You start off with a headline implying that OO is a pig. Most readers will not download your file and discover that it is in the end of the tail of the distribution of users. For most users, the ms lost are irrelevant. You should be honest and say that a few power users may find OO too slow, but 99% of users will not notice the difference.

I agree with the above comment with regard to the original blog post. George Ou was not talking about an average file but a very large one — 293,059,603 bytes in size. With the file sizes typical of everyday work, the perceived speed differences between Microsoft Office and OpenOffice.org are just not that relevant. My own file-opening tests (with much smaller documents that are typical of my work) showed no significant differences. I didn’t bother to try to refute any of Ou’s statistics or conclusions on memory usage, though other commentators did.

One coherent explanation, posted as a comment by “DerekBerube”, points out how the comparison doesn’t take into account the added functionality of OpenOffice.org:

The latency that you’re seeing opening the file in OpenOffice is likely due to the fact that [OpenOffice.org] Calc is not only opening the document, but also validating the XML structure. Since Excel opens the file much faster, it leads me to conclude that it isn’t doing something that Calc is (namely validating the structure of the document).

George’s “benchmarks” hardly provide a basis for comparing the relative strengths and weaknesses of the respective office suites. They just illustrate different implementation decisions.

Indeed, if you are concerned about document portability (as is the Commonwealth of Massachusetts and other government agencies, as well as large businesses and everyday individuals who detest being locked into commercial software applications), you would not only want to use the standard XML-based format, you’d also want the software to validate it when opening and saving documents.
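
If validation really is part of what Calc does on open, the extra step looks roughly like the sketch below. It is only a minimal illustration: it assumes the third-party lxml library and a local copy of the OpenDocument RELAX NG schema, and both file names are placeholders.

# Sketch: validate the content.xml inside an OpenDocument file against the
# ODF RELAX NG schema, roughly the extra work validation would add on open.
# Assumes the third-party lxml package; the file names below are placeholders.
import zipfile
from lxml import etree

def validate_odf(path, schema_path="OpenDocument-schema-v1.0.rng"):
    schema = etree.RelaxNG(etree.parse(schema_path))
    with zipfile.ZipFile(path) as odf:
        with odf.open("content.xml") as content:
            doc = etree.parse(content)
    return schema.validate(doc)          # True if content.xml conforms

print(validate_odf("large-sheet.ods"))   # hypothetical test file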

Many comments questioned George Ou’s motives in writing the original blog post; some speculated that he is on Microsoft’s payroll. He doesn’t seem to be biased toward Microsoft, though his opinions are contrary to mine with regard to Web applications. I found several of his articles promoting the idea that rich, fat client applications are better than Web applications — he even challenged the folks at Sun to abandon their desktop applications and try to use a Web implementation of Office. He reported that no one accepted the challenge. The problem with his challenge is that he made it way too soon. Web applications are only beginning to be developed; besides, I now use WordPress (a Web application for writing blog posts and managing blogs) as often as OpenOffice.org or any other client application.

Way down deep in the comments, I came across George’s actual bias, which is an understandable one:

It’s my job in the enterprise to fix “slowness”. That’s what I get paid to do, whack slowness.

Race car drivers tend to see the world as one giant racetrack, and use the rate of acceleration as their measure. The rest of us drivers are more concerned about gas mileage and reliability. Don’t base your choice on as simple a comparison as George Ou’s test. Read other comparisons (I’ll try to keep you posted) and try your own tests — with your own real-world documents.
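
If you want a rough number of your own for the XML-handling cost on one of your spreadsheets, the little script below times just the unzipping and parsing of the content.xml inside an .ods file. It is a sketch using only the Python standard library; the file name is a placeholder.

# Sketch: time how long it takes merely to unpack and parse the content.xml
# of one of your own .ods files, a rough proxy for the XML-handling cost.
# Standard-library Python only; "mysheet.ods" is a placeholder file name.
import time
import zipfile
import xml.etree.ElementTree as ET

def time_content_parse(path):
    start = time.time()
    with zipfile.ZipFile(path) as odf:
        xml_bytes = odf.read("content.xml")
    unzipped = time.time()
    ET.fromstring(xml_bytes)             # parse the whole document tree
    parsed = time.time()
    print(f"unzip: {unzipped - start:.2f}s   parse: {parsed - unzipped:.2f}s")

time_content_parse("mysheet.ods")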


Comments

  1. First of all, thanks for the link.

    You said: “One coherent explanation, posted as a comment by “DerekBerube”, points out how the comparison doesn’t take into account the added functionality of OpenOffice.org”

    DerekBerube’s comments were speculation. His only “evidence” for CONCLUDING that OO.o validates and MSO doesn’t validate is that Excel is faster.

    By the way, when have you ever seen any published benchmark not test the extremes? It would be like having a college entrance exam where you only tested single-digit integer arithmetic. Also note that the sample is only 50 MB uncompressed with Excel and loads in 2 seconds. It’s also interesting that you left out the comments on my blog where users say they do deal with large Excel files. In my experience, I know I need large Excel files, and I’ve seen many other instances where others need them as well.

    The fact of the matter is, the results for this large file can be scaled down. If you have a file that’s ten times smaller, it will load ten times faster. Instead of 170 seconds, it will take 17 seconds to save a file. That isn’t acceptable to me, since I do a lot of saves while I’m working on a large file because I don’t like risking losing any progress.

    All I do is present the data. It’s up to you to interpret it. Just don’t make anything up to refute it.

  2. Good point that “validation” functionality is speculation — I certainly don’t know what Excel does. I also agree that benchmarks should test extremes, though I don’t agree with your analogy of the exam. A good many (maybe not the majority, but plenty) of the comments questioned whether such large files were relevant for them. I obviously didn’t cite all the comments; readers can go see for themselves. But I don’t consider OOo’s speed with large files (arguably slower no matter what you test) the most important point in a comparison with MS Office — which is my point. Calling it a “pig” is a bit sloppy, I think. Perhaps built more for comfort than speed. BTW, how important is support for OpenDocument?

    Thanks for writing.

  3. DOC files are recognized by just about everyone on the planet. WordPerfect users can read/write them, OO.o users can read/write them, every other office suite can read/write them. Don’t tell me that you would distribute documents in ODF, because you know most people will not be able to read it even two years from now.

    The “open” issue is a red herring to bust up Microsoft and you know it. That’s your goal, “getoffmicrosoft,” to begin with.

    Microsoft’s new Office 12 format is open; there are just some squabbles about the licensing terms.

  4. Okay, now it looks like the guy here, anyway, is just as territorial as Microsoft. “Well, I can’t explain the bloat, so let’s attack the messenger…” Sheesh. Bloat is bloat, and isn’t the whole open community just that: open? Open to praise and criticism. Maybe XML is not the standard to use…

  5. Mac and Linux users would not only have completely different operating environments and different results but also different motivations for using OpenOffice.org.

    I run Linux on a 1000 MHz Duron with 512 megs of RAM, and to me RAM has never been an issue. I also didn’t experience any lag in my system due to the opening of the XML document. I agree with the author of this blog that different users on different platforms experience different results. The fact is, if George wants to base his benchmark on writing to a compressed XML document, then so be it. But don’t expect that to be a valid benchmark or testing ground for the average user, since the average user is not writing compressed XML files. Even most professionals aren’t writing compressed XML files. Actually, if you (George) could explain what compressed XML files are used for, that would be great, because as a computer science student I have not come across it yet.
    Microsoft Office 12 is going to open document… great. Now if only I could afford it… hey, wait, why don’t I just get the free alternative, which is what open source is all about. It’s not there to take down proprietary software but simply to provide a free alternative to traditional proprietary packages. If you are unhappy about the way it writes compressed XML files, fine; if you’ve got the cash, then go for it. George says that the money is worth the time you save, but is that really so? I am not going to spend $200+ for writing an Excel document to a certain format. What a waste.

  6. I wonder if some of the difference in memory use between Microsoft’s application and OOo’s app is attributable to Microsoft’s use of “system” libs versus the userland libs used by OOo.

    George, if you’re reading this, did you compare the free memory before and after loading to make sure what Office said was the truth? This is the company that had the motto “DOS ain’t done till Lotus won’t run” and happily placed errors in Windows to toss warnings when it detected a different DOS underneath it, so you’ll forgive me if I have a hard time trusting anything they say, by their mouths or by the tools in their OS.
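
    For anyone who wants to try it, a rough cross-check is to watch system-wide available memory before and after launching the application. Here is a sketch using the third-party psutil package; the command line and the crude 60-second wait are placeholders, not a rigorous method.

# Sketch: compare system-wide available memory before and after launching
# an application, as a cross-check on what the app's own counters report.
# Assumes the third-party psutil package; the command line is a placeholder.
import subprocess
import time
import psutil

before = psutil.virtual_memory().available
app = subprocess.Popen(["soffice", "large-sheet.ods"])   # placeholder command
time.sleep(60)                        # crude: wait for the document to load
after = psutil.virtual_memory().available
print(f"drop in available RAM: {(before - after) / 2**20:.1f} MiB")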

  7. In many of today’s engineering fields and finance departments, files will most definitely get to be that large. I have generated Excel spreadsheets of that size, into the high MBs, from a database-driven customer information site for a company I used to work for (customer information, in a CSV style; these are pure results, and don’t even require any functions or calculations by Excel) and from multiple programs at my current job dealing with radiation analysis.

    If you look at the total time column for the separate Excel w/XLS and Calc w/ODF, then you will notice that Calc takes over 2 MINUTES (actually ~2.333 minutes). That’s pretty big compared to the ~2 seconds it takes for Excel. Just looking at these results, I would never suggest my customers use OO.org; I would suggest Office instead. It’s not worth having MY software look slow when it manipulates these files and then outputs them, only to have Calc spend over 2 minutes loading them. Come on, I have better things to do with my time, and so do my users. That’s enough time to warrant getting up and doing something else, which causes this person to lose their focus, which is a waste of their time.

    Linux users won’t see the difference because they don’t have Excel to compare it with. OO.org uses Java and it is fair to compare the two (OO.org on Linux versus OO.org on Windows, on identical machines), but Linux users bringing up Excel are simply fudging numbers; it’s not the same environment, let alone the same machine. I would be interested in seeing how Mac Excel looks compared to Calc on the Mac.

  8. I think there might be more going on here than meets the eye. OpenOffice 2.0 runs fine (and starts quickly) on my home Win98 PIII machine – but runs like the proverbial “pig” on the XP Pro 1.7 Celeron machine I use at work. Could MS have got in and nobbled OO 2.0 on XP??????

  9. Pingback: Netweb » Blog Archive » Performance analysis of OpenOffice and MS Office

  10. Subj: OOo is a good apple. Nothing negative has been proven, especially in comparisons with MSO, the orange.

    The Cliffs Notes version of the following long post is this:

    OOo is not slow or a pig (though some say it is a bit slower than MSO, at least on Windows). It simply has to parse a text xml file to load/store. MS’s “xml” file is nothing at all like the opendoc xml. george_ou says something about how MSO is still 7 times faster in processing xml, but george, not all xml is created equal. Excel formats are predigested regurgitation, even the xml-ized newer versions. This pre-processed meal has drawbacks, but it does allow for rapid load/store. OOo can improve in this aspect; however, this easily explains the slowness you ascribe categorically to OOo. MSO would be similarly slow as molasses if it had to load/store a 200M+ content.xml OOo file. I wonder if this is at all a reason why MS doesn’t want to support opendoc? [And as a side benefit to Microsoft for not supporting opendoc, they hold the sole keys to the unscrambling of your apparently not very precious data]

    As for memory bloat, OOo has to have a lot more code because it isn’t preloaded like MSO’s code is. MSO shares a lot with Windows code routines, but since these are already loaded (since Windows has to work and use these APIs) we don’t see MSO use or load them. The analyzer you used is almost surely not going to know to add the pre-loaded Windows stuff that MSO borrows from Windows as part of MSO’s memory consumption (Windows already has dibs on it). Additionally, OOo may not make as many Windows API calls as it could (to reduce bloat) because it is a cross-platform package that is likely to have as much platform-independent code as possible. Finally, Microsoft is a rotten @#$@#% that can so easily cheat. Remember, at the end of the day, there is no level playing field in testing MSO vs OOo on Windows! The only point is that if you have enough money to blow on Windows and extra expensive software licenses (many do) then you might as well stay with the whole MS stack since MS is unlikely to cheat against itself (though I can’t be too sure on this). Naturally, you are forever at the mercy of Microsoft, trusting that they will not abuse, steal, sell, study, delete, modify, corrupt, or lock your data, since they control all the software and the hardware (via Windows). I don’t like those odds, especially considering past MS behavior ([disclaimer] which isn’t a guarantee of future behavior).

    **************************************************

    Several points.

    In part 1, I want to explain some things in general about systems and resources (DLLs and memory consumption, really). It will be an attempt to clarify what others have said. In part 2, I want to mention that the time it takes to open and save files says little about how much of a snail the application is in achieving other results. I’d also like to remind people that Windows is not a neutral playing field and that Microsoft has serious incentives to and is known for hijacking the competition. Finally, I want to thank george_ou because I think it is very possible he has tried to be sincere; regardless, there are potential OOo weaknesses he has brought to our attention, and this can only help improve OOo faster. I’ll end (in part 3) by suggesting 2 “improvements” to OOo [these are improvements in the sense that they may help OOo to behave more like MSO than it appears to do now].

    ******** Part 1 ********

    Do the test results provide conclusive evidence that OOo is a memory hog or even a memory pig? Oink, no!

    I have two main explanations that should make it obvious that we cannot seriously talk about relative pigginess without having access to Microsoft’s Windows and Office source code (including the build instructions) to compare with the OOo code which we do in fact have in totality.

    **** Part 1a ****

    First, is the concept of splitting code between different modules, for example, a split of application code and responsibilities between 2 separate binary programs or between a main program and a library. Here is what I am getting at. Say we have to write an application, so we go ahead and write it all up into one big binary. That would be great except that it would be a waste of space if I then write a second application that does many of the things the first one does (eg, write pixels to the screen or process html or manage pop up menus).

    Long ago computer scientists realized there were ways to reuse code. Consider what a typical computer processor (CPU) does. It grabs bytes from somewhere and interprets the bytes by acting on them. One possible such byte instruction sequence is to store a value in memory. Another possibility is to jump to a particular memory location to fetch the next instruction bytes from there instead of continuing the fetching in order. Another instruction could be to add two numbers.

    To get reuse, one of the most common methods is similar to the following pseudo-code:

    Instruction 1;
    Instruction 2;

    Instruction 23;
    Save current instruction location to a safe place;
    Jump to the reusable procedure location;

    Instruction 26;

    Instruction 2343;
    Instruction 2344;
    Save current instruction location to a safe place;

    Jump to the reusable procedure location;

    Instruction 2347;

    Instruction 2877;
    Save current instruction location to a safe place;

    Jump to the reusable procedure location;

    Instruction 2880;

    Procedure Instruction 10000;
    Procedure Instruction 10001;
    Add memory 500000 with 7 and store it back into 500000;
    Procedure Instruction 10003;

    Procedure Instruction 10251;
    Return to location stored in safe place;

    What the above demonstrates is that there is a very simple way to reuse code in programs. The basic idea is to put the common code in a particular area of memory (in the above example, the procedure starts at 10000 and lasts for 253 instructions) and then to have an instruction jump to this procedure starting location (10000) whenever the application wants to execute those procedure instructions. The only trick is to remember to store where we were coming from so that the CPU knows where to return (this is what allows the procedure to be reused from many different locations without hard-coding the specific jump-back location ahead of time into the procedure). The alternative to reusing code is simply to replicate the entire batch of instructions (the procedure) everywhere we want. This is silly since some procedures are typically executed in programs from many many many places. E.g., writing something to the monitor requires basically the same monotonous group of commands except that the actual thing that is written is variable (so we parameterize that variable item but reuse the general instructions block). In fact, not only can we reuse/call the procedure from many parts within a single application, we can reuse it across applications (Acrobat Reader does many similar things as AutoCAD).

    I hope the above explanation is not insulting to anyone. It is difficult to just guess that computers sort of work as described above if you don’t have experience with programming (through computer science classes or by reading a good book). Also, I want to make sure we are all on the same page before continuing.

    So, Microsoft has these 10 applications, all of which process html, draw to the screen, initialize a window, etc. All of these commonalities can be abstracted into functions and related functions usually are further grouped into specific libraries (or DLLs). Really, the operating system’s whole reason for being is to provide nothing but many of these abstractions so as to make the life of application builders easy and also to keep different applications from stepping on each other’s toes [If each application decided to store their information in the same location, they would interfere with each other. This is why generally the application programming language doesn’t let you pick specifically where to put things but defers that decision to the operating system and then the application accesses its data and its jump-to instruction target locations as offsets (relative locations) from what is given to it at run time by the operating system].
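
    As a tiny illustration of reusing code that lives in a shared library (my own sketch, in Python, assuming a Linux box with the usual glibc shared library installed), the two unrelated “applications” below both jump into the same already-present C library instead of each carrying its own copy of the code:

# Sketch: two "applications" reusing one shared system library rather than
# each carrying its own copy of the code, the same idea as the DLL reuse
# described above. Assumes a Linux system with glibc available as libc.so.6.
import ctypes

libc = ctypes.CDLL("libc.so.6")       # attach to the shared library

def app_one():
    return libc.strlen(b"first application")

def app_two():
    return libc.strlen(b"second, unrelated application")

print(app_one(), app_two())           # both calls run the same library code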

    Here is where the testing results (eg, by george_ou) about memory usages come into play and where they can allow one to come to wrong conclusions. Let me assume that Microsoft is a saint. For my second point coming up (part 1b), I will discuss this issue of saintliness, but right now I can assume Microsoft is quasi-holy.

    I write MS Office (quite a task for a single individual, I know, but I am still mostly mortal, do not fear). What I am struggling with is how much code to put inside this “application” called Office and how much to put inside a “DLL” located somewhere in the bowels of Windows. Finally, after conferring with my colleagues at Microsoft I come to a decision (yeah right, anywhere else in the world this get-together would have happened way before the product was finished (umm, well, ah…), but don’t be surprised if Microsoft works a little crazy in this sort of way.. you never know).

    Let’s see, Internet Explorer, the Window Manager app, Explorer, MSPoltergeist, whatever, these all have many things in common, so I decided to partition out of the main application about 75% of the code. In other words, the MSOffice proper will have 25% of the code and the rest will be reused from DLLs. The point is that I could have made it less, I could have made it more, I had a lot of leeway. What was taken into account was the perspective that I, working for Microsoft and being so close to Windows which provides so many services for everyone, had access to a lot of procedures that Windows coders already provided. Microsoft makes many general procedures (because Windows — an operating system — supports everyone — all apps) and Microsoft makes a lot of applications. Overall, while some of the procedures and libraries are available to people outside Microsoft, naturally there are internal libraries that are not very general or haven’t been made available to 3rd parties (yet). For example, Adobe has its own DLLs that they reuse among Adobe applications. OOo has DLLs that they reuse among its separate component apps. Intuit has …. Similarly Microsoft has these private DLLs. The thing is, and this is the main tricky point to note for now, that MS Office shares these internal DLLs with Windows. Why? Because MS Office and MS Windows are made by the same company, Microsoft. Thus if you load Windows (duh, it’s the operating system) you already have pieces of MS Office loaded and they may not show up in any memory tester thing since this is code that really belongs to Windows. Only Microsoft has this memory “advantage” because Adobe and everyone else has to load these at runtime (or pre-load but george_ou doesn’t do that), while MS Office already has these utility DLLs loaded and functioning within Windows. Further, because of Microsoft’s vast scope and Windows’ vast scope, these private MS utility DLLs can be quite significant in size… i.e., if MS Office needs the functionality, chances are that some other MS application (or Windows itself) already needed it and had it loaded. A more accurate split could easily be 10%/90% or even more lopsided if the “app” is just a shell with all the real work being done by libraries [This skewed ratio is very common in Linux (eg, see the shell app wrapper of OOo) since the main app can be a command-line program for X, Y, or Z platform, a visual windowed program or ten, or can be incorporated into some other unrelated app simply via the calls to the library workhorse routines .. providing lots of reuse and integration.. yes, integration, despite what you may have heard on the FUD-vine about Linux].

    Again, the point here is that it is natural to have DLLs that are used by a company, but even if you are Microsoft, you have not had the time, inclination, or need (they are saintly remember) to provide it as public. [Just think of Adobe private DLL reuses and then apply the concept to Microsoft since not everything is going to be a public Windows interface as it is code that is specific to the programming style and corporate history and culture of the software company.]

    We can add to this the fact that OOo is multiplatform and thus there could be cases where they can choose to use MS public DLL procedures but don’t and instead build their own in order to keep the Windows version close to the Linux and Mac versions (i.e., minimize the platform dependencies). This is not a trivial point. Remember OOo is multiplatform. You have to expect more “bloat” simply because there is less integration and reliance on Windows DLL shortcuts than if OOo were Windows-only.

    The previous paragraphs deal with memory bloat (non) issues. Further on, in part 2, I cover slow open/save.

    **** Part 1b ****

    Now, let me get into my second point and discuss Microsoft’s lack of saintliness. What this means is that there are even greater reasons to get: Home team (Microsoft) 77, Visiting team (OOo) 14, whether we are talking about memory hogging or about hog slowness.

    To get to the point: Microsoft, they ain’t no saints.. period. There are marketing reasons and just hardball reasons why they would want to keep some DLLs private or provide bad or bloated public versions. Keeping good DLLs private allows MSO/MS to stand out from the field in terms of performance (speed). Keeping lots of stuff private in general allows MSO/MS to stand out from the field in terms of memory usage: MS is lean and MS Office is saving you from buying loads of extra memory.

    Also, I’d like to note that all of these utilities used to collect system vitality signs and process stats have to rely on what Windows spits out to them. No one but the operating system and its designated driver can know the computer system details. All such utilities are either created by Microsoft, or are created by third parties but must defer to the info Windows provides for them. No 3rd-party application can go over Windows’ head (except through a bug in Windows or just awful engineering); all are limited to regurgitating the accurate or inaccurate stats provided by Windows. [Remember that applications are timed in and out to share in the virtual (fake) multitasking on the single CPU (or >1 CPU but usually fewer CPUs than software processes running). When and for how long they time out, only Windows programmers in Microsoft (and some MS execs) really know.. it is part of Microsoft’s Windows’ secret sauce.. don’t be surprised if MS Office or any other MS app kicks royal ttub in any Windows-sponsored taste test. Windows decides how many clock cycles to give to all competitors and they then report the end results (as they see it).]

    Microsoft controls the whole beauty pageant, and I repeat, they ain’t no saints, period. Their veil is thick, and it is to the credit of the people that have put OOo (and its predecessors) together that so much about MS’ secret handshakes and cryptic formats could be deduced.

    [To really kill the real really big hog, we need to remove Windows and install Linux, really.]

    ******** Part 2 ********

    Let me reference this posting: http://www.zdnet.com/5208-10533-0.html?forumID=1&threadID=14492&messageID=294600 i.e., at http://blogs.zdnet.com/Ou/?p=120 post by yourme dated Oct 31 2005

    Basically what it says is that OOo can be the most kick-ttub app ever, but it will necessarily require a certain amount of time to digest any really large xml file. This would also be true of MSO (except for the kick-ttub part) except that Excel files are not xml text files that require heavy-duty processing and parsing. I am guessing at this, but it is much more reasonable that Excel’s binary files (even their pseudo-xml formats) are memory dumps of the already digested data (plus some scrambling added to throw off the reverse engineers). This is what gives the speed-up in loading and saving, and as a side effect, Excel formats are difficult for non-MSO programmers to understand (but can be very fragile and likely to go “poof” at the whims, loss of financial incentives, or simple incompetence of Microsoft).

    This makes too much sense and can help identify yet one more reason why MS would not want to go head to head against OOo and others in loading/saving to opendoc text xml formats since they would have no advantage of the quicker format. [i.e., they wouldn’t be able to get their typical handicapped 990 meter head start in the 1000 meter race]

    Here is an example of xml processing vs dumping:

    A file may look like this: [332 characters @ 1 bytes per character in length]

    h-stuffmore-h-stuff

    cell value cell2
    cello helloworld
    texthere

    Conceptually, this may lead to the following variables and values in memory after parsing:
    documents=1
    document[0].default_font=SOME_DULL_FONT
    document[0].default_italicized=false
    document[0].header=true
    document[0].subheaders=2
    document[0].subheader[0].value=”h-stuff”
    document[0].subheader[1].value=”more-h-stuff”
    document[0].pages=1
    document[0].page[0].tables=1
    document[0].page[0].table[0].rows=3
    document[0].page[0].table[0].row[0].cells=2
    document[0].page[0].table[0].row[0].cell[0].value=”cell value”
    document[0].page[0].table[0].row[0].cell[0].font=DEFAULT_FONT
    document[0].page[0].table[0].row[0].cell[0].italicized=DEFAULT_ITAL
    document[0].page[0].table[0].row[0].cell[1].value=” cell2”

    What happens is that OOo has to take the file at top and look at each byte one by one, and as it sees each byte, it has to determine what item in the conceptual model to fill up. As the link above shows (link to george_ou article talkback), there can be many instructions and testing to figure out where in the xml file we are. At some point, it is known, and some of the data is saved in the proper place (eg, when we figure out the cell we are currently parsing has a value of “cell value” or that the font was not specified and so defaults to some value) and then the parsing continues. After a lot of processing, we get the full memory model. All of this processing is time that can be saved by Excel and by any other app that saves in a more raw format. I’ll describe the raw format in a second.
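
    To make that concrete, here is a toy sketch of the parse step (my own Python, with made-up element names rather than real opendoc tags): walk the markup and fill in an in-memory model of rows and cells.

# Sketch of the parse step described above: scan a tiny spreadsheet-like XML
# document and build an in-memory model of its rows and cells. The element
# names (table, row, cell) are made up for illustration, not real ODF tags.
import xml.etree.ElementTree as ET

SAMPLE = """<document>
  <table>
    <row><cell>cell value</cell><cell>cell2</cell></row>
    <row><cell>cello</cell><cell>helloworld</cell></row>
  </table>
</document>"""

def parse_model(xml_text):
    root = ET.fromstring(xml_text)    # every byte of the markup is scanned
    model = []
    for table in root.findall("table"):
        rows = []
        for row in table.findall("row"):
            rows.append([cell.text or "" for cell in row.findall("cell")])
        model.append(rows)
    return model

print(parse_model(SAMPLE))
# [[['cell value', 'cell2'], ['cello', 'helloworld']]]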

    So the above conceptual model is just to make things clear; it shows how this might look as programming code (it looks more like JavaScript than C code, but this is just hypothetical anyway). In actual computer memory the above might be expressed as follows:

    byte 0: 1
    byte 1: 0
    byte 2: 0

    byte 3: 1
    byte 4: 2
    bytes 5 through 6: 7
    bytes 7 through 14: ‘h’, ‘-‘, ‘s’, ‘t’, ‘u’, ‘f’, ‘f’
    byte 15: ….

    I don’t feel like continuing the above in hypothetical detail nor do I feel like calculating the total bytes in memory this hypothetical would yield, but if I did go through this exercise completely, the total number of bytes used might come out to be a high number but likely less than 332. Let’s hypothesize 250 to make things concrete.

    We have to understand that the above memory description (byte xxx: yyy) is a short-cut because if I put in the actual values (eg, 0 or 1 or 2 or 7 etc.. think ascii) we could not see it in text or in html and hence in this post. If we tried to print the actual memory as a dump to the screen, all of these 0’s, 1’s, 2’s, 7’s, etc could easily end up as nonprintable characters or funny looking symbols or worse, but the point is that in memory this number of bytes (250) is the space that would be required to capture this entire document. Now, assuming that this data is as above (250 compact bytes without padding, aligning, etc all in contiguous memory) then we can spit this out as MyRawFile.raw.OOo very very quickly. Certainly quicker than evaluating the data and regenerating all of the tags of an xml file. There would be no reverse-parsing but just a quick write of these 250 bytes from buffer to file.

    To reload this file, we would not have to parse any xml file. Instead, we would just slurp back 250 bytes straight into memory. Under this method, we could in short time be worrying about displaying the effect of these 250 bytes while the old OOo is still only starting its parsing, having gone through say 10 bytes and 240 instructions that check and test and set parse state variables and copy to buffers, etc. [Reference the BASIC-English pseudo-code mentioned in the talkback linked at the beginning of Part 2 above. It gives an idea of how it can take tens of CPU instructions to do all the checking and updating necessary to process just a few characters from any xml file (and we have 332 total bytes to scan and parse). This contrasts very sharply with using those tens of cycles to slurp tens of bytes that need no processing.]

    Dumping/undumping slaughters ordinary text unparsing/parsing.
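
    A quick sketch of that contrast (my own Python; pickle stands in for the hypothetical raw .raw.OOo dump, and the table size is arbitrary): the same data is loaded once by parsing XML text and once by slurping back a pre-digested dump.

# Sketch of the contrast above: the same table is "saved" once as XML text
# (which must be parsed on load) and once as a raw dump of the in-memory
# object (which is just slurped back). pickle stands in for a raw format.
import pickle
import time
import xml.etree.ElementTree as ET

rows = [[f"r{r}c{c}" for c in range(40)] for r in range(5000)]

xml_doc = "<table>" + "".join(
    "<row>" + "".join(f"<cell>{v}</cell>" for v in row) + "</row>" for row in rows
) + "</table>"
raw_dump = pickle.dumps(rows)

t0 = time.time()
ET.fromstring(xml_doc)                    # parse every byte of the XML text
t1 = time.time()
pickle.loads(raw_dump)                    # slurp the pre-digested bytes back
t2 = time.time()
print(f"parse XML: {t1 - t0:.3f}s   load raw dump: {t2 - t1:.3f}s")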

    Of course, the remaining major step of printing to the screen is not free, but it is much cheaper than the parsing we thankfully just bypassed via pre-digested file formats. As a sanity check: 600×800 = about half a million pixels have to be sent to the monitor 50 times a second = 25 million bytes to a first approximation in one second (modern CPU clocks and motherboard bus clocks and/or the clocks of video cards are fast enough for this). Now parsing a 332-byte file, while slower than flying through 250 raw bytes, is still doable in much under a second. However, how might a 200M+ xml file, such as the one used in the real-life trial, match against the pixel thing above?

    First the 250 million bytes require the much slower disk access (say, 1000 times slower than memory accesses which is how/where pixels are stored). They require many instructions per byte to process vs pixel data that is all ready to go. It isn’t hard to believe that one can take minutes while the other a portion of 1 second. The conclusion is that for a 200M+ file from disk that needs parsing, we can take much longer than a few seconds. If we regurgitated such a large file, we could do it at least 10 times quicker and then we’d be left with some in memory processing (or deobfuscation) and the pixel processing and sending to monitor which is all done in memory and is in manageable (if large) quantities. [note, someone posted that the ms file was several times smaller than 200M+ so that we can get a factor of several tens in time speedup instead of just 10 as guesstimated above (the 10, under the assumption of similar sized OOo and MSO files (text xml parse vs regurgitated stuff), is apparently in the ballpark, judging by the test results)]

    We can take this rawness stuff to an even higher level by actually saving in our hypothetical .raw.OOo file not just the text stuff and metadata but the actual pixelization that will be produced on the screen. This may be overkill though, especially for an app that wants to be cross-platform. There are just too many possible ways to pixelize and too many varying contexts (not to mention that the default fonts may differ on each computer, the drawing system calls differ widely depending on the DLLs (or equivalent libraries found in other platforms), etc). Still, it would be interesting to see how fast one could get OOo to open and show a file using all the pre-loading tricks possible and as much raw data as possible.

    [On Windows, it is possible MS has special secret system calls just for this sort of pixelization thing. Basically, the idea is to get all of the preprocessing out of the way that we can and then have specialized calls that just take the data and run with it.. without sacrificing portability to different machines (eg, different monitor characteristics). Since Windows works on a more limited selection of hardware platforms than Linux (and hence OOo), MSO can “cheat” to a higher degree than OOo is ever likely to be able to do… but that assumes MS-folk are as clever! And the extra speedup may not be large anyway]

    The above is a very rough approximation. There is a lot more info that would be in the document file (like creation time, author, colors, cell sizes, etc), and the pseudo-programming in the examples, as well as the primitive computer workings descriptions in this post and the other talkback, are very very simplified and only a very rough rough approximation to real code.

    ******** Part 3 ********

    Two ways to have OOo be more like MSO.

    Recommendation #1: Have Calc digest only one page (instead of all 16) and then display it and be responsive while it digests the rest at the time it is requested by the user (or in the background). This is very simple to do with xml. I would guess that this is already done, but these tests appear to say that maybe Calc tries to process the whole thing before rendering any visible single spreadsheet. This is just a usability/engineering decision that has little to do with file formats: do users want to wait for all to be done (so that second click is quick) or do they want to get started reading and playing with the visible spreadsheet number 1 as soon as possible? [Probably MSO outdoes OOo in the approach here] If the OOo parser insists on processing the whole file first, then maybe an alternate or new xml parser lib should be used. [SAX2 would allow this, I think]
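
    A sketch of what recommendation #1 could look like with a SAX-style parser (my own Python, with illustrative element names, not real opendoc markup): the handler collects the first table and then deliberately stops, leaving the rest of the file for later.

# Sketch of recommendation #1: a SAX handler that collects only the first
# <table> element and then stops, so the first sheet could be shown while
# the rest of the file is parsed later. Element names are illustrative only.
import xml.sax

class FirstSheetDone(Exception):
    pass

class FirstSheetHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_first_table = False
        self.seen_table = False
        self.cells = []
        self._text = []

    def startElement(self, name, attrs):
        if name == "table" and not self.seen_table:
            self.seen_table = True
            self.in_first_table = True
        elif name == "cell" and self.in_first_table:
            self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == "cell" and self.in_first_table:
            self.cells.append("".join(self._text))
        elif name == "table" and self.in_first_table:
            raise FirstSheetDone()        # first sheet fully parsed; stop here

sample = b"<doc><table><cell>a</cell><cell>b</cell></table><table><cell>zzz</cell></table></doc>"
handler = FirstSheetHandler()
try:
    xml.sax.parseString(sample, handler)
except FirstSheetDone:
    pass
print(handler.cells)                      # ['a', 'b']; second table untouched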

    Recommendation #2: Have OOo use opendoc as an official file format and as a common file transfer dialect with other office suites and apps, but allow it to use another native format that is essentially a straight data dump of the file contents as it exists in memory (well, a direct dump may be impossible but maybe something very close and easy to re-assemble). This way there is no time-consuming 200M+ parsing but the file can just be slurped (almost?) straight into memory. This feature can be the default, or it can trigger for cases where OOo deduces that the system resources and file size may warrant such action (or based on some configured parameters). This feature can also form part of an auto-save mechanism (like emacs or even MSO). It would be quick yet contain enough data to recapture everything in memory and then generate the opendoc xml. There are drawbacks to this approach, but I think it could be used in concert with normal opendoc file saving to improve the overall process, as it allows the user to regain control quicker after a save, thereby reducing the risk of corruption to some extent (mostly because the user would be less hesitant to save frequently, or the auto feature would do it and not be manually disabled by the user out of frustration). At any time afterwards (as a background process that doesn’t slow down the OOo responsiveness or at a later point in time like at the closing of OOo or every 24 hours), the native raw contents can be converted to the nice, shareable, scriptable, and robust xml.

  11. Part 2 mostly makes sense, but the conclusion may be incorrect, and the analysis at the end is certainly flawed (very skimpy). I got sloppy on the analysis (wrote it up last night in a rush). Basically, the disk access issue was thrown in at the end, but that is very significant, and I should have spent more time thinking about it [I was already doing a very rough analysis without even confirming the test results by george_ou (or profiling or looking at code….).]

    Disk access is expensive. The reasoning to Part 2 should have been that keeping a small xml file, though it may take say 10 times longer to process than some raw file, can come out quicker in the end because of all of the disk time saved. I considered that 200 and some megabytes was the size of the content.xml file being processed by OOo (it is if you expand it), but on disk it is a mere 3 megabytes or so and not the 200+. Thus the huge amount saved in reading the small file (“huge” when we contrast processor speed with disk access and disk read speed) works in OOo’s favor and may ultimately lead to a win for OOo if it can manage to take near linear time to process the file and take no more than say 10 or 100 times longer to process/parse the file bytes over a straight copy to memory (assuming disk access is 1000+ times slower).
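
    Here is the back-of-the-envelope version of that reasoning, plugging in the ratios assumed above (disk roughly 1000 times slower than a straight memory copy, parsing roughly 100 times slower). The numbers are illustrative assumptions, not measurements.

# Back-of-the-envelope model of the reasoning above, using the ratios assumed
# in this comment; every number here is an illustrative assumption.
MEM_COPY_MB_S = 1000.0                  # straight copy of raw bytes in memory
PARSE_MB_S = MEM_COPY_MB_S / 100        # parsing expanded XML text
DISK_MB_S = MEM_COPY_MB_S / 1000        # reading from disk

small_xml_on_disk, expanded_xml, raw_dump = 3, 200, 250     # sizes in MB

xml_route = small_xml_on_disk / DISK_MB_S + expanded_xml / PARSE_MB_S
raw_route = raw_dump / DISK_MB_S + raw_dump / MEM_COPY_MB_S
print(f"small XML route: ~{xml_route:.0f}s   raw dump route: ~{raw_route:.0f}s")
# With these ratios the small compressed XML file comes out ahead.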

    I would like to retract Part 3, Recommendation #2. Also, I would like to nullify the analysis of Part 2 though the examples are ok and may be helpful. The examples are true but do not consider that disk access is much slower than memory access. I just assumed all was done from/to same type of hardware [really, we have processor cache/register: faster than motherboard bus and memory: faster than disk transfer.]

    Another problem with the earlier post is that I used less-than and greater-than signs in the post, and because of the publishing as HTML some parts got deleted.

    What I decided to do is to just repost Part 1 with the things that got deleted and anyone can read the earlier post if they want to mull over the retraction above.

    *****************************************************

    ******** Part 1 ********

    Do the test results provide conclusive evidence that OOo is a memory hog or even a memory pig? Oink, no!

    I have two main explanations that should make it obvious that we cannot seriously talk about relative pigginess without having access to Microsoft’s Windows and Office source code (including the build instructions) to compare with the OOo code which we do in fact have in totality.

    **** Part 1a ****

    First, is the concept of splitting code between different modules, for example, a split of application code and responsibilities between 2 separate binary programs or between a main program and a library. Here is what I am getting at. Say we have to write an application, so we go ahead and write it all up into one big binary. That would be great except that it would be a waste of space if I then write a second application that does many of the things the first one does (eg, write pixels to the screen or process html or manage pop up menus).

    Long ago computer scientists realized there were ways to reuse code. Consider what a typical computer processor (CPU) does. It grabs bytes from somewhere and interprets the bytes by acting on them. One possible such byte instruction sequence is to store a value in memory. Another possibility is to jump to a particular memory location to fetch the next instruction bytes from there instead of continuing the fetching in order. Another instruction could be to add two numbers.

    To get reuse, one of the most common methods is similar to the following pseudo-code:

    {program instructions in memory start here}
    Instruction 1;
    Instruction 2;

    Instruction 23;
    Save current instruction location to a safe place; {i.e., store 24}
    Jump to the reusable procedure location; {i.e., jump to location 10000}
    Instruction 26;

    Instruction 2343;
    Instruction 2344;
    Save current instruction location to a safe place; {i.e. store 2345}
    Jump to the reusable procedure location; {i.e., jump to location 10000}
    Instruction 2347;

    Instruction 2877;
    Save current instruction location to a safe place; {i.e. store 2878}
    Jump to the reusable procedure location; {i.e., jump to location 10000}
    Instruction 2880;

    {here is the reusable procedure location where we jump to in the above 3 cases; thus we reuse the following set of instructions those 3 times instead of having to reproduce them above}
    Procedure Instruction 10000;
    Procedure Instruction 10001;
    Add memory 500000 with 7 and store it back into 500000; {this is just an example of an instruction, namely “Procedure Instruction 10002”}
    Procedure Instruction 10003;

    Procedure Instruction 10251;
    Return to location stored in safe place; {this is instruction 10252, but more importantly, it “returns” or jumps to either 24, 2345, or 2878, or to anywhere else if this procedure was “called” from anywhere else using the above protocol}

    What the above demonstrates is that there is a very simple way to reuse code in programs. The basic idea is to put the common code in a particular area of memory (in the above example, the procedure starts at 10000 and lasts for 253 instructions) and then to have an instruction jump to this procedure starting location (10000) whenever the application wants to execute those procedure instructions. The only trick is to remember to store where we were coming from so that the CPU knows where to return (this is what allows the procedure to be reused from many different locations without hard-coding the specific jump-back location ahead of time into the procedure). The alternative to reusing code is simply to replicate the entire batch of instructions (the procedure) everywhere we want. This is silly since some procedures are typically executed in programs from many many many places. E.g., writing something to the monitor requires basically the same monotonous group of commands except that the actual thing that is written is variable (so we parameterize that variable item but reuse the general instructions block). In fact, not only can we reuse/call the procedure from many parts within a single application, we can reuse it across applications (Acrobat Reader does many similar things as AutoCAD).

    I hope the above explanation is not insulting to anyone. It is difficult to just guess that computers sort of work as described above if you don’t have experience with programming (through computer science classes or by reading a good book). Also, I want to make sure we are all on the same page before continuing.

    So, Microsoft has these 10 applications, all of which process html, draw to the screen, initialize a window, etc. All of these commonalities can be abstracted into functions and related functions usually are further grouped into specific libraries (or DLLs). Really, the operating system’s whole reason for being is to provide nothing but many of these abstractions so as to make the life of application builders easy and also to keep different applications from stepping on each other’s toes [If each application decided to store their information in the same location, they would interfere with each other. This is why generally the application programming language doesn’t let you pick specifically where to put things but defers that decision to the operating system and then the application accesses its data and its jump-to instruction target locations as offsets (relative locations) from what is given to it at run time by the operating system].

    Here is where the testing results (eg, by george_ou) about memory usages come into play and where they can allow one to come to wrong conclusions. Let me assume that Microsoft is a saint. For my second point coming up (part 1b), I will discuss this issue of saintliness, but right now I can assume Microsoft is quasi-holy.

    I write MS Office (quite a task for a single individual, I know, but I am still mostly mortal, do not fear). What I am struggling with is how much code to put inside this “application” called Office and how much to put inside a “DLL” located somewhere in the bowels of Windows. Finally, after conferring with my colleagues at Microsoft I come to a decision (yeah right, anywhere else in the world this get-together would have happened way before the product was finished (umm, well, ah…), but don’t be surprised if Microsoft works a little crazy in this sort of way.. you never know).

    Let’s see, Internet Explorer, the Window Manager app, Explorer, MSPoltergeist, whatever, these all have many things in common, so I decided to partition out of the main application about 75% of the code. In other words, the MSOffice proper will have 25% of the code and the rest will be reused from DLLs. The point is that I could have made it less, I could have made it more, I had a lot of leeway. What was taken into account was the perspective that I, working for Microsoft and being so close to Windows which provides so many services for everyone, had access to a lot of procedures that Windows coders already provided. Microsoft makes many general procedures (because Windows — an operating system — supports everyone — all apps) and Microsoft makes a lot of applications. Overall, while some of the procedures and libraries are available to people outside Microsoft, naturally there are internal libraries that are not very general or haven’t been made available to 3rd parties (yet). For example, Adobe has its own DLLs that they reuse among Adobe applications. OOo has DLLs that they reuse among its separate component apps. Intuit has …. Similarly Microsoft has these private DLLs. The thing is, and this is the main tricky point to note for now, that MS Office shares these internal DLLs with Windows. Why? Because MS Office and MS Windows are made by the same company, Microsoft. Thus if you load Windows (duh, it’s the operating system) you already have pieces of MS Office loaded and they may not show up in any memory tester thing since this is code that really belongs to Windows. Only Microsoft has this memory “advantage” because Adobe and everyone else has to load these at runtime (or pre-load but george_ou doesn’t do that), while MS Office already has these utility DLLs loaded and functioning within Windows. Further, because of Microsoft’s vast scope and Windows’ vast scope, these private MS utility DLLs can be quite significant in size… i.e., if MS Office needs the functionality, chances are that some other MS application (or Windows itself) already needed it and had it loaded. A more accurate split could easily be 10%/90% or even more lopsided if the “app” is just a shell with all the real work being done by libraries [This skewed ratio is very common in Linux (eg, see the shell app wrapper of OOo) since the main app can be a command-line program for X, Y, or Z platform, a visual windowed program or ten, or can be incorporated into some other unrelated app simply via the calls to the library workhorse routines .. providing lots of reuse and integration.. yes, integration, despite what you may have heard on the FUD-vine about Linux].

    Again, the point here is that it is natural to have DLLs that are used by a company, but even if you are Microsoft, you have not had the time, inclination, or need (they are saintly remember) to provide it as public. [Just think of Adobe private DLL reuses and then apply the concept to Microsoft since not everything is going to be a public Windows interface as it is code that is specific to the programming style and corporate history and culture of the software company.]

    We can add to this the fact that OOo is multiplatform and thus there could be cases where they can choose to use MS public DLL procedures but don’t and instead build their own in order to keep the Windows version close to the Linux and Mac versions (i.e., minimize the platform dependencies). This is not a trivial point. Remember OOo is multiplatform. You have to expect more “bloat” simply because there is less integration and reliance on Windows DLL shortcuts than if OOo were Windows-only.

    The previous paragraphs deal with memory bloat (non) issues. [Retracted sentence]

    **** Part 1b ****

    Now, let me get into my second point and discuss Microsoft’s lack of saintliness. What this means is that there are even greater reasons to get: Home team (Microsoft) 77, Visiting team (OOo) 14, whether we are talking about memory hogging or about hog slowness.

    To get to the point: Microsoft, they ain’t no saints.. period. There are marketing reasons and just hardball reasons why they would want to keep some DLLs private or provide bad or bloated public versions. Keeping good DLLs private allows MSO/MS to stand out from the field in terms of performance (speed). Keeping lots of stuff private in general allows MSO/MS to stand out from the field in terms of memory usage: MS is lean and MS Office is saving you from buying loads of extra memory.

    Also, I’d like to note that all of these utilities used to collect system vitality signs and process stats have to rely on what Windows spits out to them. No one but the operating system and its designated driver can know the computer system details. All such utilities are either created by Microsoft, or are created by third parties but must defer to the info Windows provides for them. No 3rd-party application can go over Windows’ head (except through a bug in Windows or just awful engineering); all are limited to regurgitating the accurate or inaccurate stats provided by Windows. [Remember that applications are timed in and out to share in the virtual (fake) multitasking on the single CPU (or >1 CPU but usually fewer CPUs than software processes running). When and for how long they time out, only Windows programmers in Microsoft (and some MS execs) really know.. it is part of Microsoft’s Windows’ secret sauce.. don’t be surprised if MS Office or any other MS app kicks royal ttub in any Windows-sponsored taste test. Windows decides how many clock cycles to give to all competitors and they then report the end results (as they see it).]

    Microsoft controls the whole beauty pageant, and I repeat, they ain’t no saints, period. Their veil is thick, and it is to the credit of the people that have put OOo (and its predecessors) together that so much about MS’ secret handshakes and cryptic formats could be deduced.

    [To really kill the real really big hog, we need to remove Windows and install Linux, really.]

  12. Just so there is no confusion, I do not now nor have I ever worked for Microsoft. The discussion above was for the illustration. E.g. Suppose I was trying to describe the thrill of being certain types of plants (um, my great great grandsomething was a tree, if you didn’t know): “Ok, so I am a large vine. Now, do I really want to strangle that bed of roses? Yes. Absolutely. I get a high slowly getting my grip around the stems, knowing their thorns provide no barrier against my tenacity….blah blah….”

    So please, don’t think I work for Microsoft. I would like to think I wasn’t that far off, but I was just role playing.

  13. George Ou observes:

    > DOC files are recognized by just about everyone on the planet.

    Undeniably; but amongst those people, I wonder how many have learnt to treat .doc-related and Word-related problems as an *acceptable* part of their computing experience?

    Years of investment by a company as rich as Microsoft should have produced an office suite that’s wholly reliable, supreme, a magnet, a gold standard worthy of its cost. Instead: there is a mass movement to seek a better alternative, and not just for reasons of cost. How massive a movement? Massive enough to worry Microsoft.

    Word is less than reliable. Fact.

    We should be universally dumbfounded that Microsoft have failed to produce a near-perfect product after so many years. Instead: I guess that “just about everyone on the planet” has learnt to accept the occasional, sometimes shocking problems associated with Word.

    OOo has flaws, we can’t deny, but I gain the impression that it’s less problematic than Word. See for example this extract from a discussion at Slashdot. My own experience is that OOo is the more reliable.

    Certainly I have found OOo to be a godsend on various occasions, as it has managed to open colleagues’ Word files that have become completely unreadable by Word itself. (Document corruption, presumably.) It ain’t right.

    In my IT support role: years of problems with Word and other Microsoft products have led to me offering this concise initial response when colleagues report problems with Word:

    “Yes, that’s right.”

    If I can fix it, I’ll fix it, but experience has taught me to start with low hopes, with a slim chance of pleasure only if the fix is successful. It should be the other way around: from the outset, I should have high hopes and near-certainty of pleasure to be gained through problem resolution. It’s a lottery, it’s a mess, it just ain’t right.

    I’m bashing unashamedly, and with good cause. In the past I have been utterly appalled at Microsoft’s response to major issues (“we’ll review this one in around a year”) that they described as “known issues” — but only if you actually jumped through hoops to quiz them on the subject. You review Microsoft’s published product information, Microsoft’s lists of known issues, guess what:

    Microsoft do not publish all known issues. Fact. And I’m talking about basic compatibility issues — things that prevented Word from launching — not issues that required non-disclosure.

    Enough said … end rant!

    DOC files may be recognized by just about everyone on the planet, but ubiquity does not equal quality.
