I have written about converting Microsoft Word files into text or HTML using OpenOffice before. However, the wizards I described in that article were crashing once the number of files reached several hundred.
I have written some macros to do the conversion, but they were scary-looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. DocConverter by Danny Brewer and Dan Horwood lets you convert a whole directory of files at a time from any format OpenOffice understands to any other.
I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had one small problem, but it was my fault: some corrupted files that OpenOffice would not open were breaking DocConverter, which threw an ugly-looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro did not help) and rerun the tool. Otherwise, it just ran.
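For anyone who would rather script the same idea outside of OpenOffice Basic, here is a minimal sketch of a batch run that skips corrupted files instead of dying on them. This is not how DocConverter works. It assumes a recent LibreOffice or OpenOffice build whose `soffice` binary is on the PATH and supports the `--headless --convert-to` options, and that the converted file is written as the original name with a `.txt` extension. The function names `soffice_to_txt` and `convert_all` and the 120-second timeout are my own inventions for illustration.

```python
import subprocess
from pathlib import Path


def soffice_to_txt(src: Path, out_dir: Path) -> None:
    """Convert one file with a headless office process (assumed to be on PATH)."""
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "txt:Text",
         "--outdir", str(out_dir), str(src)],
        check=True,    # raise CalledProcessError on a non-zero exit code
        timeout=120,   # raise TimeoutExpired if the process hangs on a bad file
    )
    # Assumption: the output is named <original stem>.txt. A missing output
    # file is treated as a failure even if the exit code was zero.
    if not (out_dir / (src.stem + ".txt")).exists():
        raise OSError("no output file was produced")


def convert_all(src_dir, out_dir, convert_one=soffice_to_txt, pattern="*.doc"):
    """Convert every matching file; return the list of files that failed."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(src_dir.glob(pattern)):
        try:
            convert_one(src, out_dir)
        except (subprocess.SubprocessError, OSError) as err:
            # A corrupted file is reported and skipped; the batch keeps going.
            failed.append(src)
            print(f"skipped {src.name}: {err}")
    return failed
```

Calling `convert_all("docs", "txt")` would then return the list of files that could not be converted, so they can be inspected or deleted afterwards rather than found one crash at a time.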
May 21st, 2008 at 5:54 am
Once upon a time there was a pretty good set of tools in the Google Desktop installation (pdf2txt, ppt2txt, etc.).
May 28th, 2008 at 7:05 am
Thanks Alex,
I never actually installed Google Desktop (not that I totally distrust Google or anything), so I haven’t seen those tools. But it is good to know they exist (or is it now ‘existed’?).
June 28th, 2008 at 10:21 pm
I had a similar problem a few years ago. I finally managed to write a Delphi Pascal program that could convert doc to txt. But it was very slow and had to be baby-sitted (sat?). It’s a perennial problem for corpus building.