I have written before about converting Microsoft Word files into text or HTML using OpenOffice. However, the wizards I described in that article crashed once the number of files reached several hundred.
I wrote some macros to do the conversion, but they were scary-looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. DocConverter by Danny Brewer and Dan Horwood converts a whole directory of files at a time, between any formats that OpenOffice understands.
I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had one small problem, but it was my fault. Some of my files were corrupted and OO would not open them, which broke DocConverter and threw an ugly-looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro was not enough) and rerun the tool. Otherwise, it just ran.
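For what it is worth, a check along the following lines might catch some problem files before a batch run. It is only a sketch, and it assumes the corruption shows up in the first few bytes: legacy .doc files are OLE2 compound files, which always start with the same eight-byte signature. The names `looks_like_doc` and `split_by_header` are made up for this illustration and have nothing to do with DocConverter.

```python
from pathlib import Path

# Signature that starts every OLE2 compound file, the container used by legacy .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_doc(path):
    """Return True if the file starts with the OLE2 signature (hypothetical helper)."""
    with open(path, "rb") as handle:
        return handle.read(len(OLE2_MAGIC)) == OLE2_MAGIC


def split_by_header(directory):
    """Sort the .doc files in a directory into (plausible, suspect) lists."""
    plausible, suspect = [], []
    for path in sorted(Path(directory).glob("*.doc")):
        # Only the header is inspected, so damage further into the file goes unnoticed.
        (plausible if looks_like_doc(path) else suspect).append(path)
    return plausible, suspect
```

A file in the suspect list is not necessarily broken: RTF or HTML saved with a .doc extension would land there too, so it is safer to move those files aside for a manual look than to delete them.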
Once upon a time there was a pretty good set of tools in the Google Desktop installation (pdf2txt, ppt2txt, etc.).
Thanks, Alex,
I never actually installed Google Desktop (not that I totally distrust Google or anything), so I haven’t seen those tools. But it is good to know they exist (or is it now ‘existed’?).
I had a similar problem a few years ago. I finally managed to write a Delphi Pascal program that could convert doc to txt. But it was very slow and had to be baby-sitted (sat?). It’s a perennial problem for corpus building.
You could have simply used antiword on the command line. The SVN version of the Web as Corpus ToolKit (WaC TK) includes a module that uses antiword to include DOC documents in a corpus.
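As a rough sketch of what that could look like for a whole directory, the script below calls the converter once per file and writes whatever it prints to a .txt file. It assumes antiword is installed and on the PATH, that it prints the text to standard output, and that it reports an unreadable file through a non-zero exit status. The function name `convert_directory` and its `command` parameter are invented for this example; this is not the WaC TK module.

```python
import subprocess
from pathlib import Path


def convert_directory(src_dir, dst_dir, command=("antiword",)):
    """Convert every .doc in src_dir to a .txt in dst_dir; return the files that failed."""
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for doc in sorted(Path(src_dir).glob("*.doc")):
        # Assumption: the converter takes the file name as its last argument
        # and prints plain text to standard output.
        result = subprocess.run([*command, str(doc)], capture_output=True)
        if result.returncode != 0:
            failed.append(doc)  # remember the bad file and carry on with the rest
            continue
        # Write the raw bytes so no guess about the output encoding is needed.
        (out_dir / (doc.stem + ".txt")).write_bytes(result.stdout)
    return failed
```

Calling `convert_directory("docs", "txt")` would then return the list of files that could not be converted, so one corrupted document does not stop the whole run.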