Bulk converting doc files into txt (or html)

I have written about converting Microsoft Word files into text or html using OpenOffice before. However, the wizards I described in that article were crashing once the number of files reached several hundred.

I have written some macros to do the conversion, but they were scary-looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. DocConverter by Danny Brewer and Dan Horwood lets you convert a whole directory of files at a time, between any formats that OpenOffice understands.

I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had one small problem, but it was my fault: some corrupted files that OpenOffice would not open were breaking DocConverter and throwing an ugly-looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro was not enough) and rerun the tool. Otherwise, it just ran.
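
For anyone who would rather script the batch than run a macro, here is a minimal Python sketch of the same idea, with one difference: a file that fails to convert is recorded and skipped instead of stopping the whole run. This is not how DocConverter works. It assumes a LibreOffice `soffice` binary on the PATH and uses its headless `--convert-to` mode; the function names `soffice_to_txt` and `convert_all` are made up for illustration.

```python
import subprocess
from pathlib import Path


def soffice_to_txt(doc, out_dir, timeout=120):
    """Convert one file with a headless LibreOffice process.

    Assumption: a `soffice` binary that supports --convert-to is on PATH.
    """
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "txt:Text",
         "--outdir", str(out_dir), str(doc)],
        check=True,        # a non-zero exit raises CalledProcessError
        timeout=timeout,   # a hung process raises TimeoutExpired
    )


def convert_all(src_dir, out_dir, convert_one=soffice_to_txt):
    """Convert every .doc in src_dir; return (converted, failed) name lists."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for doc in sorted(src_dir.glob("*.doc")):
        expected = out_dir / (doc.stem + ".txt")
        try:
            convert_one(doc, out_dir)
        except (subprocess.SubprocessError, OSError):
            # Corrupted or unreadable file: note it and move on.
            failed.append(doc.name)
            continue
        # The converter may exit cleanly without writing anything,
        # so also check that the output file actually exists.
        if expected.exists():
            converted.append(doc.name)
        else:
            failed.append(doc.name)
    return converted, failed
```

Calling `convert_all("docs", "txt")` returns two lists, so the problem files can be inspected after the run rather than hunted down one crash at a time. The `convert_one` parameter is there only so that a different converter can be swapped in.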

5 thoughts on “Bulk converting doc files into txt (or html)”

  1. Once upon a time there was a pretty good set of tools in the Google Desktop installation (pdf2txt, ppt2txt, etc.).

  2. Thanks Alex,

    I never actually installed Google Desktop (not that I totally distrust Google or anything), so haven’t seen those tools. But it is good to know they exist (or is it now ‘existed’?).

  3. I had a similar problem a few years ago. I finally managed to write a Delphi Pascal program that could convert doc to txt. But it was very slow and had to be baby-sitted (sat?). It’s a perennial problem for corpus building.

  4. You could have simply used antiword on the command line. The SVN version of the Web as Corpus ToolKit (WaC TK) includes a module that utilizes antiword to include DOC documents into a corpus.
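
The antiword suggestion in the last comment can be sketched in a few lines of Python. This is not the WaC TK module, only an illustration: it assumes an `antiword` binary on the PATH that prints the document text to standard output, and that the output is UTF-8. The function name `doc_to_text` and its `command` parameter are hypothetical.

```python
import subprocess


def doc_to_text(doc, command=("antiword",)):
    """Run `command doc` and return whatever it prints to standard output.

    Assumption: the command writes UTF-8 text; undecodable bytes are
    replaced rather than raising an error.
    """
    result = subprocess.run(
        [*command, str(doc)],
        capture_output=True,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `text = doc_to_text("report.doc")` would then give the plain text of a single file, ready to be written out or added to a corpus.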

