Bulk converting doc files into txt (or html)

I have written before about converting Microsoft Word files into text or HTML using OpenOffice. However, the wizards I described in that article crashed once the number of files reached several hundred.

I wrote some macros to do the conversion, but they were scary-looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. DocConverter by Danny Brewer and Dan Horwood converts a whole directory of files at a time, from any format OpenOffice understands to any other.

I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had one small problem, but it was my fault. I had some corrupted files that OpenOffice would not open, which broke DocConverter and threw an ugly-looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro was not enough), and rerun the tool. Otherwise, it just ran.
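The failure described above, where one corrupted file stops the whole batch, can be avoided by converting one file at a time and recording the failures instead of aborting. The sketch below is not DocConverter. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless --convert-to`, which older OpenOffice builds may not. The function names, the timeout value, and the directory names are made up for illustration.

```python
import subprocess
from pathlib import Path


def soffice_convert(doc: Path, out_dir: Path, timeout: int = 120) -> None:
    """Convert one file to plain text with a headless office process.

    Assumption: `soffice` is a LibreOffice-style binary on the PATH.
    """
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "txt:Text",
         "--outdir", str(out_dir), str(doc)],
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,  # raise TimeoutExpired if the office process hangs
    )
    # The exit code alone may not reveal a failed conversion,
    # so also check that the expected output file was written.
    target = out_dir / (doc.stem + ".txt")
    if not target.exists():
        raise FileNotFoundError(f"no output produced for {doc.name}")


def convert_all(src_dir, out_dir, convert=soffice_convert):
    """Convert every .doc in src_dir; return (converted, failed) name lists."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for doc in sorted(src_dir.glob("*.doc")):
        try:
            convert(doc, out_dir)
            converted.append(doc.name)
        except (subprocess.SubprocessError, OSError) as err:
            # A corrupted or unreadable file is logged and skipped,
            # so the rest of the batch still runs.
            failed.append(doc.name)
            print(f"skipped {doc.name}: {err}")
    return converted, failed
```

Calling `convert_all("docs", "txt")` would then process the whole directory and return the names of the files that need a second look, so nothing has to be deleted by hand before a rerun.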

4 comments to Bulk converting doc files into txt (or html)

  • Alex Jaculin

    Once upon a time there were some pretty good tools in the Google Desktop installation (pdf2txt, ppt2txt, etc.).

  • Thanks Alex,

    I never actually installed Google Desktop (not that I totally distrust Google or anything), so I haven’t seen those tools. But it is good to know they exist (or is it now ‘existed’?).

  • whist

    I had a similar problem a few years ago. I finally managed to write a Delphi Pascal program that could convert doc to txt. But it was very slow and had to be baby-sitted (sat?). It’s a perennial problem for corpus building.

  • You could have simply used antiword on the command line. The SVN version of the Web as Corpus ToolKit (WaC TK) includes a module that utilizes antiword to include DOC documents into a corpus.
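
The antiword route from the last comment can be sketched in a few lines. This assumes the `antiword` binary is installed and on the PATH, and that it prints UTF-8 text under the current locale. The `doc_to_text` name and its `run` parameter are hypothetical, and this is not how WaC TK itself calls antiword.

```python
import subprocess


def doc_to_text(doc, run=subprocess.run):
    """Return the plain text that antiword prints for one .doc file."""
    result = run(
        ["antiword", str(doc)],
        capture_output=True,  # antiword writes the extracted text to stdout
        check=True,           # raise CalledProcessError if antiword fails
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

A call such as `doc_to_text("report.doc")` returns the text as a string, which can then be written to a `.txt` file or fed straight into a corpus pipeline.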
