Converting Doc Files to HTML?

ScottDaMan

New member
Mar 1, 2007
I've been doing a lot of content purchases over the last couple of months and most of them are delivered in .docx format.

What are you guys doing to convert the docx files to html that you can use on your sites?

Right now I am using Dreamweaver's File > Import > Word Document and then reformatting from there. This method sucks because it leaves a ton of double spaces, or even runs of 10 spaces, in the text.

So then I take that HTML file and run it through an HTML compressor (I have a free shareware program that works, and there are a few good sites online that do this for free as well) to remove all the extra whitespace.

Seriously though, my method sucks. How are you people doing this so that it doesn't take so much damn time? I just want to HTMLize the docs then add my images, links, and go.
 


When I have to work with shitty files like that I just copy and paste it into a notepad app (to remove all the word shit), add an opening <p> tag then swap line breaks with search and replace for </p><p> and close with another </p>. This is then copied and pasted into the site, which takes a few seconds.
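
For anyone who would rather script the notepad routine than do the search and replace by hand, here is a minimal sketch in Python. It assumes the text has already been pasted out of Word as plain text with one paragraph per line. The name `text_to_paragraphs` is made up for this example and is not part of any tool mentioned in the thread.

```python
import html


def text_to_paragraphs(text: str) -> str:
    """Wrap each non-empty line of plain text in <p>...</p>."""
    paragraphs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank lines between paragraphs
        # Escape &, < and > so the text is safe to drop into HTML
        paragraphs.append(f"<p>{html.escape(line, quote=False)}</p>")
    return "\n".join(paragraphs)


print(text_to_paragraphs("First paragraph.\n\nSecond paragraph."))
```

Skipping blank lines avoids the empty `<p></p>` pairs that a straight search and replace leaves behind when the writer put an empty line between paragraphs.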

Also, if I'm using Joomla, I just choose paste > as plain text from the editor, and then I don't have to add any tags. Again, a few seconds.
 
If I copy the contents from Word to notepad first to "sanitize" it, then move it to a blog or even Dreamweaver, it still doesn't remove the writer's double spaces. It did make the paragraphing work great, though. I think I will do it via notepad instead of using the DW import feature. Seems better. Thanks. It'll save some time at least.
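
The leftover double spaces can be squeezed out with one regular expression. A small sketch, assuming the text is plain text and that any run of spaces or tabs should become a single space (`collapse_spaces` is just an illustrative name):

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and trim each line."""
    cleaned = []
    for line in text.splitlines():
        # [ \t]+ matches one or more spaces or tabs in a row
        cleaned.append(re.sub(r"[ \t]+", " ", line).strip())
    return "\n".join(cleaned)


print(collapse_spaces("Two  spaces.     Five spaces. \nNext  line."))
```

It works line by line so the paragraph breaks survive. Trimming each line also gets rid of a trailing space at the end of a paragraph.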
 
Exactly how I do it. Thanks Guerilla.
 
hey brah,

Hope I understood your question correctly. I'm using WordPress for my sites and here's what I've been using. The formatting will be lost, btw.

Easily converts, 1 click without waiting for a single second.
Convert Word DOC to HTML

Time saved :D your html will be avail right after clicking on convert.

hope that helps out!
Thanks!
 
I used TextFixer yesterday and it leaves all the whitespace in place. So I have to take that conversion, move it into Dreamweaver, and use "Apply Source Formatting" to clean it up, and then I still have to remove smart quotes and other shit. To remove the whitespace within tags that DW doesn't touch, I have to load the file in a shareware program called Absolute HTML Compressor, which does the rest (after configuring it to be XHTML-strict friendly). To me, this is way too many steps to process 100,000 words of content bi-weekly for my custom CMS.

Maybe I'm the only one that is OCD about clean code. OCD to the point that:
Code:
<p>Content is written here. </p> <--that space pisses me off.
I was just wondering what everyone else did. Apparently most people just copy and paste from text and plop that shit into WordPress, save and call it a day. If you looked at that code though, you might see a footprint or two that tells Google that your copy was pasted from another source and not written directly within WordPress.

I'm going to give notepad another shot with the method that -God- suggests using find/replace. It sounds like that will shave a few steps out of this madness.
 
Went past 10 minutes testing things, here's my addition:

The method that -God- suggests doesn't solve the issue either: smart quotes come through without proper HTML encoding, double spaces (sometimes 5 spaces) are left in, the smart quotes still have to be found and replaced, etc.
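
The curly quotes that Word inserts can be swapped for plain ones in a single pass instead of one find/replace per character. A rough sketch: the character list is my own guess at the usual offenders (quotes, dashes, ellipsis), not a complete inventory of what Word produces, and `straighten` is a made-up name.

```python
# Map Word's typographic characters to plain ASCII equivalents.
# This list is an assumption; extend it for whatever your writers use.
SMART_CHARS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
}
_TABLE = str.maketrans(SMART_CHARS)


def straighten(text: str) -> str:
    """Replace typographic punctuation with plain ASCII characters."""
    return text.translate(_TABLE)


print(straighten("\u201cIt\u2019s done\u201d"))
```

If you would rather keep the curly quotes and just encode them properly, change the values in the table to entities such as `&ldquo;` and `&rdquo;`.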

To me, there seems to be a need here that someone should fill: a proper doc to HTML converter that writes clean code, removes added spacing, converts smart quotes and other Word junk to normal characters, etc. I'm not a programmer....
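
For what it's worth, a .docx file is a zip archive with the main text stored in `word/document.xml`, so the paragraph text can be pulled out with nothing but Python's standard library. This is only a sketch of the idea, not a real converter: it throws away all formatting, images and links, and treats every paragraph (including ones inside table cells) as a plain `<p>`. The function name `docx_to_html` is made up for this example.

```python
import html
import re
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_html(source) -> str:
    """Return the paragraphs of a .docx as clean <p> elements.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    out = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split across several <w:t> runs; join them
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        # Collapse any run of whitespace and trim the ends
        text = re.sub(r"\s+", " ", text).strip()
        if text:  # drop empty paragraphs
            out.append(f"<p>{html.escape(text, quote=False)}</p>")
    return "\n".join(out)
```

Calling `docx_to_html("article.docx")` (the file name is hypothetical) returns one `<p>` line per paragraph with no stray spaces before the closing tag. Old binary .doc files are a different format and will not open this way.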
 
Finally solved my issue with one program: Word Cleaner, but the damn thing is $99. Anyway, the demo was impressive. I edited the template for converting doc to XHTML to do a find/replace of smart quotes and other Word-entered characters, converted a doc file, and it wrote really nice code.

My favorite part of it? It does BATCHES of files. Select an entire folder of DOC files and convert it. It takes a few seconds to do 25 files. The time saved is worth $99, but a tip: switch the currency to euros and you'll save $10 or so because the current conversion to USD is favorable (if you want to license it).
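
For anyone who wants the batch part without buying anything, looping over a folder is the easy bit to script. This sketch says nothing about how Word Cleaner works internally. It just shows the folder loop, and it assumes you already have some single-file converter to pass in as `convert` (a function that takes a path and returns an HTML string). All names here are hypothetical.

```python
from pathlib import Path
from typing import Callable


def convert_folder(src: str, dst: str, convert: Callable[[Path], str],
                   pattern: str = "*.docx") -> int:
    """Run `convert` on every matching file in `src`, write .html files to `dst`.

    Returns the number of files converted.
    """
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src).glob(pattern)):
        html_text = convert(path)
        # article.docx -> article.html in the output folder
        (out_dir / (path.stem + ".html")).write_text(html_text, encoding="utf-8")
        count += 1
    return count
```

Keeping the converter as a parameter means the same loop works whether the input is .docx files or text files already pasted out of Word (pass `pattern="*.txt"` for those).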