When building a web app that accepts formatted input from users, you’ll often find that they copy and paste text from Microsoft Word. Unfortunately, Word fills the markup with a lot of unnecessary and unwanted muck. To clean it all up, I wrote the following function (implemented directly on the String prototype, using MooTools):
[sourcecode language=javascript]String.implement({
    sanitiseWord: function() {
        // Flatten line breaks so each pattern can match across lines
        var s = this.replace(/\r/g, '\n').replace(/\n/g, ' ');
        var rs = [];
        rs.push(/<!--.*?-->/g); // Comments
        // rs.push(/ ... (this pattern is cut off in the original listing)
        rs.push(/<(meta|link|\/?o:|\/?style|\/?div|\/?head|\/?html|\/?body|\/?span|!\[)[^>]*?>/g); // Unnecessary tags
        rs.push(/ v:.*?=".*?"/g); // Weird nonsense attributes
        rs.push(/ style=".*?"/g); // Styles
        rs.push(/ class=".*?"/g); // Classes
        rs.push(/(&nbsp;){2,}/g); // Redundant &nbsp;s
        rs.push(/<p>(\s|&nbsp;)*?<\/p>/g); // Empty paragraphs
        // Strip every match, one pattern at a time
        rs.each(function(regex) {
            s = s.replace(regex, '');
        });
        // Collapse the remaining runs of whitespace
        return s.replace(/\s+/g, ' ');
    }
});[/sourcecode]
If you’re not using MooTools, the function will look something like this (inside the body, swap MooTools’ rs.each for the native rs.forEach):
[sourcecode language=javascript]String.prototype.sanitiseWord = function() {
// function body here…
};[/sourcecode]
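For reference, a plain-JavaScript version might look like the sketch below. It is written as a standalone function rather than a prototype method. The name sanitiseWordMarkup is hypothetical, and the comment, &nbsp; and empty-paragraph patterns are assumptions about what those expressions are meant to match, so treat this as a starting point rather than a definitive implementation.

```javascript
// Hypothetical standalone variant: takes the markup as an argument
// instead of extending String.prototype.
function sanitiseWordMarkup(input) {
  // Flatten line breaks so each pattern can match across lines.
  var s = String(input).replace(/\r/g, '\n').replace(/\n/g, ' ');

  // Patterns to strip, in order. The comment, &nbsp; and empty-paragraph
  // expressions are assumed reconstructions.
  var patterns = [
    /<!--.*?-->/g,            // comments
    /<(meta|link|\/?o:|\/?style|\/?div|\/?head|\/?html|\/?body|\/?span|!\[)[^>]*?>/g, // unnecessary tags
    / v:.*?=".*?"/g,          // v: attributes
    / style=".*?"/g,          // inline styles
    / class=".*?"/g,          // classes
    /(&nbsp;){2,}/g,          // runs of &nbsp;
    /<p>(\s|&nbsp;)*?<\/p>/g  // empty paragraphs
  ];

  // Native Array.prototype.forEach replaces MooTools' each.
  patterns.forEach(function (regex) {
    s = s.replace(regex, '');
  });

  // Collapse the remaining runs of whitespace.
  return s.replace(/\s+/g, ' ');
}
```

Calling sanitiseWordMarkup(markup) returns the cleaned string and leaves the built-in String prototype untouched, which can be handy if other scripts on the page also extend it.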
Usage
[sourcecode language=javascript]var s = "(some awful Word markup)".sanitiseWord();[/sourcecode]
In one of the tests I ran, the input went from around 7000 characters to just 700.
Example
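Here is a small sketch of how a few of the patterns behave when applied one at a time. The sample fragment is invented for illustration (it is not real Word output), and only a subset of the expressions is used, so the result is indicative rather than a full clean-up.

```javascript
// Hypothetical Word-style fragment, invented for this illustration.
var sample = '<p class="MsoNormal" style="margin: 0cm">Dear ' +
  '<span lang="EN-GB">reader</span>,<o:p></o:p></p>';

// A subset of the clean-up patterns, applied in order.
var steps = [
  { label: 'unnecessary tags', regex: /<(\/?o:|\/?span)[^>]*?>/g },
  { label: 'styles', regex: / style=".*?"/g },
  { label: 'classes', regex: / class=".*?"/g }
];

// Record the markup after each pattern has been stripped.
var trace = [];
var current = sample;
steps.forEach(function (step) {
  current = current.replace(step.regex, '');
  trace.push(step.label + ': ' + current);
});

console.log(trace.join('\n'));
// Final result: <p>Dear reader,</p>
```

Each step only removes what its own pattern matches, so the paragraph tag and its text survive while the Word-specific wrappers and attributes drop away.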
Some of the regular expressions I used were adapted from C# ones in a post by Jeff Atwood.