Reencoding MediaWiki pages

From the Linux and Unix Users Group at Virginia Tech Wiki
Revision as of 05:32, 9 November 2009 by imported>Cov

The following script is a work in progress. The end goal is to produce Wikitext from the HTML of a downloaded MediaWiki page, which will help with importing the old wiki's contents. Thanks to the Google cache, the HTML output of every page of the old site could be saved, but it appears that the MediaWiki database tables were overlooked when backing up before the server switch.

Script

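# Print the article body: from the "start content" marker to the next HTML comment.
# Then delete the comment lines, strip <p> tags, normalize <br> to <br />,
# and rewrite external links as [url text]. Requires GNU sed (-r).
sed -rn -e '/<!-- start content -->/,/<!--/p' page.html | \
sed -r -e '/<!--/d' \
	-e 's|</?p>||g' \
	-e 's|<br>|<br />|g' \
	-e 's|<a href="([^"]*)"[^>]*>([^<]*)</a>|[\1 \2]|g'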
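
To make the effect of the pipeline concrete, here is a minimal sketch that wraps it in a function taking the HTML file as an argument. The function name extract_wikitext is a hypothetical one chosen for illustration, and the <br> substitution is applied globally. It assumes GNU sed and that every tag of interest sits on a single line.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline above; the name is illustrative only.
# Usage: extract_wikitext page.html
extract_wikitext() {
    # Keep only the lines between the "start content" marker and the next comment.
    sed -rn -e '/<!-- start content -->/,/<!--/p' "$1" | \
    # Drop the marker lines, strip <p>, normalize <br>, convert links to [url text].
    sed -r -e '/<!--/d' \
        -e 's|</?p>||g' \
        -e 's|<br>|<br />|g' \
        -e 's|<a href="([^"]*)"[^>]*>([^<]*)</a>|[\1 \2]|g'
}
```

For a saved page whose content area holds the single line <p>See the <a href="http://example.org/" class="external text">example site</a>.</p>, the function prints: See the [http://example.org/ example site].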

Todo

  • Convert header tags (<h1> to <h6>) to equal signs
  • Convert local links to article links ([[Page]])
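
One possible way to approach both Todo items is sketched below. It is not part of the original script. It assumes GNU sed, that each heading and link fits on one line, that headings contain at most <span> wrappers around plain text, and that local links point under /wiki/ and carry a title attribute holding the page name. The /wiki/ prefix and the function name todo_filters are assumptions for illustration; the saved HTML may use a different path.

```shell
#!/bin/sh
# Hypothetical filter for the two Todo items; reads HTML on stdin.
# Rule 1: on heading lines, strip <span ...> and </span> wrappers.
# Rules 2-7: <hN>text</hN> becomes text surrounded by N equal signs.
# Last rule: <a href="/wiki/..." title="Page">label</a> becomes [[Page|label]].
#            '#' is the delimiter because the replacement contains '|'.
todo_filters() {
    sed -r \
        -e '/<h[1-6]/s|</?span[^>]*>||g' \
        -e 's|<h1[^>]*> *([^<]*)</h1>|= \1 =|g' \
        -e 's|<h2[^>]*> *([^<]*)</h2>|== \1 ==|g' \
        -e 's|<h3[^>]*> *([^<]*)</h3>|=== \1 ===|g' \
        -e 's|<h4[^>]*> *([^<]*)</h4>|==== \1 ====|g' \
        -e 's|<h5[^>]*> *([^<]*)</h5>|===== \1 =====|g' \
        -e 's|<h6[^>]*> *([^<]*)</h6>|====== \1 ======|g' \
        -e 's#<a href="/wiki/[^"]*"[^>]*title="([^"]*)"[^>]*>([^<]*)</a>#[[\1|\2]]#g'
}
```

If these rules are merged into the main script, the local-link rule has to run before the generic link rule, since that rule would otherwise turn the local link into [/wiki/Page label] first. Section edit links and HTML entities in titles are not handled here.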