Difference between revisions of "Reencoding MediaWiki pages"

From the Linux and Unix Users Group at Virginia Tech Wiki
imported>Cov
imported>Cov
Line 1: Line 1:
The following script is a work in progress. The end goal is to produce [[w:Wikitext]] from the HTML of a downloaded [[w:MediaWiki]] page. This is going to help importing old wiki contents. Thanks to the Google cache, the HTML output of every page of the old site could be saved, but it looks like the MediaWiki database tables were overlooked when backing up before the server switch.
+
The following script will convert HTML from a [[w:MediaWiki|MediaWiki]] page to [[w:Wikitext|Wikitext]]. The script was written to facilitate the 2009 [[VTLUUG servers|server]] migration.
  
 
=Script=
 
=Script=
 
<pre>
 
<pre>
sed -rn -e '/<!-- start content -->/,/<!--/p' page.html | \
+
## CLEANUP ##
sed -r -e '/<!--/d' \
+
# Comments
-e 's|</?p>||g' \
+
/<!--/d
-e 's|<br>|<br />|' \
+
# Table of contents
-e 's|<a href="([^"]*)"[^>]*>([^<]*)</a>|[\1 \2]|g'
+
/<table id="toc/,/<\/table>/ d
 +
# Paragraph tags
 +
s|</?p>||g
 +
# Anchor tags
 +
s|<a name="[^"]*"></a>||g
 +
# Make breaks XHTML
 +
s|<br>|<br />|g
 +
# Quotation marks
 +
s|’|'|g
 +
s|“|"|g
 +
s|”|"|g
 +
 
 +
## WIKIFY ##
 +
# Italics and bold
 +
s|</?i>|''|g
 +
s|</?b>|'''|g
 +
# Headings
 +
s|<h1>.*>(.*)</span></h1>|=\1=|g
 +
s|<h2>.*>(.*)</span></h2>|==\1==|g
 +
s|<h3>.*>(.*)</span></h3>|===\1===|g
 +
s|<h4>.*>(.*)</span></h4>|====\1====|g
 +
# Internal links
 +
s|<a href="http://vtluug.org/wiki/[^>]*>([^<]*)</a>|[[\1]]|g
 +
# External links
 +
s|<a href="([^"]*)"[^>]*>([^<]*)</a>|[\1 \2]|g
 +
</pre>
 +
 
 +
=Running=
 +
The following command writes a .wikitext file next to each HTML file in the current directory, ready to cut and paste.
 +
<pre>
 +
for f in *.html ; do sed -rn -e '/<!-- start content -->/,/<!--/p' "$f" | sed -r -f script > "$f.wikitext" ; done
 +
</pre>
 +
 
 +
=Copying=
 +
Once the .wikitext files are generated, open them, edit them by hand if necessary, and copy and paste them into MediaWiki. It is recommended to note in the summary box that the page is an import.
 +
<pre>
 +
gedit *.wikitext
 
</pre>
 
</pre>
  
=Todo=
+
=Effectiveness=
* Header tags to equal signs
+
The script was effective enough for our purposes when written, but it has some shortcomings. Images and local article links are handled poorly, and the script does not attempt to produce the brace-bar-dash table markup.
* Local links to article links
+
 
 +
[[Category:Scripts]]

Revision as of 07:19, 13 November 2009

The following script will convert HTML from a MediaWiki page to Wikitext. The script was written to facilitate the 2009 server migration.

Script

## CLEANUP ##
# Comments
/<!--/d
# Table of contents
/<table id="toc/,/<\/table>/ d
# Paragraph tags
s|</?p>||g
# Anchor tags
s|<a name="[^"]*"></a>||g
# Make breaks XHTML
s|<br>|<br />|g
# Quotation marks
s|’|'|g
s|“|"|g
s|”|"|g

## WIKIFY ##
# Italics and bold
s|</?i>|''|g
s|</?b>|'''|g
# Headings
s|<h1>.*>(.*)</span></h1>|=\1=|g
s|<h2>.*>(.*)</span></h2>|==\1==|g
s|<h3>.*>(.*)</span></h3>|===\1===|g
s|<h4>.*>(.*)</span></h4>|====\1====|g
# Internal links
s|<a href="http://vtluug.org/wiki/[^>]*>([^<]*)</a>|[[\1]]|g
# External links
s|<a href="([^"]*)"[^>]*>([^<]*)</a>|[\1 \2]|g
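To illustrate what the rules above do, here is a minimal sketch that applies a handful of them to two lines of made-up input. The sample markup, the temporary file, and the variable names are assumptions for illustration only; real MediaWiki output may differ. `-E` is used here as the portable spelling of the `-r` option.

```shell
# Hypothetical sample input imitating MediaWiki HTML output.
sample='<h2><span class="mw-headline">History</span></h2>
<p>The <b>old</b> wiki linked to <a href="http://vtluug.org/wiki/Main_Page" title="Main Page">Main Page</a> and <a href="http://example.org/" class="external text">Example</a>.</p>'

# A subset of the rules from the script, written to a temporary file
# so the single quotes in the bold rule need no shell escaping.
script_file=$(mktemp)
cat > "$script_file" <<'EOF'
# Paragraph tags
s|</?p>||g
# Bold
s|</?b>|'''|g
# Level 2 headings
s|<h2>.*>(.*)</span></h2>|==\1==|g
# Internal links
s|<a href="http://vtluug.org/wiki/[^>]*>([^<]*)</a>|[[\1]]|g
# External links
s|<a href="([^"]*)"[^>]*>([^<]*)</a>|[\1 \2]|g
EOF

result=$(printf '%s\n' "$sample" | sed -E -f "$script_file")
rm -f "$script_file"

printf '%s\n' "$result"
# Prints:
# ==History==
# The '''old''' wiki linked to [[Main Page]] and [http://example.org/ Example].
```

The internal link rule has to run before the external one, otherwise the external rule would consume the internal links first.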

Running

The following command writes a .wikitext file next to each HTML file in the current directory, ready to cut and paste.

for f in *.html ; do sed -rn -e '/<!-- start content -->/,/<!--/p' "$f" | sed -r -f script > "$f.wikitext" ; done
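The first sed in that pipeline keeps only the article body. As a rough sketch of how the address range behaves, the snippet below runs it on a made-up page; the page content, including the closing comment, is an assumption and not taken from a real saved page.

```shell
# Hypothetical saved page; only the "start content" marker comes from the command above.
page='<html><body>
<div id="header">Navigation</div>
<!-- start content -->
<p>Hello</p>
<!-- end content -->
<div id="footer">Footer</div>
</body></html>'

# Print from the start marker up to and including the next line containing a comment.
extracted=$(printf '%s\n' "$page" | sed -n -e '/<!-- start content -->/,/<!--/p')

# The comment rule in the script then drops the two marker lines.
cleaned=$(printf '%s\n' "$extracted" | sed -e '/<!--/d')

printf '%s\n' "$cleaned"
# Prints: <p>Hello</p>
```

Because the range ends at the first comment after the start marker, a page with a comment inside the article body would be cut short at that point.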

Copying

Once the .wikitext files are generated, open them, edit them by hand if necessary, and copy and paste them into MediaWiki. It is recommended to note in the summary box that the page is an import.

gedit *.wikitext

Effectiveness

The script was effective enough for our purposes when written, but it has some shortcomings. Images and local article links are handled poorly, and the script does not attempt to produce the brace-bar-dash table markup.
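One possible way to improve the handling of local article links is sketched below. It assumes that local links appear in the saved HTML with relative hrefs of the form `/wiki/Page_Name`, which may not hold for every page. The extra rule and the sample line are hypothetical and are not part of the script above.

```shell
# Hypothetical input with a relative local link.
sample='Read <a href="/wiki/Install_Guide" title="Install Guide">Install Guide</a> first.'

# Assumed extra rule: turn relative /wiki/ links into article links.
local_rule='s|<a href="/wiki/[^>]*>([^<]*)</a>|[[\1]]|g'
# External link rule as used in the script.
external_rule='s|<a href="([^"]*)"[^>]*>([^<]*)</a>|[\1 \2]|g'

# Without the extra rule the link is treated as external.
external_only=$(printf '%s\n' "$sample" | sed -E -e "$external_rule")
# With the extra rule placed first it becomes an article link.
with_local=$(printf '%s\n' "$sample" | sed -E -e "$local_rule" -e "$external_rule")

printf '%s\n' "$external_only"
# Prints: Read [/wiki/Install_Guide Install Guide] first.
printf '%s\n' "$with_local"
# Prints: Read [[Install Guide]] first.
```

Like the existing internal link rule, this sketch uses the link text as the article title, so links whose text differs from the target page would still need editing by hand.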