Extract text from a MediaWiki SQL dump

Published: Wed 29 April 2020
By Nik

In IT.


This was one of my heavier text processing tasks. The first problem is to find out where in the SQL dump the text that you actually entered into the wiki lives. A MediaWiki SQL dump is just one big pile of statements, but the text we want sits in a single table whose name is the wiki's name followed by "text". So, in the case of a hypothetical LittleKittenWiki, we search for something like INSERT INTO `littlekittenwikitext`.
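As a sketch of that search, with made-up file and table names (dumps generally wrap table names in backticks):

```shell
# Demo stand-in for the real dump; file name, table name and values are made up.
printf 'CREATE TABLE `littlekittenwikitext` (...);\n'          >  dump.sql
printf 'INSERT INTO `littlekittenwikitext` VALUES (42,...);\n' >> dump.sql

# Find the line (and its line number) that carries the wiki text.
grep -n 'INSERT INTO `littlekittenwikitext`' dump.sql
```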

All of the text sits on a single line, so we can copy it in one go. In Vim, for example, pressing V marks that whole line for copying.

This gives us a text blob that contains all the revisions of the text. Each revision consists of a leading version number and the text itself, like this: (42,Text …). So we want to split the blob up into separate files. I did it this way:

Split the text up in Vim, using this substitution:

:%s/\((\d\+,'= Little Kitties of the World\)/\r\r\1/g

I had first tried Sed, but sed, at least in its standard flavours, cannot output a newline in a search and replace. Then I tried Perl, with perl -p -e …, but Perl's regular expression syntax is slightly different from Sed's, so I would have had to rewrite my regex. D'oh! So I came back to Vim.
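For the record, GNU sed does accept \n in the replacement as a GNU extension, so the split would also have worked there. A sketch with made-up revision text standing in for the real blob:

```shell
# Two made-up revisions on a single line, standing in for the real blob.
printf "(41,'old text'),(42,'new text')\n" > blob.txt

# GNU sed extension: \n in the replacement inserts a real newline.
# & stands for the whole match, i.e. the "(NUMBER,'" that starts a revision.
sed "s/(\([0-9]\{1,\}\),'/\n\n&/g" blob.txt
```

Note that this relies on GNU sed; a strictly POSIX sed will print a literal n instead.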

It may be possible to write the blocks out to separate files directly from Vim, but I didn't bother to write a macro for that. Instead, I created the separate files with the ever useful Awk:

awk -v RS= '{print > ("littlekitties-part-" NR ".txt")}' littlekittiestext-all.txt
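The RS= is the trick here: an empty record separator switches Awk into paragraph mode, where records are separated by blank lines, which is exactly what the substitution above produced. A tiny round trip with made-up revisions:

```shell
# Two made-up revision blocks, separated by a blank line.
printf "(41,'old text')\n\n(42,'new text')\n" > littlekittiestext-all.txt

# RS= (paragraph mode): every blank-line-separated block is one record,
# and each record is written to its own numbered file.
awk -v RS= '{print > ("littlekitties-part-" NR ".txt")}' littlekittiestext-all.txt

cat littlekitties-part-2.txt
```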

I found this solution on Stack Overflow [1].

This gives a bunch of files (in my case, more than fifty). I assume that the number at the beginning of each block is the revision ID, so higher numbers should mean newer versions. Thus, I took the file with the highest number as the starting point for my text recovery.
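Since the revision number sits at the start of every block, picking the newest file can be scripted too. A sketch, again with made-up file names and contents:

```shell
# Made-up revision files like the ones Awk produced above.
printf "(41,'old text')\n" > littlekitties-part-1.txt
printf "(42,'new text')\n" > littlekitties-part-2.txt

# Pull the leading revision number out of each file, then keep the
# file with the highest one, presumably the newest revision.
for f in littlekitties-part-*.txt; do
  v=$(sed 's/^(\([0-9]*\),.*/\1/;q' "$f")
  printf '%s %s\n' "$v" "$f"
done | sort -n | tail -n 1
```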
