How to archive phpBB


phpBB is a popular forum script. The HiveWorldTerra Forums most recently ran on phpBB3, and previously phpBB2. Eventually the time came to close the forums (to save maintenance effort on a ghost of a community), after nearly fourteen long years of service. Unfortunately, the information on how to archive a forum was minimal.

Here is a summary of my phpBB forum archiving notes.

Initial Preparation

Certain features aren't required for archive forums (e.g. login/logout, posting, reporting, etc.), others are pointless (prompts to log in to view profiles) and others would be useful but can be provided in other ways (Google search is "good enough"). The first step is to make sure that our archive doesn't try to include these features.

First, go to the Admin Control Panel and add a new "bot" configuration and use the name "wget/". Next, make sure that Bots don't have the "Search" permission.

Once this is done, any requests using the wget command-line tool will be treated as Bot requests and will get the minimal content-only version of the site without interaction options.

Archiving the phpBB Forum

The next step is to archive the forum. This will use a command-line tool called wget to "crawl" the site (follow all of the links it finds) and save copies of the files. Make sure you've got plenty of high-speed network bandwidth for this bit (or run it from the server itself, ideally in another directory).

To archive the forum, run the following command:

wget -m -p -np -R "*sid=*,ucp.php*,memberlist.php*,*mode=viewprofile*,*view=print*,viewonline.php*,search.php*,posting.php*" https://forums.example.com

wget will then crawl your site and save a copy of each page under its original name.

Breaking down the command

For those who like to know what their commands do, here's an explanation of each argument:

-m
    Mirror a site (shorthand for -r -N -l inf --no-remove-listing: recursive download, unlimited depth, with timestamping)
-p
    Download extra resources for the page (images, etc.)
-np
    No Parent: don't go any higher up the directory structure (prevents wget crawling outside the forum if it is in a sub-directory)
-R
    Reject (don't keep) files that match this pattern. Note: wget may still download files that match to crawl them for extra links
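For anyone who finds the short flags hard to audit, the same command can be written with wget's long option names. The sketch below wraps it in a small function so it can be printed as a dry run first; run_mirror, FORUM_URL and REJECT are names made up for this example, and the URL is the same placeholder as above.

```shell
#!/bin/sh
# Placeholder forum address and the same reject list as the command above.
FORUM_URL="https://forums.example.com"
REJECT='*sid=*,ucp.php*,memberlist.php*,*mode=viewprofile*,*view=print*,viewonline.php*,search.php*,posting.php*'

# Hypothetical wrapper: any arguments given are placed in front of the
# wget command, so "run_mirror echo" prints it instead of running it.
run_mirror() {
  "$@" wget --mirror --page-requisites --no-parent --reject "$REJECT" "$FORUM_URL"
}

# Dry run: show the command that would be executed.
run_mirror echo

# When the printed command looks right, start the real crawl with:
#   run_mirror
```

The long names map one-to-one onto the short ones: --mirror is -m, --page-requisites is -p, --no-parent is -np and --reject is -R.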

Finishing the job

With all of the files downloaded, there are four more small jobs to finish the archiving.

Firstly, move your archived copy of the forums to a sub-folder of your forum called "archive".

Secondly, do some Bash magic to symlink the archived topics so that they're also available by post ID:

for file in viewtopic.php\?f\=*\&t\=*; do
  if [ ! -L "$file" ]; then
    f=$(echo "$file" | grep -oP "(?<=f=)[0-9]+")
    grep -oP '(?<=#p)[0-9]+' "$file" | while read post; do
      ln -s "$file" viewtopic.php\?f\=$f\&p\=$post
      ln -s "$file" viewtopic.php\?p\=$post
    done
  fi
done
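To see what the loop produces before running it on the real archive, it can be tried in a scratch directory first. This is only a sketch: the topic file name and the post IDs (101 and 102) are invented, the fake page just contains two links of the "#p<ID>" kind the loop searches for, and GNU grep is assumed because the loop uses -P.

```shell
#!/usr/bin/env bash
# Scratch-directory demo of the symlink loop; it never touches a real archive.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Fake archived topic page (forum 2, topic 10) with two invented post anchors.
printf '%s\n' \
  '<a href="./viewtopic.php?p=101#p101">Post</a>' \
  '<a href="./viewtopic.php?p=102#p102">Post</a>' > 'viewtopic.php?f=2&t=10'

for file in viewtopic.php\?f\=*\&t\=*; do
  if [ ! -L "$file" ]; then
    # Forum ID taken from the file name.
    f=$(echo "$file" | grep -oP '(?<=f=)[0-9]+')
    # Every post ID that appears as "#p<ID>" inside the page.
    grep -oP '(?<=#p)[0-9]+' "$file" | while read -r post; do
      ln -s "$file" "viewtopic.php?f=$f&p=$post"
      ln -s "$file" "viewtopic.php?p=$post"
    done
  fi
done

# List the original page plus the symlinks that now point at it.
ls -l
```

Each post ID ends up with two symlinks, one with the forum ID in the name and one without, both pointing at the topic page that contains the post.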

Thirdly, download the Archived Forum scripts and overwrite the existing files. These files either serve up the static page, or show a "not found" or "gone" page as appropriate. Without these scripts, your archived copy won't work because "?" in a file name and "?" in a URL behave in two different ways.
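The "?" problem is easier to see with an example. In a URL such as viewtopic.php?f=2&t=10, the web server treats viewtopic.php as the path and f=2&t=10 as the query string, whereas wget saved a file whose literal name is "viewtopic.php?f=2&t=10". Something has to join the two back together and check whether that file exists. The shell sketch below illustrates only that lookup idea; it is not the Archived Forum scripts themselves, and archived_file and its parameters are hypothetical names.

```shell
#!/bin/sh
# Hypothetical lookup: map a URL path plus query string to the file wget saved.
archived_file() {
  archive_dir=$1   # directory holding the mirrored files
  script=$2        # URL path, e.g. viewtopic.php
  query=$3         # URL query string, e.g. f=2&t=10
  candidate="${archive_dir}/${script}?${query}"
  if [ -e "$candidate" ]; then
    printf '%s\n' "$candidate"   # a real script would serve this file
  else
    return 1                     # a real script would answer "not found" or "gone"
  fi
}

# Demo in a scratch directory with one empty, fake archived page.
scratch=$(mktemp -d)
: > "$scratch/viewtopic.php?f=2&t=10"

archived_file "$scratch" viewtopic.php 'f=2&t=10' && echo "found"
archived_file "$scratch" viewtopic.php 'f=2&t=99' || echo "missing"
```

The first call finds the fake page and the second does not, which mirrors the two outcomes described above: serve the static page, or report that it is missing.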

Finally, delete everything you don't need (taking backups first in case you need to revert everything and export again). Drop the database, delete ACP files, caches, documentation, includes, language files and templates, but keep images, CSS and file downloads (plus anything that got added in earlier steps).

Enjoy your archive

It is always a shame when your community dries up and you get to the point of archiving a forum. The Hive World Terra forums were very nearly 14 years old when I closed them. Now that they're archived, though, they can live on for posterity without causing me concern about exploits and vulnerabilities!
