GPF takeout

From Potatopedia

GPF takeout is a project under development whose final goal is to let any user export the threads of any Google Group. The initial goal is narrower: to export the Spanish Google Calendar Forum in order to preserve its threads, which contain a lot of knowledge.

Initial plan for exporting the forums

The initial plan is to create two pieces of software that together export all the threads: a Chrome extension, which collects all the thread IDs, and a PHP script, which saves each thread to the file system straight from its print page. The end result will therefore be a series of HTML files, one per thread, each holding that thread's contents.

Chrome extension

To export the Calendar Forum, the Chrome extension will be activated from [here] using the old GPF interface (not Hadfield), and it will print a JSON string such as the following:

{"forum": "calendar", "threads": ["jl5iTKctz2M", "D5WQk-OCWxI", "MBSgJ85gE7g"]}

This string must then be saved manually to a file, which will be used later on.

PHP script

The PHP script will read the JSON file from standard input and download every thread as an HTML document into a specified folder.
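As a rough illustration of the step just described, here is a minimal sketch written in Python rather than PHP, so it is not the script the project uses. The per-thread print URL is not given in this article, so THREAD_URL_TEMPLATE is a hypothetical placeholder, and the function names and the one-folder-per-forum layout are likewise assumptions.

```python
import json
from pathlib import Path
from urllib.request import urlopen

# ASSUMPTION: hypothetical URL pattern; the article does not state the
# address of a single thread's print page.
THREAD_URL_TEMPLATE = "https://productforums.google.com/forum/print/msg/{forum}/{thread}"


def http_get(url):
    """Fetch a URL and return the raw response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def plan_downloads(export, out_dir):
    """Return (url, path) pairs for every thread ID in the extension's output."""
    forum = export["forum"]
    return [
        (
            THREAD_URL_TEMPLATE.format(forum=forum, thread=thread_id),
            Path(out_dir) / forum / f"{thread_id}.html",
        )
        for thread_id in export["threads"]
    ]


def download_all(export, out_dir, fetch=http_get):
    """Save each thread's print page as an HTML file and return the saved paths.

    `fetch` maps a URL to bytes; it is a parameter so the function can be
    exercised without touching the network.
    """
    saved = []
    for url, path in plan_downloads(export, out_dir):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved
```

To mirror the standard-input behaviour, a caller could pass json.load(sys.stdin) as `export` and the target folder as `out_dir`.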

Second iteration of the plan

A better option, though one that is harder to develop, is to read the thread list output by the Chrome extension and save each thread's contents into a database. The Pistachio Forums software I developed would then be used to display the threads.
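The database variant might look roughly like the following Python sketch. The article names neither the database engine nor the Pistachio Forums schema, so the choice of SQLite, the table layout, and the function names are all assumptions made purely for illustration.

```python
import sqlite3

# ASSUMPTION: hypothetical schema; the real Pistachio Forums tables are not
# described in this article.
SCHEMA = """
CREATE TABLE IF NOT EXISTS threads (
    forum     TEXT NOT NULL,
    thread_id TEXT NOT NULL,
    html      TEXT NOT NULL,
    PRIMARY KEY (forum, thread_id)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the thread store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_thread(conn, forum, thread_id, html):
    """Insert or overwrite one thread, so re-running an export is harmless."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO threads (forum, thread_id, html) VALUES (?, ?, ?)",
            (forum, thread_id, html),
        )


def load_thread(conn, forum, thread_id):
    """Return the stored HTML for a thread, or None if it was never saved."""
    row = conn.execute(
        "SELECT html FROM threads WHERE forum = ? AND thread_id = ?",
        (forum, thread_id),
    ).fetchone()
    return row[0] if row else None
```

Keying rows on the forum name plus the thread ID means an interrupted export can simply be restarted without creating duplicates.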

Third iteration of the plan

When the extension is run to generate the list of IDs for all the threads in a forum, it runs out of memory if the forum is large (more than 20000 threads, for instance).

Therefore, another PHP script, which replaces the extension, was developed to extract all the IDs by scraping URLs of the form https://productforums.google.com/forum/print/categories/{forum}%5B{first}-{last}%5D, where {forum} is the forum's codename, {first} is the number of the first post we want to retrieve, and {last} is the number of the last post we want to retrieve. Google Groups only serves this page if fewer than 100 threads are requested, so many requests are needed to extract all the IDs. For a forum with ~130000 threads, such as youtube-es (which has now moved to Tailwind), that means making about 1300 requests.
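The batching can be sketched as follows, again in Python rather than PHP. The URL pattern is the one quoted above; the 1-based, inclusive numbering and the function name are assumptions, and the batch size is left to the caller because the exact limit Google Groups enforces is only loosely stated here.

```python
LIST_URL_TEMPLATE = (
    "https://productforums.google.com/forum/print/categories/"
    "{forum}%5B{first}-{last}%5D"
)


def listing_urls(forum, total_threads, batch_size):
    """Build one listing URL per batch of thread numbers.

    ASSUMPTION: ranges are 1-based and inclusive on both ends; the article
    does not say how {first} and {last} are counted.
    """
    urls = []
    for first in range(1, total_threads + 1, batch_size):
        # The final batch may be shorter than batch_size.
        last = min(first + batch_size - 1, total_threads)
        urls.append(LIST_URL_TEMPLATE.format(forum=forum, first=first, last=last))
    return urls
```

Calling listing_urls("youtube-es", 130000, 100) produces 1300 URLs, in line with the request count mentioned above; if the server turns out to reject a batch of exactly 100, a smaller batch size such as 99 can be passed instead.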