Scraping the Teknoids Mailman Pipermail Archive

Putting this here in case anyone needs something to scrape the Pipermail web archive of a Mailman mailing list. This bit of Python 3 is based on a bit of Python 2 I found at Scraping GNU Mailman Pipermail Email List Archives. The only changes I made to the original were to update some things so they work in Python 3. It works well for my purposes, generating a single text file of the teknoids list archive from 2005 to today.

#!/usr/bin/env python3

import requests
from lxml import html
import gzip
from io import BytesIO

listname = 'teknoids'
url = 'https://lists.teknoids.net/pipermail/' + listname + '/'

response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    # Download one archive file and return its contents as bytes.
    print(filename)
    response = requests.get(url + filename)
    if filename.endswith('.gz'):
        # Gzipped archives are decompressed in memory.
        contents = gzip.GzipFile(fileobj=BytesIO(response.content)).read()
    else:
        contents = response.content
    return contents

contents = [emails_from_filename(filename) for filename in filenames]
contents.reverse()

contents = b"\n\n\n\n".join(contents)

with open(listname + '.txt', 'wb') as filehandle:
    filehandle.write(contents)
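
If you want to sanity-check the gzip handling without hitting the list server, here is a minimal sketch that uses only the standard library. The helper name decode_archive, the sample message, and the example filenames are made up for illustration and are not part of the script above; the function just mirrors the .gz branch of emails_from_filename on bytes you already have in hand.

```python
import gzip
from io import BytesIO


def decode_archive(filename, payload):
    """Return the raw bytes of one archive file, decompressing .gz payloads.

    Hypothetical helper: same logic as the .gz check in the script,
    but it takes the downloaded bytes as an argument instead of fetching them.
    """
    if filename.endswith('.gz'):
        return gzip.GzipFile(fileobj=BytesIO(payload)).read()
    return payload


# Made-up sample message, not taken from the real archive.
sample = (
    b"From alice at example.com  Mon Jan  3 10:00:00 2005\n"
    b"Subject: test\n"
    b"\n"
    b"Hello\n"
)

# Illustrative filenames; check the archive index for the real ones.
plain = decode_archive('2005-January.txt', sample)
unzipped = decode_archive('2005-January.txt.gz', gzip.compress(sample))

# Both paths should hand back the same bytes.
print(plain == sample, unzipped == sample)
```

Keeping everything as bytes, as the script does, means the final join and the 'wb' write never have to guess at the character encoding of old messages.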