Creating an email archive with public-inbox

By Jonathan Corbet
February 28, 2018
Keeping up with the free-software development community requires following a lot of mailing lists. For many years, the Gmane email archive has helped your editor to do that without going any crazier than he already is, but Gmane is becoming an increasingly unreliable resource. A recent incident increased the priority of a longstanding goal to find (or create) an alternative to Gmane. That, in turn, led to the discovery of public-inbox.

The decline of Gmane

At its peak, Gmane was by far the best way to follow many dozens of mailing lists. It holds archives of a vast number of lists — the front page currently claims over 15,000 — so most of the lists of interest can be found there. Crucially, Gmane offers an NNTP feed; newsreaders are the fastest way that your editor has found to quickly get through a day's email and pick out the interesting messages. Gmane also offered a web-based view into the archive that could be easily linked using a message ID; that made it easy to capture emails and link them back to the context in which they were sent.

Gmane was created by Lars Magne Ingebrigtsen, who operated it for many years before burning out and moving on in 2016. A company called Yomura picked up the archive and continued operating the NNTP feed, but that is where things stopped. The web interface disappeared, never to return, breaking thousands of links across the net. The front page still says "some things are very broken" and links to a blog page that was last updated in September 2016. Gmane has appeared to be on minimal life support for some time.

In mid-February, Gmane stopped receiving emails from every mailing list hosted at vger.kernel.org; those include most of the kernel-related lists, but also lists for other projects like Git. Your editor posted a query and learned that delivery problems had forced Gmane to be dropped from all lists hosted at vger. While this was happening, the main Gmane web page also ceased to work. Since then, a handful of vger lists have returned to Gmane, though the bulk of them remain unsubscribed.

Those lists could certainly be fixed too, if somebody were to find the right person to poke. But the fact that so many high-profile lists could disappear for a week or more without anybody even seeming to notice makes it clear that Gmane is not getting a lot of attention these days. The wait for the web interface to come back is in vain; it's not at all clear that even what's there now is going to last for much longer.

Gmane has served the community well for years, and we all owe the people who have worked to make that happen a huge round of thanks. But all things must end, and it may well be that Gmane's time is coming soon. So what is a frantic LWN editor to do to ensure his ability to keep up with the community?

public-inbox

In the same discussion mentioned above, Konstantin Ryabitsev mentioned that the Linux Foundation is working with a project called public-inbox to create a comprehensive archive for the linux-kernel list. That inspired your editor to go and take a look. The conclusion is that public-inbox may well be the tool for this job, but there are some rough edges to be smoothed out first. The first of those could be said to be the project's web site, which is an unadorned directory listing containing a handful of documentation files.

To summarize: public-inbox can be used to implement an archive for one or more mailing lists. There is a web interface (see the page for the project's own mailing list for an example); it is functional but not necessarily designed for aesthetic appeal. There is a search facility implemented with Xapian that can make it easy to find messages of interest, though it lacks notmuch-style tags. Public-inbox also, happily, implements an NNTP interface to the archive.

Public-inbox, created and almost exclusively developed by Eric Wong, does not appear to have the creation of a Gmane-style mailing-list archive as its primary use case. Instead, it is a tool allowing people to follow (and participate in) mailing lists without the hassle of actually subscribing to them. That shows up in various ways in the design of the system.

For example, there is an interesting design decision at the core of public-inbox: each mailing-list archive is stored in a Git repository. Every incoming message is added to the repository in its own file in a separate commit; the Git history is thus the history of incoming email. A bare Git repository is normally used, so there is no need to duplicate the emails themselves. Viewing an email requires locating its file and checking it out of the repository — though none of that activity is visible to users of the system.
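As a rough illustration of what "its own file" can mean in practice, the sketch below derives a repository path from a message's Message-ID. The scheme shown (hex SHA-1 of the Message-ID without its angle brackets, split into a two-character directory and a 38-character file name) is an assumption made for illustration, and `message_path` is a hypothetical helper rather than a public-inbox function.

```python
import hashlib


def message_path(message_id: str) -> str:
    """Map a Message-ID to a hypothetical in-repository file path."""
    mid = message_id.strip()
    # Drop the surrounding angle brackets, if present.
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    # Assumed layout: two-character directory, 38-character file name.
    return f"{digest[:2]}/{digest[2:]}"
```

A layout along these lines would let a message be found from its Message-ID alone, without scanning the commit history.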

This use of Git would appear to be driven by a desire to make it easy for others to duplicate a specific list archive. And, perhaps more to the point, readers can "subscribe" to the list by periodically pulling new messages from the archive repository. There is a tool (called ssoma) that can be used to feed messages from a public-inbox repository into an email client. When readers get tired of a specific mailing list, they need only stop pulling from the relevant repository; no "unsubscribe" operations are needed. Whether people really want to follow mailing lists in this manner is unclear, but the capability is there.
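The "subscribe by pulling" idea can be sketched in a few lines. This is only an illustration, not how ssoma works: it assumes a local clone whose remote is named `origin` and whose messages live on a `master` branch, and the helper names are hypothetical. It relies on the property described above, that each message arrives as its own commit.

```python
import subprocess


def fetch_argv(repo_dir, remote="origin", branch="master"):
    """Command that fetches the archive branch (remote and branch names are assumptions)."""
    return ["git", "-C", repo_dir, "fetch", remote, branch]


def new_commits_argv(repo_dir, last_seen):
    """Command that lists commits added since last_seen, oldest first."""
    return ["git", "-C", repo_dir, "rev-list", "--reverse",
            f"{last_seen}..FETCH_HEAD"]


def pull_new_messages(repo_dir, last_seen):
    """Fetch the archive and return the IDs of commits not seen before."""
    subprocess.run(fetch_argv(repo_dir), check=True)
    result = subprocess.run(new_commits_argv(repo_dir, last_seen),
                            check=True, capture_output=True, text=True)
    return result.stdout.split()
```

A reader would remember the last commit ID returned and pass it as `last_seen` on the next run; "unsubscribing" amounts to no longer calling `pull_new_messages`.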

There are various ways of feeding email into a public-inbox repository. The source comes with an import_maildir script that took many hours to import a 500,000-message linux-kernel archive. It is a somewhat fragile tool, crashing easily on email with malformed headers, but it worked well in the end and public-inbox is quite responsive with an archive of that size — at least, until it decides to run git prune on the repository. The public-inbox-mda utility will read a message from the standard input and inject it into an archive; it is meant to be used from a .forward or .procmailrc file. There is also public-inbox-watch, which will keep an eye on a maildir directory and feed new messages to the archive as they arrive. In general, setting up a new archive is a simple and easily scripted task once one understands how the utilities work.
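To make the maildir-watching approach concrete, here is a minimal sketch of the general technique using Python's standard `mailbox` module. It is not the public-inbox-watch implementation (which is Perl); `new_messages` and the `seen_keys` set are hypothetical names, and a real watcher would also need to persist its state and react to changes rather than be called in a loop.

```python
import mailbox


def new_messages(maildir_path, seen_keys):
    """Return (key, raw_bytes) for maildir messages not yet in seen_keys.

    seen_keys is a set that is updated in place, so repeated calls
    only report messages that arrived since the previous call.
    """
    box = mailbox.Maildir(maildir_path, create=False)
    fresh = []
    for key in box.keys():
        if key not in seen_keys:
            fresh.append((key, box.get_bytes(key)))
            seen_keys.add(key)
    return fresh
```

Each returned message could then be handed to whatever injects mail into the archive, for example by piping the raw bytes to a delivery tool on its standard input.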

A young project

The initial commit to the public-inbox repository was made in January 2014, just over four years ago. Since then, some 1,300 commits have built it up to 11,000 lines of code or so. In many ways, though, public-inbox feels like a young project that is still working to get some of the basic functionality in place. It will certainly need some work before it can be used to create archives that run at any sort of scale.

The project's documentation can be accurately described as "spartan", leaving much for the user to figure out on their own. To keep that task from being too easy, many of the commands will just silently fail if something is not set up to their liking. For example, public-inbox-mda will silently drop messages on the floor if the given mailing-list name does not appear in the To or CC headers. Your editor has more than once had to resort to placing print statements in the code (which is all Perl 5, tragically) in order to figure out where things were going wrong.
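One way to avoid chasing silently dropped mail is to check a message before handing it over. The sketch below is such a pre-flight check, written under the assumption that the test amounts to "the list address appears in To or Cc"; `addressed_to_list` is a hypothetical helper and does not reproduce the exact logic inside public-inbox-mda.

```python
from email import message_from_string
from email.utils import getaddresses


def addressed_to_list(msg, list_address):
    """Return True if list_address appears in the To or Cc headers."""
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    # getaddresses() splits header values into (display name, address) pairs.
    addresses = {addr.lower() for _name, addr in getaddresses(headers)}
    return list_address.lower() in addresses


example = message_from_string(
    "From: dev@example.org\n"
    "To: Some List <list@example.org>\n"
    "Cc: other@example.org\n"
    "Subject: test\n\nbody\n"
)
if not addressed_to_list(example, "missing@example.org"):
    print("message would not match the list address")
```

Logging the failed check, as the last two lines do, gives the operator the diagnostic that the tool itself does not provide.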

Other glitches abound. The web interface offers no customization or theming support. The NNTP server does not create proper Xref headers for messages that are cross-posted to more than one list, meaning that a reader of both lists will see a lot of duplicates. There are no tools for monitoring the flow of emails into the archive or troubleshooting problems. The Git-based design could make it interesting to remove an old email from the archive, should that become necessary — from looking at the code, it appears that rebasing the repository would break the archive, though your editor has not actually run this experiment. The X-No-Archive header is not honored. There are concerns about scalability to huge archives. There is also no word about what the project has done, if anything, to ensure the security of code that is exposed to the Internet via the email stream and the HTTP and NNTP ports.
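For readers unfamiliar with the Xref header, the sketch below shows the general shape of one: the server name followed by a `group:article-number` pair for every group that carries the message. Newsreaders use it to mark a cross-posted article as read everywhere once it has been read in one group. The server and group names are made up, and `build_xref` is a hypothetical helper, not code from public-inbox.

```python
def build_xref(server_name, locations):
    """Build an Xref header line from (group, article_number) pairs."""
    parts = [f"{group}:{number}" for group, number in locations]
    return f"Xref: {server_name} " + " ".join(parts)


# A message cross-posted to two hypothetical groups.
header = build_xref("news.example.org",
                    [("list.kernel", 1042), ("list.git", 77)])
```

Generating such a header requires knowing every group in which a message appears, so a server that stores each list in an independent archive needs some form of cross-list lookup by Message-ID.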

Still, it seems that public-inbox has the core features that are needed to set up a no-nonsense email archive without a huge amount of work. Its simplicity is a nice contrast to something like HyperKitty, which quickly leads a hopeful user into a morass of Django setup and dependencies — and which lacks an NNTP server. There is enough apparent potential here that the Linux Foundation is funding some work to improve the scalability of public-inbox for its linux-kernel archive project. If public-inbox can generate some more interest and grow beyond an essentially single-developer project, it may well come to fill an important niche in our community.



Creating an email archive with public-inbox

Posted Mar 1, 2018 0:17 UTC (Thu) by xanni (subscriber, #361)

I'm still using mhonarc for my mailing lists, but I think that's not getting a lot of updates these days either.

Creating an email archive with public-inbox

Posted Mar 1, 2018 5:02 UTC (Thu) by songmaster (subscriber, #1748)

Ditto on the use of Mhonarc. By generating an index as a PHP file I recently created full-text RSS feeds for my lists, so it’s pretty flexible and still seems to work pretty well.

Creating an email archive with public-inbox

Posted Mar 1, 2018 12:45 UTC (Thu) by lamby (subscriber, #42621)

Somewhat ironic to read this on "Mailman day"...

mail-archive.com is a mostly viable alternative

Posted Mar 2, 2018 9:35 UTC (Fri) by lacos (guest, #70616)

The original threaded / "framed" webui of GMANE remains irreplaceable, IMO. The second best choice I've found thus far is mail-archive.com. While it lacks the frames of GMANE (i.e. it doesn't provide a separate frame for the threaded subject lines and for the message body of the currently highlighted subject line), it does offer the following:

- threaded display of the discussion (i.e. subject lines) at the bottom of the screen, below the full message
- threads tracked & linked together across month boundaries
- support for Message-ID based URL format: <http://mid.mail-archive.com/151864824792.12702.1556511371...>. Once the lookup completes, the browser is redirected to a mail-archive.com "native" URL, and the user will be placed into the right context within the thread.
- mail-archive.com handles cross-posted messages well
- it's not difficult to subscribe mail-archive.com to another list; it's well documented in their FAQ

Over the years of using GMANE, I had become complacent with its top-notch service and stability, and ended up pasting the GMANE-native URLs everywhere. Those are precisely the links that broke all over the net when Lars pulled the plug, as Jonathan writes. It taught me a good lesson: even though I could now paste URLs everywhere that are "native" to mail-archive.com, I only use the Message-ID-based format. If mail-archive.com ever goes down, those links will remain usable: people can just scavenge the Message-IDs from the links, and search other archives for them.

Another advantage of the msgid-based links is that I can compose them before mail-archive.com actually receives & indexes the message. I can look at the msgid of the email that I just sent with Thunderbird or git-send-email, compose the link and paste it in a bugzilla comment, for example. The link might not work for the next hour or so, but in time it will start working.
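Composing such a link by hand is easy to get subtly wrong, so here is a minimal sketch of the idea. It assumes the link is simply the redirector prefix followed by the Message-ID without its angle brackets, and that percent-encoding unusual characters is acceptable to the service; `mid_url` is a hypothetical helper and the example Message-ID is made up.

```python
from urllib.parse import quote


def mid_url(message_id, base="http://mid.mail-archive.com/"):
    """Build a Message-ID based archive link (URL format is an assumption)."""
    mid = message_id.strip()
    # The angle brackets are header syntax, not part of the ID itself.
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    # Keep "@" readable; escape everything else that is unsafe in a path.
    return base + quote(mid, safe="@")


link = mid_url("<20180302.example@mail.example.org>")
```

Because the Message-ID stays visible in the result, anyone can recover it from the link and search another archive if the redirector disappears.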

Not trying to detract from the "public-inbox" project, of course (this is the first time I read about it); just saying that the demise of GMANE hurt a lot and that mail-archive.com has proved a close second. I'm unsure about NNTP (I haven't used NNTP in quite a few years).

mail-archive.com is a mostly viable alternative

Posted Mar 3, 2018 1:01 UTC (Sat) by pabs (subscriber, #43278)

BTW, Wikipedia has a list of Message-ID redirectors:

https://en.wikipedia.org/wiki/Message-ID

If you know of any more, please add them.

mail-archive.com is a mostly viable alternative

Posted Sep 19, 2018 14:49 UTC (Wed) by aspiers (guest, #39767)

I just wrote and published a quick and dirty Python script for converting broken gmane links into public-inbox.org links based on Message-ID:

https://github.com/aspiers/list-utils/blob/master/rescue-...

If you have ideas to improve it, feel free to submit issues, or even better, pull requests.

Creating an email archive with public-inbox

Posted Mar 3, 2018 6:25 UTC (Sat) by darwish (guest, #102479)

Unfortunately their UI is nothing in comparison to GMane's threaded web interface :-(

Why did the Linux Foundation decide not to revive GMane instead? Jon mentioned on G+ that they did not want to go that way ...

Creating an email archive with public-inbox

Posted Mar 4, 2018 20:37 UTC (Sun) by marcH (subscriber, #57642)

https://lars.ingebrigtsen.no/2016/07/28/the-end-of-gmane/
> I’m open to ideas here. If somebody else wants to take over the concept, I can FedEx you a disk containing the archive (as an NNTP spool). I’ve written a lot of software for Gmane, but it’s all quite site specific and un-documented. And the web interface was written in, like, 2004, so it’s way way way un-Web 2.0-ey and shiny. You’re probably better off implementing this stuff from scratch.

Yet another proof of concept which turned into a mission-critical product :-(

Can't wait for those self-driving cars ;-)

Creating an email archive with public-inbox

Posted Mar 4, 2018 7:28 UTC (Sun) by marcH (subscriber, #57642)

> Instead, it is a tool allowing people to follow (and participate in) mailing lists without the hassle of actually subscribing to them.

The main problem with subscribing to a mailing-list is more than just a "hassle": subscribing only gets you *future* messages.

Creating an email archive with public-inbox

Posted Mar 4, 2018 20:24 UTC (Sun) by jrn (subscriber, #64214)

The Git project (https://public-inbox.org/git/) has been making heavy use of public-inbox links since it was first created. Some nice features:
- URLs contain the message-id, making it easy to migrate to another service if public-inbox goes away
- Easy to download an mbox with a collection of messages, like download.gmane.org provided
- Fast search, both online (https://public-inbox.org/meta/_/text/help/) and using git commands offline
- The web UI includes instructions for replying to a message using various mail clients
- Mirroring (either publicly or locally) is straightforward
- No ads

If someone makes a nice UI like gmane's framed one on top of this foundation then that would be amazing.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds