ref: 7cf8369411a3b9348f862bccaa4147b9aaf38e67
dir: /sys/doc/nupas/nupas.ms/
.EQ delim $$ .EN .TL Scaling Upas .AU Erik Quanstrom [email protected] .AB The Plan 9 email system, Upas, uses traditional methods of delivery to .UX mail boxes while using a user-level file system, Upas/fs, to translate mail boxes of various formats into a single, convenient format for access. Unfortunately, it does not do so efficiently. Upas/fs reads entire folders into core. When deleting email from mail boxes, the entire mail box is rewritten. I describe how Upas/fs has been adapted to use caching, indexing and a new mail box format (mdir) to limit I/O, reduce core size and eliminate the need to rewrite mail boxes. .AE .NH Introduction .LP .DS I Chained at his root two scion demons dwell .br – Erasmus Darwin, The Botanic Garden .DE .LP At Coraid, email is the largest resource user in the system by orders of magnitude. As of July, 2007, rewriting mail boxes was using 300MB/day on the WORM and several users required more than 400MB of core. As of July, 2008, rewriting mail boxes was using 800MB/day on the WORM and several users required more than 1.2GB of core to read email. Clearly these are difficult to sustain levels of growth, even without growth of the company. We needed to limit the amount of disk space used and, more urgently, reduce Upas/fs' core size. .LP The techniques employed are simple. Mail is now stored in a directory with one message per file. This eliminates the need to rewrite mail boxes. Upas/fs now maintains an index which allows it to present complete message summaries without reading indexed messages. Combining the two techniques allows Upas/fs to read only new or referenced messages. Finally, caching limits both the total number of in-core messages and their total size. .NH Mdir Format .LP In addition to meeting our urgent operational requirements of reducing memory and disk footprint, to meet the expectations of our users we require a solution that is able to handle folders up to ten thousand messages, open folders quickly, list the contents of folders quickly and support the current set of mail readers. .LP There are several potential styles of mail boxes. The maildir[1] format has some attractive properties. Mail can be delivered to or deleted from a mail box without locking. New mail or deleted mail may be detected with a directory scan. When used with WORM storage, the amount of storage required is no more than the size of new mail received. Mbox format can require that a new copy of the inbox be stored every day. Even with storage that coalesces duplicate blocks such as Venti, deleting a message will generally require new storage since messages are not disk-block aligned. Maildir does not reduce the cost of the common task of a summary listing of mail such as generated by acme Mail. .LP The mails[2] format proposes a directory per mail. A copy of the mail as delivered is stored and each mime part is decoded in such a way that a mail reader could display the file directly. Command line tools in the style of MH[3] are used to display and process mail. Upas/fs is not necessary for reading local mail. Mails has the potential to reduce memory footprint below that offered by mdirs for native email reading. However all of the largest mail boxes at our site are served exclusively through IMAP. The preformatting by mails would be unnecessary for such accounts. .LP Other mail servers such as Lotus Notes[4] store email in a custom database format which allows for fielded and full-text searching of mail folders. Such a format provides very quick mail listings and good search capabilities. Such a solution would not lend itself well to a tool-based environment, nor would it be simple. .LP Maildir format seemed the best basic format but its particulars are tied to the .UX environment; mdir is a descendant. A mdir folder is a directory with the name of the folder. Messages in the mdir folder are stored in a file named .I "utime.seq" . .I Utime is defined as the decimal .UX seconds when the message was added to the folder. For the inbox, this time will correspond to the .UX “From ” line. .I Seq is a two-digit sequence number starting with .CW "00." The lowest available sequence number is used to store the message. Thus, the first email possible would be named .CW "0.00." To prevent accidents, message files are stored with the append-only and exclusive-access bits turned on. The message is stored in the same format it would be in mbox format; each message is a valid mbox folder with a single message. .NH Indexing .LP When upas/fs finds an unindexed message, it is added to the index. The index is a file named .I "foldername" .idx and consists a signature and one line per MIME part. Each line contains the SHA1 checksum of the message (or a place holder for subparts), one field per entry in the .I "messageid/info" file, flags and the number of subparts. The flags are currently a superset of the standard IMAP flags. They provide the similar functionality to maildir's modified file names. Thus the `S' (answered) flag remains set between invocations of mail readers. Other mutable information about a message may be stored in a similar way. .LP Since the .I info file is read by all the mail readers to produce mail listings, mail boxes may be listed without opening any mail files when no new mail has arrived. Similarly, opening a new mail box requires reading the index and checking new mail. Index files are typically between 0.5% and 5% the size of the full mail box. Each time the index is generated, it is fully rewritten. .NH Caching .LP Upas/fs stores each message in a .CW "Message" structure. To enable caching, this structure was split into four parts: The .CW "Idx" (or index), message subparts, information on the cache state of the message and a set of pointers into the processed header and body. Only the pointers to the processed header and body are subject to caching. The available cache states are .CW "Cidx" , .CW "Cheader" and .CW "Cbody" . .LP When the header and body are not present, the average message with subparts takes roughly 3KB of memory. Thus a 10,000 message mail box would require roughly 30MB of core in addition to any cached messages. Reads of the .CW "info" or .CW "subject" files can be satisfied from the information in the .CW "Idx" structure. .LP Since there are a fair number of very large messages, requests that can be satisfied by reading the message headers do not result in the full message being read. Reads of the .CW "header" or .CW "rawheader" files of top-level messages are satisfied in this way. Reading the same files for subparts, however, results in the entire message being read. Caching the header results in the .CW "Cheader" cache state. .LP Similarly, reading the .CW "body" requires the body to be read, processed and results in the .CW "Cbody" cache state. Reading from MIME subparts also results in the .CW "Cbody" cache state. .LP The cache has a simple LRU replacement policy. Each time a cached member of a message is accessed, it is moved to the head of the list. The cache contains a maximum number of messages and a maximum size. While the maximum number of messages may not be exceeded, the maximum cache size may be exceeded if the sum of all the currently referenced messages is greater than the size of the cache. In this case all unreferenced messages will be freed. When removing a message from the cache all of the cacheable information is freed. .NH Collateral damage .LP .DS I Each new user of a new system uncovers a new class of bugs. .br — Brian Kernighan .DE .LP In addition to upas/fs, programs that have assumptions about how mail boxes are structured needed to be modified. Programs which deliver mail to mail boxes (deliver, marshal, ml, smtp) and append messages to folders were given a common (nedmail) function to call. Since this was done by modifying functions in the Upas common library, this presented a problem for programs not traditionally part of Upas such as acme Mail and imap4d. Rather than fold these programs into Upas, a new program, mbappend, was added to Upas. .LP Imap4d also requires the ability to rename and remove folders. While an external program would work for this as well, that approach has some drawbacks. Most importantly, IMAP folders can't be moved or renamed in the same way without reimplementing functionality that is already in upas/fs. It also emphasises the asymmetry between reading and deleting email and other folder actions. Folder renaming and removal were added to upas/fs. It is intended that mbappend will be removed soon and replaced with equivalent upas/fs functionality — at least for non-delivery programs. .LP Mdirs also expose an oddity about file permissions. An append-only file that is mode .CW 0622 may be appended to by anyone, but is readable only by the owner. With a directory, such a setup is not directly possible as write permission to a directory implies permission to remove. There are a number of solutions to this problem. Delivery could be made asymmetrical—incoming files could be written to a mbox. Or, following the example of the outbound mail queue, each user could deliver to a directory owned by that user. In many BSD-derived .UX systems, the “sticky bit” on directories is used to modify the meaning of the .CW w bit for users matching only the other bits. For them, the .CW w bit gives permission to create but not to remove. .LP While this is somewhat of a one-off situation, I chose to implement a version of the “sticky bit” using the existing append-only bit on our file server. This was implemented as an extra permission check when removing files. Fewer than 10 lines of code were required. .NH Performance .LP A representative local mail box was used to generate some rough performance numbers. The mail box is 110MB and contains 868 messages. These figures are shown in table 1. In the worse case—an unindexed mail box—the new upas/fs uses 18% of the memory of the original while using 13% more cpu. In the best case, it uses only 5% of the memory while using only 13% of the cpu. Clearly, a larger mail box will make these ratios more attractive. In the two months since the snapshot was taken, that same mail box has grown to 220MB and contains 1814 messages. .ps -2 .DS C .TS box, tab(:); c s s s s c | c | c | c | c a | n | n | n | n. Table 1 – Performance _ action:user:system:real:core size: :s:s:s:MB: _ old fs read:1.69:0.84:6.07:135 _ initial read:1.65:0.90:6.90:25 _ indexed read:0.64:0.03:0.77:6.5 .TE .DE .NL .NH Future Work .LP While Upas' memory usage has been drastically reduced, it is still a work-in-progress. Caching and indexing are adequate but primitive. Upas/fs is still inconsistently bypassed for appending messages to mail boxes. There are also some features which remain incomplete. Finally, the small increase in scale brings some new questions about the organization of email. .LP It may be useful for mail boxes with very large numbers of messages to divide the index into fixed-size chunks. Then messages could be read into a fixed-sized pool of structures as needed. However it is currently hard to see how clients could easily interface a mail box large enough for this technique to be useful. Currently, all clients assume that it is reasonable to allocate an in-core data structure for each message in a mail box. To take advantage of a chunked index, clients (or the server) would need a way of limiting the number of messages considered at a time. Also, for such large mail boxes, it would be important to separate the incoming messages from older messages to limit the work required to scan for new messages. .LP Caching is particularly unsatisfactory. Files should be read in fixed-sized buffers so maximum memory usage does not depend on the size of the largest file in the mail box. Unfortunately, current data structures do not readily support this. In practice, this limitation has not yet been noticeable. .LP There are also a few features that need to be completed. Tracking of references has been added to marshal and upas/fs. In addition, the index provides a place to store mutable information about a message. These capabilities should be built upon to provide general threading and tagging capabilities. .NH Speculation .LP Freed from the limitation that all messages in a mail box must be read and stored in memory before a single message may be accessed, it is interesting to speculate on a few further possibilites. .LP For example, it may be useful to replace separate mail boxes with a single collection of messages assigned to one or more virtual mail boxes. The association between a message and a mail box would be a “tag.” A message could be added to or removed from one or more mail boxes without modifying the mdir file. If threads were implemented by tagging each message with its references, it would be possible to follow threads across mail boxes, even to messages removed from all mail boxes, provided the underlying file were not also removed. If a facility for adding arbitrary, automatic tags were enabled, it would be possible to tag messages with the email address in the SMTP From line. .NH References .IP [1] D. Bernstein, “Using maildir format”, published online at .br http://cr.yp.to/proto/maildir.html .IP [2] F. Ballesteros .IR mails (1), published online at http://lsub.org/magic/man2html/1/mails .IP [3] MH Wikipedia entry, http://en.wikipedia.org/wiki/MH_Message_Handling_System .IP [4] Lotus Notes Wikipedia entry, http://en.wikipedia.org/wiki/Lotus_Notes .IP [5] D. Presotto, “Upas—a Simpler Approach to Network Mail”, Proceedings of the 10th Usenix conference, 1985.