I have a huge mbox file with maybe 500 emails in it.
It looks like the following:
    From x@blah.com Fri Aug 12 09:34:09 2005
    Message-ID: <42FBEE81.9090701@blah.com>
    Date: Fri, 12 Aug 2005 09:34:09 +0900
    From: me
    User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
    X-Accept-Language: en-us, en
    MIME-Version: 1.0
    To: someone
    Subject: Re: (no subject)
    References:
    In-Reply-To:
    Content-Type: text/plain; charset=ISO-8859-1; format=flowed
    Content-Transfer-Encoding: 8bit
    Status: RO
    X-Status:
    X-Keywords:
    X-UID: 371
    X-Evolution-Source: imap://x+blah.com@blah.com/
    X-Evolution: 00000002-0010

    Hey

    the actual content of the email

    someone wrote:

    > lines of quoted text
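I would like to know how I can remove all of the quoted text and most of the headers, keeping only the To, From, and Date lines, so that the result still reads somewhat continuously.
My goal is to print these emails in a book-like format. At the moment every program wants to print either one email per page or all of the headers and quoted text. Any suggestions on where to start on a small program built from shell tools?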
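Mail::Box::Mbox will let you parse the file into individual messages easily. Mark Overmeer's slides from YAPC::Europe 2002 go into detail about why parsing is much harder than it looks. Using this library also lets you handle mh, IMAP, and many other formats, not just mbox.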
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Mail::Box::Manager;

    # Take the mbox path from the command line, falling back to $MAIL.
    my $file = shift || $ENV{MAIL};
    die "Usage: $0 mbox-file\n" unless defined $file;

    my $mgr = Mail::Box::Manager->new(
        access => 'r',
    );

    my $folder = $mgr->open( folder => $file )
        or die "$file: Unable to open: $!\n";

    for my $msg ($folder->messages) {
        my $to      = join( ', ', map { $_->format } $msg->to );
        my $from    = join( ', ', map { $_->format } $msg->from );
        my $date    = localtime( $msg->timestamp );
        my $subject = $msg->subject;    # not printed; add it below if wanted
        my $body    = $msg->body;

        # Strip all quoted text (every line that starts with ">").
        # Each stripped line is left behind as an empty line.
        $body =~ s/^>.*$//msg;

        print "From: $from\n",
              "To: $to\n",
              "Date: $date\n",
              "\n",
              "$body\n";
    }
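You may want to reconsider the request to strip quoted text: what if you get an email formatted with interleaved replies? Stripping the quoted text makes that kind of email very hard to understand: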
    Foo wrote:
    > I like bar.
    Bar? Who likes bar?
    > It is better than baz.
    Everyone knows that.
    -- Quux
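One possible compromise, sketched below, is to remove quoted text only when it forms a single block at the very end of the message (the top-posted case), and to leave interleaved quotes alone. The helper name `strip_trailing_quote` and the `wrote:` attribution pattern are assumptions made for illustration. They are not part of Mail::Box, and real mail will contain attribution lines and quoting styles this heuristic misses.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mail::Box): remove a block of quoted
# lines only if it sits at the end of the text, together with an
# attribution line such as "someone wrote:" directly above it.
# Quotes followed by more of the author's own text are kept.
sub strip_trailing_quote {
    my ($text) = @_;
    my @lines = split /\n/, $text;

    # Walk backwards over trailing blank lines and ">" lines.
    my $removed_quote = 0;
    while ( @lines && $lines[-1] =~ /^\s*(?:>.*)?$/ ) {
        $removed_quote = 1 if $lines[-1] =~ /^\s*>/;
        pop @lines;
    }

    # No quoted block at the end: return the text unchanged.
    return $text unless $removed_quote;

    # Assumed attribution format: a line ending in "wrote:".
    pop @lines if @lines && $lines[-1] =~ /wrote:\s*$/;

    # Tidy up blank lines left above the removed block.
    pop @lines while @lines && $lines[-1] =~ /^\s*$/;

    return join( "\n", @lines ) . "\n";
}

# Top-posted reply: the trailing quote and its attribution are dropped.
print strip_trailing_quote(
    "Hey\n\nthe actual content of the email\n\nsomeone wrote:\n\n> lines of quoted text\n"
);

# Interleaved reply: nothing is quoted at the very end, so it is kept as is.
print strip_trailing_quote(
    "Foo wrote:\n> I like bar.\nBar? Who likes bar?\n> It is better than baz.\nEveryone knows that.\n-- Quux\n"
);
```

In the loop from the answer, this could stand in for the blanket `s/^>.*$//msg` substitution by calling it on the stringified body, for example `strip_trailing_quote("$body")`. It deliberately errs on the side of keeping text, so a message that mixes interleaved and trailing quotes loses only the trailing block.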
Also, how do you plan to handle attachments, MIME types other than text/plain, encoded text entities, and other oddities?