我正在尝试做一些带有电子邮件的文本处理语料库。
我有一个主目录,在该目录下有多个文件夹。每个文件夹都有许多.txt文件。每个txt文件基本上都是电子邮件会话。
举例说明我的文本文件与电子邮件的外观,我从公开的安然电子邮件语料库中复制了一个相似的电子邮件文本文件。我在一个文本文件中具有相同类型的文本数据,其中包含多封电子邮件。
示例文本文件如下所示:
Message-ID: <3490571.1075846143093.JavaMail.evans@thyme>
Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
From: steven.kean@enron.com
To: kelly.kimberly@enron.com
Subject: Re: India And The WTO Services Negotiation
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Steven J Kean
X-To: Kelly Kimberly
X-cc:
X-bcc:
X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
X-Origin: KEAN-S
X-FileName: skean.nsf
fyi
---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49
PM ---------------------------
Joe Hillings@ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron@Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H
Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES,
Jeffrey Sherrick/Corp/Enron@Enron
Subject: Re: India And The WTO Services Negotiation
Sanjay: Some information of possible interest to you. I attended a meeting
this afternoon of the Coalition of Service Industries, one of the lead groups
promoting a wide range of services including energy services in the upcoming
WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week
and met with CII to discuss the upcoming WTO. CII apparently has a committee
looking into the WTO. Bob says that he told them that energy services was
among the CSI recommendations and he recalls that CII said that they too have
an interest.
Since returning from the meeting I spoke with Kiran Pastricha and told her
the above. She actually arranged the meeting in Delhi. She asked that I send
her the packet of materials we distributed last week in Brussels and London.
One of her associates is leaving for India tomorrow and will take one of
these items to Delhi.
Joe
Joe Hillings
09/08/99 11:57 AM
To: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT
cc: Terence H Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES,
Jeffrey Sherrick/Corp/Enron@Enron (bcc: Joe Hillings/Corp/Enron)
Subject: India And The WTO Services Negotiation
Sanjay: First some information and then a request for your advice and
involvment.
A group of US companies and associations formed the US WTO Energy Services
Coalition in late May and has asked the US Government to include "energy
services" on their proposed agenda when the first meeting of the WTO GATTS
2000 ministerial convenes late this year in Seattle. Ken Lay will be among
the CEO speakers. These negotiations are expected to last three years and
cover a range of subjects including agriculture, textiles, e-commerce,
investment, etc.
This morning I visited with Sudaker Rao at the Indian Embassy to tell him
about our coalition and to seek his advice on possible interest of the GOI.
After all, India is a leader in data processing matters and has other
companies including ONGC that must be interested in exporting energy
services. In fact probably Enron and other US companies may be engaging them
in India and possibly abroad.
Sudaker told me that the GOI has gone through various phases of opposing the
services round to saying only agriculture to now who knows what. He agrees
with the strategy of our US WTO Energy Services Coalition to work with
companies and associations in asking them to contact their government to ask
that energy services be on their list of agenda items. It would seem to me
that India has such an interest. Sudaker and I agree that you are a key
person to advise us and possibly to suggest to CII or others that they make
such a pitch to the GOI Minister of Commerce.
I will ask Lora to send you the packet of materials Chris Long and I
distributed in Brussels and London last week. I gave these materials to
Sudaker today.
Everyone tells us that we need some developing countries with an interest in
this issue. They may not know what we are doing and that they are likely to
have an opportunity if energy services are ultimately negotiated.
Please review and advise us how we should proceed. We do need to get
something done in October.
Joe
PS Terry Thorn is moderating a panel on energy services at the upcoming World
Services Congress in Atlanta. The Congress will cover many services issues. I
have noted in their materials that Mr. Alliwalia is among the speakers but
not on energy services. They expect people from all over the world to
participate.
因此,如您所见,在一个文本文件中基本上可以有多封电子邮件,除了新的电子邮件标题(“收件人”,“发件人”等)外,分隔规则没有太多明确的规定。
我可以在主目录中执行os.walk,然后遍历每个子目录,解析该子目录中的每个文本文件,等等,然后对其他子目录重复此操作,依此类推。
我需要提取文本文件中每封电子邮件的某些部分,并将其作为新行存储在数据集(csv,pandas数据框等)中。
有助于在数据集中提取并存储为行列的零件。然后,该数据集的每一行可以是每个文本文件中的每个电子邮件。
字段:
Original Email content | From (Sender)| To (Receipient) | cc (Receipient)| Date/Time Sent| Subject of Email|
编辑:我看着添加的重复问题。这考虑了固定的规格和边界。这里不是这样。我正在寻找一种提取上述不同字段的简单正则表达式方式
答案 0 :(得分:0)
^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)
确保在正则表达式引擎上使用 dotall ,多行和扩展模式。
对于您发布的示例,它至少可以正常工作,它捕获了不同组中的所有内容(您可能还需要在正则表达式引擎上启用它,具体取决于它是哪个)
Group `date` 63-99 `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
Group `sender` 106-127 `steven.kean@enron.com`
Group `to` 132-156 `kelly.kimberly@enron.com`
Group `cc` 650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H `
Group `subject` 930-974 `Re: India And The WTO Services Negotiation `
https://regex101.com/r/gHUOLi/1
并使用它遍历文本流,提到python,就可以了:
def match_email(long_string):
regex = r'^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)'
# try to match the thing
match = re.search(regex, long_string.strip(), re.I | re.X)
# if there is no match its over
if match is None:
return None, long_string
# otherwise, get it
email = match.groupdict()
# remove whatever matched from the original string
if email is not None:
long_string = long_string.strip()[match.end():]
# return the email, and the remaining string
return email, long_string
# now iterate over the long string
emails = []
email, tail = match_email(the_long_string)
while email is not None:
emails.append(email)
email, tail = match_email(tail)
print(emails)
那是直接从here偷来的,只是一些名字被改变了。