I'm a bit stuck on how to proceed, so a small nudge would be very helpful.
I have ~1800 text files, actually emails, in a repeating format.
Each file is structured like this:
```
From: Person-1 [email@person-1.com]
Sent: Tuesday, April 18, 2017 11:24 AM
To: email@person-2.com
Subject: Important Subject
User,
Below is your search alert.
Target: text
Attribute: text
Label: abcdef
Time: Apr 18, 2017 11:24 EDT
Full Text: Text of various length exists here. Some files even have links. I'm not sure how I would capture a varied length field.
Recording: abcde & fghijk lmnop
```
That's the gist of it.
I want to get this into a DataFrame, which I can then store as a CSV.
I'd like to end up with something like this:
| Target | Attribute | Label   | Time   | Full Text     | Recording   | Filename |
|--------|-----------|---------|--------|---------------|-------------|----------|
| text   | text      | abcdef  | (date) | (Full text..) | abcde & f.. | 1111.txt |
| text2  | text2     | abcdef2 | (date) | (Full text..) | abcde & f.. | 1112.txt |
The second row would come from another text file.
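The mapping from one email to one row can be sketched with only the standard library's `re` module; this is a minimal sketch, with the sample text and field names taken from the email format above:

```python
import re

# Sample email body in the repeating format described above.
sample = """From: Person-1 [email@person-1.com]
Sent: Tuesday, April 18, 2017 11:24 AM
To: email@person-2.com
Subject: Important Subject
User,
Below is your search alert.
Target: text
Attribute: text
Label: abcdef
Time: Apr 18, 2017 11:24 EDT
Full Text: Text of various length exists here.
Recording: abcde & fghijk lmnop
"""

fields = ["Target", "Attribute", "Label", "Time", "Full Text", "Recording"]
row = {}
for field in fields:
    # Capture everything after "Field:" up to the end of that line.
    match = re.search(re.escape(field) + r":(.*)", sample)
    if match:
        row[field] = match.group(1).strip()

print(row["Target"])  # text
print(row["Label"])   # abcdef
```

Each `row` dict would then become one line of the CSV (plus the filename).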
I have code that iterates over all the text files and prints them out. Here is the code:
```
# -*- coding: utf-8 -*-
import os
import sys

# Take all text files in workingDirectory and put them into a DF.
def convertText(workingDirectory, outputDirectory):
    if workingDirectory == "":
        workingDirectory = os.getcwd() + "\\"  # Default to the current working directory if none is given.
    i = 0
    for txt in os.listdir(workingDirectory):  # Iterate through the text files in workingDirectory
        print("Processing File: " + str(txt))
        fileExtension = txt.split(".")[-1]
        if fileExtension == "txt":
            textFilename = workingDirectory + txt  # Becomes: \PATH\example.txt
            with open(textFilename, "r") as f:
                data = f.read()  # read what is inside
            print(data)  # print to show it is readable
            # RegEx goes here?
            i += 1  # counter
    print("Successfully read " + str(i) + " files.")

def main(argv):
    workingDirectory = "../Documents/folder//"  # Put your source directory of text files here
    outputDirectory = "../Documents//"  # Where you want your converted files to go.
    convertText(workingDirectory, outputDirectory)

if __name__ == "__main__":
    main(sys.argv[1:])
```
I'm guessing I probably need RegEx to parse the files? What would you recommend?
I'm not opposed to using R or something else if it makes more sense.
Thanks.
Answer 0 (score: 1)
A regex should be sufficient for your use case. With the pattern `r"\sTarget:(.*)"` you can match everything on the line after `Target:`; by building a list of all the fields you want to match and iterating over it, you can construct a dictionary that stores each field's value.
Using the Python csv library, you can create the CSV file and, for each `.txt` file in the directory, push one row of the matched dictionary fields with `writer.writerow({'Target':'','Attribute':'','Time':'','Filename':'','Label':''})`.
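As a quick sanity check of that pattern (the sample line here is made up):

```python
import re

line = "\nTarget: some value here\n"
match = re.search(r"\sTarget:(.*)", line)
print(repr(match.group(1)))  # ' some value here' (note the leading space; .strip() removes it)
```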
**Example:**
```
import os
import sys
import re
import csv

# Take all text files in workingDirectory and write their fields into a CSV.
def convertText(workingDirectory, outputDirectory):
    with open(outputDirectory + 'emails.csv', 'w') as csvfile:  # opens the file \PATH\emails.csv
        fields = ['Target', 'Attribute', 'Label', 'Time', 'Full Text']  # fields you're searching for with regex
        csvfields = fields + ['Filename']  # the CSV header also includes the file name, which is not found via regex
        writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=csvfields)
        writer.writeheader()  # writes csvfields as the header row of the csv
        if workingDirectory == "":
            workingDirectory = os.getcwd() + "\\"  # Default to the current working directory if none is given.
        i = 0
        for txt in os.listdir(workingDirectory):  # Iterate through the text files in workingDirectory
            print("Processing File: " + str(txt))
            fileExtension = txt.split(".")[-1]
            if fileExtension == "txt":
                textFilename = workingDirectory + txt  # Becomes: \PATH\example.txt
                with open(textFilename, "r") as f:
                    data = f.read()  # read what is inside
                fieldmatches = {'Filename': txt}
                for field in fields:
                    # e.g. r"\sTarget:(.*)" selects everything on the line after "Target:"
                    regex = "\\s" + field + ":(.*)"
                    match = re.search(regex, data)
                    if match:
                        fieldmatches[field] = match.group(1)
                writer.writerow(fieldmatches)  # one dict of fields and values per file, written as a csv row
                i += 1  # counter
    print("Successfully read " + str(i) + " files.")

def main(argv):
    workingDirectory = "../Documents/folder//"  # Put your source directory of text files here
    outputDirectory = "../Documents//"  # Where you want your converted files to go.
    convertText(workingDirectory, outputDirectory)

if __name__ == "__main__":
    main(sys.argv[1:])
```
Processing the files should be fast enough; on my machine it took under a second:

```
Successfully read 1866 files.
Time: 0.6991933065852838
```
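One caveat: `(.*)` stops at the end of the line, so a `Full Text` value that spans several lines would be cut off. If that turns out to matter, one option (a sketch, assuming `Recording:` always follows `Full Text:`) is a non-greedy `re.DOTALL` pattern that captures up to the next field label:

```python
import re

data = """Full Text: First line of the alert text.
It continues onto a second line.
Recording: abcde & fghijk lmnop
"""

# re.DOTALL lets "." cross newlines; the non-greedy ".*?" stops at "Recording:".
match = re.search(r"Full Text:(.*?)\nRecording:", data, re.DOTALL)
print(match.group(1).strip())
```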
Hope this helps!