Python: multiple text files to a DataFrame

Asked: 2017-06-02 19:03:10

Tags: python regex python-2.7 python-3.x pandas

I'm a bit stuck on how to proceed, so a nudge in the right direction would be very helpful.

I have ~1800 text files (they are actually emails) in a repeating format.

Each file is structured like this:

From: Person-1 [email@person-1.com]
Sent: Tuesday, April 18, 2017 11:24 AM
To: email@person-2.com
Subject: Important Subject

User, 

Below is your search alert.

Target: text

Attribute: text

Label: abcdef

Time: Apr 18, 2017 11:24 EDT

Full Text: Text of various length exists here. Some files even have links. I'm not sure how I would capture a varied length field.

Recording: abcde & fghijk lmnop

That's the gist of it.

I'd like to write this into a DataFrame that I can then save as a CSV.

I'd like to end up with something like this:

| Target | Attribute |  Label  |  Time  |  Full Text  | Recording | Filename |
|--------|-----------|---------|--------|-------------|-----------|----------|
|    text|       text|   abcdef| (date) |(Full text..)|abcde & f..| 1111.txt |
|   text2|      text2|  abcdef2| (date) |(Full text..)|abcde & f..| 1112.txt |

The second row would come from another text file.

I have code that iterates through all the text files and prints them out. Here it is:

# -*- coding: utf-8 -*-
import os
import sys

# Take all text files in workingDirectory and put them into a DF.
def convertText(workingDirectory, outputDirectory):
    if workingDirectory == "": workingDirectory = os.getcwd() + "\\" # Returns current working directory, if workingDirectory is empty.
    i = 0
    for txt in os.listdir(workingDirectory): # Iterate through text files in workingDirectory
        print("Processing File: " + str(txt))
        fileExtension = txt.split(".")[-1]
        if fileExtension == "txt":
            textFilename = workingDirectory + txt # Becomes: \PATH\example.txt
            f = open(textFilename, "r")
            data = f.read() # read what is inside
            f.close()
            print(data) # print to show it is readable

            #RegEx goes here?

            i += 1 # counter
    print("Successfully read " + str(i) + " files.")


def main(argv):
    workingDirectory = "../Documents/folder//" # Put your source directory of text files here
    outputDirectory = "../Documents//" # Where you want your converted files to go.

    convertText(workingDirectory, outputDirectory)

if __name__ == "__main__":
    main(sys.argv[1:])

I think I probably need a regex to parse the files? What would you recommend?

I'm not opposed to using R or something else if that makes more sense.

Thanks.

1 Answer:

Answer 0 (Score: 1)

A regular expression should be sufficient for your use case. With the regex r"\sTarget:(.*)" you can capture everything on the line that follows Target:; by building a list of all the fields you want to match and iterating over it, you can build a dictionary that stores each field's value.
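
For example, here is a quick, minimal illustration of that capture (the snippet is a made-up stand-in for one email body; the leading newline supplies the whitespace that \s expects before the field name):

import re

data = "\nTarget: some text\nAttribute: other text\n" # hypothetical stand-in for the contents of one email

match = re.search(r"\sTarget:(.*)", data)
if match:
    print(match.group(1)) # prints " some text" (with a leading space; .strip() would remove it)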

Using the Python csv library, you can create a CSV file and, for each .txt file in the directory, push a row of the matched dictionary fields, e.g. writer.writerow({'Target':'','Attribute':'','Time':'','Filename':'','Label':''}).

Example:

import os
import sys
import re
import csv 

# Take all text files in workingDirectory and put them into a DF.
def convertText(workingDirectory, outputDirectory):
    with open(outputDirectory + 'emails.csv', 'w') as csvfile: # opens the file \PATH\emails.csv
        fields = ['Target', 'Attribute', 'Label', 'Time', 'Full Text', 'Recording'] # fields you're searching for with regex
        csvfield = ['Target', 'Attribute', 'Label', 'Time', 'Full Text', 'Recording', 'Filename'] # the file name goes in the csv header but is not found with regex
        writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=csvfield)
        writer.writeheader() # writes the csvfield list as the header row of the csv

        if workingDirectory == "": workingDirectory = os.getcwd() + "\\" # Returns current working directory, if workingDirectory is empty.
        i = 0
        for txt in os.listdir(workingDirectory): # Iterate through text files in workingDirectory
            print("Processing File: " + str(txt))
            fileExtension = txt.split(".")[-1]
            if fileExtension == "txt":
                textFilename = workingDirectory + txt # Becomes: \PATH\example.txt
                f = open(textFilename, "r")
                data = f.read() # read what is inside
                f.close()

                #print(data) # print to show it is readable
                fieldmatches = {'Filename': txt} # store the file name alongside the matched fields
                for field in fields:
                    regex = "\\s" + field + ":(.*)" # e.g. r"\sTarget:(.*)" selects everything on the line after "Target:"
                    match = re.search(regex, data)
                    if match:
                        fieldmatches[field] = match.group(1)
                writer.writerow(fieldmatches) # for each file, write the dict of matched fields and their values as a csv row
                i += 1 # counter
        print("Successfully read " + str(i) + " files.")


def main(argv):
    workingDirectory = "../Documents/folder//" # Put your source directory of text files here
    outputDirectory = "../Documents//" # Where you want your converted files to go.

    convertText(workingDirectory, outputDirectory)

if __name__ == "__main__":
    main(sys.argv[1:])

This should be plenty fast for your files; on my machine it took less than a second:

Successfully read 1866 files.
Time: 0.6991933065852838
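
If you would rather end up with an actual pandas DataFrame, as the question asks, a minimal sketch of the same idea could collect one dict per file and build the frame at the end (this assumes pandas is installed; the helper name parseEmails is just for illustration, and the field names and directory paths are reused from above):

import os
import re
import pandas as pd

fields = ['Target', 'Attribute', 'Label', 'Time', 'Full Text', 'Recording']

def parseEmails(workingDirectory):
    rows = []
    for txt in os.listdir(workingDirectory):
        if txt.split(".")[-1] != "txt":
            continue # skip anything that is not a .txt file
        with open(os.path.join(workingDirectory, txt), "r") as f:
            data = f.read()
        row = {'Filename': txt} # keep track of which file each row came from
        for field in fields:
            match = re.search(r"\s" + field + ":(.*)", data)
            row[field] = match.group(1).strip() if match else ""
        rows.append(row)
    return pd.DataFrame(rows, columns=fields + ['Filename'])

df = parseEmails("../Documents/folder//") # same source directory as in the code above
df.to_csv("../Documents//emails.csv", index=False) # or keep working with df directly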

Hope this helps!