Question

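I am trying to structure my text document as XML, where each sentence has an id. I have a text document of unstructured sentences, and I want to split the sentences on the '.' delimiter and write them to XML. This is my code: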

    import re

    #Read the file
    with open ('C:\\Users\\ngwak\\Documents\\test.txt') as f:
        content = [f]
        split_content = []
        for element in content:
            split_content += re.split("(.)\s+", element)

        print(split_content, sep='\n\n')

But I get this error, and I can't explain it:

    TypeError: expected string or buffer

How can I split my sentences and write them to XML? Thanks a lot. This is what my txt file looks like:

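In a formal sense, the germ of national consciousness can be traced back to the Peace Treaty of Hoachanas signed on 13 June 1858 between soldiers, all the chiefs except those of the Bondelswarts (who had not been involved in the previous fighting), as well as by Muewuta, two sons of amuaha, formerly a Commandant of Chief Onag of the Triku people. There is ample epistolary and oral evidence for this view. The most pointed statement can be found in a now famous and frequently quoted letter from Onag to Bonagha, written on 13 May 1890, in which he said, among other things, that someone came on 13 June. Again, from 1 February 2015 to 01.05, some are coming.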

I would like the sentences in the XML to look like this:

    <sentence id=01>In a formal sense, the germ of national consciousness 
    can be traced back to the Peace Treaty of Hoachanas signed in 13–June-
    1858 between soldiers, all the  chiefs except those of the Bondelswarts 
    (who had not been involved in the previous fighting), as well as by 
    Muewuta, two sons of  amuaha, formerly a Commandant of Chief Onag of the 
    Triku people. </sentence>

Answer 1

    # Read the whole file; the with block closes it automatically
    with open('C:\\Users\\ngwak\\Documents\\test.txt', "r") as text_file:
        # Replace newlines with spaces so words on adjacent lines do not merge
        sentences = text_file.read().replace("\n", " ").split('.')

    for sentence in sentences:
        print(sentence.strip())
        # Or write each sentence to your XML
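To cover the XML half of the question, here is a minimal sketch using the standard library's `xml.etree.ElementTree`, assuming Python 3.6 or later. The function name `sentences_to_xml`, the root element name `document`, and the two-digit id format are assumptions chosen to resemble the desired output in the question; neither answer specifies them. The split pattern only breaks after a period that is followed by whitespace, so a date such as `01.05` is left intact.

```python
import re
import xml.etree.ElementTree as ET


def sentences_to_xml(text):
    """Build a <document> element with one <sentence id="NN"> child per sentence."""
    # Split after a period that is followed by whitespace, so "01.05" stays whole.
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]
    root = ET.Element("document")  # root element name is an assumption
    for number, sentence in enumerate(sentences, start=1):
        # Zero-padded id such as "01", "02", ...
        child = ET.SubElement(root, "sentence", id=f"{number:02d}")
        child.text = sentence
    return root


sample = "First sentence. Second one, dated 01.05. Third."
root = sentences_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

To save the result, something like `ET.ElementTree(root).write("sentences.xml", encoding="utf-8", xml_declaration=True)` should work; the output filename here is hypothetical. Note that ElementTree writes the attribute as `id="01"` with quotes, because an unquoted attribute value is not well-formed XML.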

Answer 2

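You don't need the content = [f] line. It wraps the file object itself in a list, so re.split receives a file object instead of a string, which is what raises the TypeError.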

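    import re

    with open('C:\\Users\\ngwak\\Documents\\test.txt') as file:
        split_content = []
        for element in file:
            # Split after each period that is followed by whitespace
            split_content += re.split(r"(?<=\.)\s+", element.strip())

        # Unpack the list so sep is applied between sentences
        print(*split_content, sep='\n\n')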

File objects are iterable. Using one in a for loop iterates over the file line by line.
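A small sketch of that behavior, using `io.StringIO` as a stand-in for the opened file (an assumption made so the example runs without the file on disk; the sample text is made up). It also shows one consequence of line-by-line iteration: a sentence that wraps across two lines arrives in two pieces, so it may be safer to join the lines before splitting.

```python
import io
import re

# io.StringIO stands in for the opened text file (assumption for this demo).
fake_file = io.StringIO("One sentence that wraps\nonto a second line. Another one.\n")

# Iterating a file-like object yields one line at a time, newline included.
lines = list(fake_file)

# Join the lines first so a wrapped sentence stays in one piece,
# then split after each period that is followed by whitespace.
text = " ".join(line.strip() for line in lines)
sentences = re.split(r"(?<=\.)\s+", text)

print(lines)
print(sentences)
```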

Further reading

Methods on File objects
An example in this SO answer: Iterating on a file using Python
