Preprocessing data from a text file using the split method

Time: 2018-12-18 03:17:49

Tags: python list split python-2.x

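I have included an example in the text below. What I want is to append this text to a list data structure in Python. I first split the text using '<EOS>' as the delimiter, then append each element of the split result to the list.

However, the problem I am facing is that the split method appears to split the text on both '\n' and '<EOS>'. As a result, single lines are added to the list rather than complete sections.

Please look at the code that follows the sample text below and let me know what I am doing wrong.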

Old Major, the old boar on the Manor Farm, summons the animals on the farm together for a meeting, during which he refers to humans as "enemies" and teaches the animals a revolutionary song called "Beasts of England".
When Major dies, two young pigs, Snowball and Napoleon, assume command and consider it a duty to prepare for the Rebellion.<EOS>
Alex is a 15-year-old living in near-future dystopian England who leads his gang on a night of opportunistic, random "ultra-violence".
Alex's friends ("droogs" in the novel's Anglo-Russian slang, 'Nadsat') are Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and Pete, who mostly plays along as the droogs indulge their taste for ultra-violence.
Characterised as a sociopath and a hardened juvenile delinquent, Alex also displays intelligence, quick wit, and a predilection for classical music; he is particularly fond of Beethoven, referred to as "Lovely Ludwig Van".

Python code to read the documents into a list:

f=open('./plots')
documents=[]
for x in f:
    documents.append(x.split('<EOS>'))
print documents[0]

#documents[0] should start at 'Old Major' and stop at 'Rebellion'.

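3 answers:

Answer 0: (score: 1)

Looping over f yields the file contents one line at a time, that is, already separated at newlines. Use this instead: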

f=open('./plots')
documents=f.read().split('<EOS>')
print documents[0]
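
Note that every piece after the first still begins with the newline that followed '<EOS>' in the file. A minimal sketch of trimming that whitespace, assuming the content is held in a string named text (a hypothetical stand-in for f.read(), with the sample shortened for brevity):

```python
# Stand-in for f.read(); the sample text is shortened for illustration.
text = (
    "Old Major, the old boar on the Manor Farm, summons the animals.\n"
    "When Major dies, two young pigs assume command.<EOS>\n"
    "Alex is a 15-year-old living in near-future dystopian England.\n"
    "Alex also displays intelligence and quick wit.\n"
)

# Split on the marker only, then trim surrounding whitespace from each
# document and skip pieces that are empty after trimming.
documents = [part.strip() for part in text.split('<EOS>') if part.strip()]

print(len(documents))
print(documents[0])
```

Newlines inside each document are kept; only the leading and trailing whitespace of each piece is removed.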

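Answer 1: (score: 1)

split('<EOS>') splits only on <EOS>, just as you expect. However, for x in f: works line by line, so it effectively performs an implicit split on the file.

Instead, perhaps do something like this: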

f=open('./plots')
documents=f.read().split('<EOS>')
print documents[0]
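
The implicit per-line split can be observed directly. A small sketch, assuming io.StringIO as an in-memory stand-in for the opened file (the real code would iterate over open('./plots')) and using made-up content:

```python
import io

# In-memory stand-in for open('./plots'); the u'' literals keep this
# runnable on both Python 2 and Python 3.
f = io.StringIO(u"first line\nsecond line<EOS>\nthird line\n")

# Iterating over a file object yields one line at a time,
# each still ending with its newline character.
lines = [x for x in f]

# Splitting the whole content on the marker keeps newlines inside each piece.
whole = u"".join(lines).split(u'<EOS>')

print(len(lines))
print(len(whole))
```

The loop sees three separate lines, while splitting the full content on '<EOS>' yields two pieces.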

Answer 2: (score: 1)

split() is not splitting the text on both '\n' and '<EOS>', only on the latter. It is for x in f: that effectively splits the file contents at newlines.

The following code is roughly equivalent to yours and illustrates what is happening:
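
A minimal sketch of such an equivalent, assuming the file content is held in a string named text (hypothetical, with shortened sample content) instead of being read from './plots'. It is an illustration under those assumptions, not necessarily the answerer's exact code:

```python
# Hypothetical stand-in for the content of './plots', shortened.
text = "Old Major summons the animals.\nMajor dies.<EOS>\nAlex leads his gang.\n"

documents = []
# Iterating over the file behaves much like splitting on '\n' first
# (a real file iterator additionally keeps the '\n' at the end of each line),
for x in text.split('\n'):
    # so split('<EOS>') only ever sees a single line at a time.
    documents.append(x.split('<EOS>'))

print(documents[0])
```

Here documents[0] holds only the first line, which matches the behavior described in the question.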
