How do I split a text file into multiple lists based on whitespace in Python?

Asked: 2015-01-19 05:10:38

Tags: python list split tokenize

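I'm new to Python programming. Please help me create a function that takes a text file as an argument and builds a list of words, removing all punctuation and "splitting" the list on double spaces. That is, a new sublist should start at every occurrence of a double space in the text file.

Here is my function: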

def tokenize(document):
    file = open("document.txt","r+").read()
    print re.findall(r'\w+', file)

The input text file contains the following string:

What's did the little boy tell the game warden?     His dad was in the kitchen poaching eggs!

Note: there are two spaces after "warden?" and before "His".

My function gives me output like this:

['what','s','did','the','little','boy','tell','the','game','warden','His','dad','was','in','the','kitchen','poaching','eggs']

Desired output:

[['what','s','did','the','little','boy','tell','the','game','warden'],
['His','dad','was','in','the','kitchen','poaching','eggs']]

4 Answers:

Answer 0 (score: 0)

First split the text on punctuation, then in a second pass split each resulting string on whitespace.

import re

def splitByPunct(s):
    # Yield each maximal run of characters that contains no . , ? or !
    return (m.group(0) for m in re.finditer(r'[^.,?!]+', s))

[x.split() for x in splitByPunct("some string, another   string! The phrase")]

This produces

[['some', 'string'], ['another', 'string'], ['The', 'phrase']]

Answer 1 (score: 0)

First split the whole text on double spaces, then pass each item to the regex:

>>> import re
>>> text = "What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!"
>>> file = text.split('  ')
>>> file
["What's did the little boy tell the game warden?", 'His dad was in the kitchen poaching eggs!']
>>> res = []
>>> for sen in file:
...    res.append(re.findall(r'\w+', sen))
... 
>>> res
[['What', 's', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden'], ['His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']]
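The session above can be wrapped in a function. This is a minimal sketch assuming Python 3; the name split_on_double_space is hypothetical and does not come from the answer. One assumption worth flagging: a run of more than two spaces makes text.split('  ') return empty strings, so the sketch drops the empty word lists that result.

```python
import re

def split_on_double_space(text):
    """Split text on double spaces, then extract the words of each chunk."""
    chunks = text.split('  ')
    word_lists = (re.findall(r'\w+', chunk) for chunk in chunks)
    # Drop empty lists produced by runs of more than two spaces.
    return [words for words in word_lists if words]

sample = "What's did the little boy tell the game warden?  His dad was in the kitchen poaching eggs!"
print(split_on_double_space(sample))
```

Filtering out empty lists is a design choice for this sketch; if empty blocks are meaningful in your data, keep them instead.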

Answer 2 (score: 0)

Here is a reasonable all-regex approach:

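import re

def tokenize(document):
    # document is the path of the text file to read
    with open(document) as f:
        text = f.read()
    # Split on any run of two or more whitespace characters
    blocks = re.split(r'\s\s+', text)
    return [re.findall(r'\w+', b) for b in blocks]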
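To try this approach without creating document.txt by hand, here is a minimal sketch assuming Python 3. It restates the function under the hypothetical name tokenize_file, with the file path passed in as the argument, and writes the sample sentence from the question to a temporary file.

```python
import os
import re
import tempfile

def tokenize_file(path):
    """Read a text file and return one word list per whitespace-separated block."""
    with open(path) as f:
        text = f.read()
    # A run of two or more whitespace characters separates blocks.
    blocks = re.split(r'\s\s+', text)
    return [re.findall(r'\w+', b) for b in blocks]

sample = "What's did the little boy tell the game warden?     His dad was in the kitchen poaching eggs!\n"

# Write the sample to a temporary file so the function has something to read.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    result = tokenize_file(path)
finally:
    os.remove(path)

print(result)
```

Because the pattern \s\s+ matches any run of two or more whitespace characters, it handles the five spaces in the sample as a single separator. It would also treat a blank line as a separator, which may or may not be what you want.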

Answer 3 (score: 0)

The built-in split method accepts a multi-character separator, so it can split on a double space.

This:

a = "hello world.  How are you"
b = a.split('  ')
c = [ x.split(' ') for x in b ]

yields:

c = [['hello', 'world.'], ['How', 'are', 'you']]

If you also want to remove punctuation, apply a regular expression to the elements of "b", or to "x" in the third statement.
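A minimal sketch of that last step, assuming Python 3. It uses re.sub with the pattern [^\w\s] to delete anything that is neither a word character nor whitespace; that pattern is one possible choice and is not specified in the answer.

```python
import re

a = "hello world.  How are you?"
b = a.split('  ')
# Remove punctuation from each chunk, then split the chunk into words.
c = [re.sub(r'[^\w\s]', '', x).split() for x in b]
print(c)  # [['hello', 'world'], ['How', 'are', 'you']]
```

Note that this deletes apostrophes, so "What's" becomes "Whats", whereas re.findall(r'\w+', ...) would return 'What' and 's' as separate words.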