Python - Can't split lines of a txt file into words

Date: 2013-11-19 19:25:31

Tags: python list file-io split

My goal is to open a file, split it into unique words, and display that list (along with a count). I think I have to split the file into lines, then split those lines into words, and add them all to a list.

The problem is that my program either runs in an infinite loop without displaying any results, or it reads only one line and then stops. The file being read is the Gettysburg Address.

def uniquify( splitz, uniqueWords, lineNum ):
    for word in splitz:
        word = word.lower()
        if word not in uniqueWords:
            uniqueWords.append( word )

def conjunctionFunction():

    uniqueWords = []

    with open(r'C:\Users\Alex\Desktop\Address.txt') as f :
        getty = [line.rstrip('\n') for line in f]
    lineNum = 0
    lines = getty[lineNum]
    getty.append("\n")
    while lineNum < 20 :
        splitz = lines.split()
        lineNum += 1

        uniquify( splitz, uniqueWords, lineNum )
    print( uniqueWords )


conjunctionFunction()

5 Answers:

Answer 0 (score: 3)

With your current code, the line:

lines = getty[lineNum]

should be moved inside the while loop.
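A minimal sketch of that fix, with the file read replaced by an in-memory list of lines so the logic is easy to check (dropping the unused lineNum argument and bounding the loop by the file length instead of a fixed 20 are my adjustments):

```python
def uniquify(splitz, uniqueWords):
    # Add each lowercased word to the list if it is not already present.
    for word in splitz:
        word = word.lower()
        if word not in uniqueWords:
            uniqueWords.append(word)

def conjunctionFunction(getty):
    # 'getty' stands in for the list of lines read from Address.txt.
    uniqueWords = []
    lineNum = 0
    while lineNum < len(getty):      # bound by the file, not a fixed 20
        lines = getty[lineNum]       # fetch the CURRENT line on every pass
        splitz = lines.split()
        lineNum += 1
        uniquify(splitz, uniqueWords)
    return uniqueWords
```

With the assignment inside the loop, each iteration processes a different line instead of re-splitting the first one forever.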

Answer 1 (score: 1)

You've figured out what's wrong with your code, but I'd do it slightly differently. Since you need to track the unique words and their counts, you should use a dictionary for this task:

wordHash = {}

with open(r'C:\Users\Alex\Desktop\Address.txt', 'r') as f:
    for line in f:
        line = line.rstrip().lower()

        # split() yields words; iterating the string directly would yield characters
        for word in line.split():
            if word not in wordHash:
                wordHash[word] = 1
            else:
                wordHash[word] += 1

print(wordHash)

Answer 2 (score: 0)

def splitData(filename):
    return [words for words in open(filename).read().split()]

The easiest way to split a file into words :)
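The list comprehension above is redundant and the file handle is never closed; a tidier variant with the same result (my adjustment):

```python
def splitData(filename):
    # The with-block closes the file deterministically;
    # read().split() splits the whole text on any whitespace.
    with open(filename) as f:
        return f.read().split()
```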

Answer 3 (score: 0)

Assuming inp has been retrieved from the file:

inp = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense."""


data = inp.splitlines()

print(data)

_d = {}

for line in data:
    word_lst = line.split()
    for word in word_lst:
        if word in _d:
            _d[word] += 1
        else:
            _d[word] = 1

print(list(_d.keys()))

Output (dict keys appear in insertion order on Python 3.7+):

['Beautiful', 'is', 'better', 'than', 'ugly.', 'Explicit', 'implicit.', 'Simple', 'complex.', 'Complex', 'complicated.', 'Flat', 'nested.', 'Sparse', 'dense.']

Answer 4 (score: 0)

I suggest:

#!/usr/local/cpython-3.3/bin/python

import pprint
import collections

def genwords(file_):
    for line in file_:
        for word in line.split():
            yield word

def main():
    with open('gettysburg.txt', 'r') as file_:
        result = collections.Counter(genwords(file_))

    pprint.pprint(result)

main()

...but you could handle punctuation better by using re.findall instead of string.split.
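A sketch of that suggestion, combining re.findall with the Counter approach above (the \w+ pattern and the count_words name are my choices, not from the answer):

```python
import collections
import re

def count_words(text):
    # \w+ matches runs of word characters, so trailing punctuation
    # such as 'nation.' no longer produces a separate "word".
    return collections.Counter(re.findall(r"\w+", text.lower()))
```

For example, count_words("Four score... four!") counts both "four" tokens together, whereas str.split would keep "four!" as a distinct entry.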