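Python - Splitting words in a txt file

Time: 2017-10-17 19:48:27

Tags: python file

I want to write a program that splits a txt file into its individual words and returns a list of those words, with no word repeated. I converted my PDF book to txt and ran my program on it, but it failed completely. I can't tell what I did wrong. Here is my code: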

def split(file):
    lines = open(file, 'rU').readlines()
    words = []
    word = ''
    for line in lines:
        for letter in line:
            if letter not in [' ', '\n', '.', ',']:
                word += letter
            elif letter in [' ', '\n', '.', ',']:
                if word not in words:
                    words.append(word)
                    word = ''

    words.sort()
    return words


for word in split('AKiss.txt'):
    print(word, end=' ')

I've also attached AKiss.txt and the original PDF, in case they are useful.

PDF - http://1drv.ms/b/s!AtZrd19H_8oyabhAx-NZvIQD_Ug

TXT - http://1drv.ms/t/s!AtZrd19H_8oyapvBvAo27rNJSwQ

3 answers:

Answer 0: (score: 1)

You might want to take a different approach:

def split_file(file):
    all_words = set()
    # Plain text mode: the 'U' flag was removed in Python 3.11
    with open(file) as f:
        for ln in f:
            words = ln.strip().split()

            dot_split = []
            for w in words:
                dot_split.extend(w.split('.'))
            comma_split = []
            for w in dot_split:
                comma_split.extend(w.split(','))

            all_words = all_words.union(set(comma_split))

    all_words.discard('')  # 'end.'.split('.') leaves an empty string behind
    print(sorted(all_words))

split_file('test_file.txt')

Or, more simply, use a regular expression:

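import re

def split_file2(file):
    all_words2 = set()
    # Plain text mode: the 'U' flag was removed in Python 3.11
    with open(file) as f:
        for ln in f:
            # Raw string; '.' needs no escape inside a character class
            words2 = re.split(r'[ \t\n.,]', ln.strip())
            all_words2 = all_words2.union(set(words2))
    all_words2.discard('')  # adjacent separators produce empty strings
    print(sorted(all_words2))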

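As a side note, I wouldn't use split as a function name, because it shadows functions of the same name that you may want to use from the standard library / string library.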
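
For what it's worth, the loop in the question seems to go wrong because `word = ''` sits inside the `if word not in words:` branch: a word that has already been seen is never cleared, so it gets glued onto the next one. Below is a minimal sketch of a fix that keeps the character-by-character approach and the same four separators. The name `unique_words` and the choice to take a string rather than a file name are my own assumptions, made only to keep the example self-contained.

```python
def unique_words(text, separators=(' ', '\n', '.', ',')):
    """Return the sorted unique words in text, split on the given separators."""
    words = set()
    word = ''
    for letter in text:
        if letter not in separators:
            word += letter
            continue
        # Separator reached: store the finished word, if there is one.
        if word:
            words.add(word)
        # Reset unconditionally, whether or not the word was new.
        word = ''
    # Keep a final word that is not followed by a separator.
    if word:
        words.add(word)
    return sorted(words)


print(unique_words('a kiss, a kiss.\nThe end'))
```

To run it on a file, you could pass in the result of reading the whole file as one string. Note that sorting is case-sensitive, so capitalized words come first.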

Answer 1: (score: 1)

You can try this:

import itertools
words = list(set(itertools.chain.from_iterable([[''.join(c for c in b if c.isalpha()) for b in i.strip('\n').split()] for i in open('filename.txt') if i != "\n"])))
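
If the one-liner is hard to follow, it can be unpacked into explicit loops. The sketch below is my reading of it, not a drop-in replacement: the name `alpha_words` and the sample lines are made up for illustration, and unlike the one-liner it skips tokens that contain no letters at all, which would otherwise end up in the result as an empty string.

```python
def alpha_words(lines):
    """Collect the unique tokens in lines, keeping only alphabetic characters."""
    words = set()
    for line in lines:
        # split() with no argument breaks on any run of whitespace,
        # so blank lines simply yield no tokens.
        for token in line.split():
            cleaned = ''.join(c for c in token if c.isalpha())
            if cleaned:  # skip tokens such as '1' or '-'
                words.add(cleaned)
    return list(words)


# Hypothetical sample input; an open file object can be passed the same way.
sample = ["A kiss, a kiss.\n", "\n", "Chapter 1 - The end\n"]
print(sorted(alpha_words(sample)))
```

Be aware that filtering with `isalpha()` also removes apostrophes and hyphens inside words, so "don't" becomes "dont".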

Answer 2: (score: 0)

The strip() and split() methods can help you.
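
A rough sketch of how the two could be combined, assuming only '.' and ',' need to be removed; the helper name `words_in_line` is hypothetical:

```python
def words_in_line(line, punctuation='.,'):
    """Split a line on whitespace and trim punctuation from each word's ends."""
    result = []
    # split() with no argument splits on any run of whitespace and
    # ignores leading/trailing whitespace, including the newline.
    for token in line.split():
        # strip(chars) removes those characters from both ends only.
        word = token.strip(punctuation)
        if word:
            result.append(word)
    return result


print(words_in_line('  A kiss, a kiss.  \n'))
```

Collecting the results for every line into a set, then sorting it, gives the list without repeats that the question asks for.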