Question

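I want to write a program that splits a txt file into its individual words and returns a list of those words with no duplicates. I converted my PDF book to txt and ran my program on it, but it failed completely. I don't know what I did wrong. Here is my code: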

def split(file):
    lines = open(file, 'rU').readlines()
    words = []
    word = ''
    for line in lines:
        for letter in line:
            if letter not in [' ', '\n', '.', ',']:
                word += letter
            elif letter in [' ', '\n', '.', ',']:
                if word not in words:
                    words.append(word)
                    word = ''

    words.sort()
    return words


for word in split('AKiss.txt'):
    print(word, end=' ')

I have also attached AKiss.txt and the original PDF, in case they are useful.

PDF - http://1drv.ms/b/s!AtZrd19H_8oyabhAx-NZvIQD_Ug

TXT - http://1drv.ms/t/s!AtZrd19H_8oyapvBvAo27rNJSwQ

Answer 1

You may want to take a different approach:

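def split_file(file):
    all_words = set()
    # The 'U' mode flag was removed in Python 3.11; plain text mode
    # already handles universal newlines.
    with open(file, 'r') as f:
        for ln in f:
            words = ln.strip().split()

            dot_split = []
            for w in words:
                dot_split.extend(w.split('.'))
            comma_split = []
            for w in dot_split:
                comma_split.extend(w.split(','))

            all_words = all_words.union(set(comma_split))

    # Splitting a token such as 'end.' on '.' leaves an empty string; drop it.
    all_words.discard('')
    print(sorted(all_words))

split_file('test_file.txt')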

Or, more simply, use a regular expression:

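import re

def split_file2(file):
    all_words2 = set()
    with open(file, 'r') as f:
        for ln in f:
            # Raw string; the '.' is escaped (optional inside a character class).
            words2 = re.split(r'[ \t\n\.,]', ln.strip())
            all_words2 = all_words2.union(set(words2))
    # Adjacent separators such as ', ' produce empty strings; drop them.
    all_words2.discard('')
    print(sorted(all_words2))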

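As a side note, I would not use split as a function name, because it hides functions of the same name that you may want to use from the standard library / string module.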

Answer 2

You can try this:

import itertools
words = list(set(itertools.chain.from_iterable([[''.join(c for c in b if c.isalpha()) for b in i.strip('\n').split()] for i in open('filename.txt') if i != "\n"])))
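For readability, the one-liner above can be unrolled into a loop. This is only a sketch: the function name `alpha_words` is made up for illustration, and unlike the one-liner it closes the file, skips tokens that contain no letters, and returns the result sorted. The `utf-8` encoding is an assumption about the file.

    ```python
    def alpha_words(path):
        """Return the sorted unique words in `path`, keeping only letters."""
        words = set()
        # Assumption: the file is UTF-8 encoded.
        with open(path, encoding='utf-8') as f:
            for line in f:
                for token in line.split():
                    # Keep alphabetic characters only, as the one-liner does.
                    word = ''.join(c for c in token if c.isalpha())
                    if word:  # tokens such as '42' or '--' become empty
                        words.add(word)
        return sorted(words)
    ```

Note that dropping every non-letter also removes apostrophes and hyphens inside words, so "Don't" becomes "Dont".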

Answer 3

The strip() and split() methods can help you here.
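A minimal sketch of that idea, assuming the goal is a sorted list of unique words. The function name `unique_words`, the use of `string.punctuation` as the set of characters to strip, and the `utf-8` encoding are illustrative choices, not something given in the question.

    ```python
    import string

    def unique_words(path):
        """Return the sorted unique words found in the text file at `path`."""
        words = set()
        # Assumption: the file is UTF-8 encoded.
        with open(path, encoding='utf-8') as f:
            for line in f:
                # split() with no arguments splits on any run of whitespace.
                for token in line.split():
                    # strip() removes punctuation from both ends only,
                    # so an inner apostrophe such as in "Don't" survives.
                    word = token.strip(string.punctuation)
                    if word:
                        words.add(word)
        return sorted(words)
    ```

It could then be called as in the question, for example `print(' '.join(unique_words('AKiss.txt')))`. Using a set makes the duplicate check cheap, whereas `word not in words` on a list rescans the whole list for every word.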

Python - Splitting words in a txt file
