Remove specific punctuation from words in a text file

Time: 2018-10-19 06:10:06

Tags: python python-3.x

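I need to write a function, get_words_from_file(filename), that returns a list of lowercase words. The function should only process the lines between the start and end marker lines. The words should be in the same order as they occur in the file. Here is a sample text file, baboosh.txt: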

*** START OF TEST CASE ***
......list of sentences here.....
*** END OF TEST CASE ***
This is after the end and should be ignored too.

This is what I came up with:

import re
from string import punctuation

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line


def lines_from_file(fname):
    with open(fname, 'rt') as flines:
        for line in stripped_lines(flines):
            yield line


def is_marker_line(line, start='***', end='***'):
    '''
    Marker lines start and end with the given strings, which may not
    overlap. (A line containing just '***' is not a valid marker line.)
    '''
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    '''
    '''
    for line in lines:
        if is_marker_line(line):
            break


def lines_before_next_marker(lines):

    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        line.replace('"', '')
        valid_lines.append(line)


    for content_line in valid_lines:
        yield content_line


def lines_between_markers(lines):
    '''
    Yields the lines between the first two marker lines.
    '''
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line


def words(lines):
    text = '\n'.join(lines).lower().split()
    return text


def get_words_from_file(fname):
    return words(lines_between_markers(lines_from_file(fname)))

#This is the test code that must be executed
filename = "baboosh.txt"
words = get_words_from_file(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(words)))
print("Valid word list:")
for word in words:
    print(word)
  

My output

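I get the correct list of words, but when they are printed, punctuation such as colons, semicolons and periods is still attached to them. I don't know how else to get rid of it.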
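To make the symptom concrete, here is a minimal reproduction. The sample line is made up for illustration and is not taken from baboosh.txt; it only shows that splitting on whitespace, as words() does, leaves punctuation attached to each token.

```python
# Hypothetical sample line, not taken from baboosh.txt.
line = 'Stop; wait: go.'

# Same steps as words(): lowercase, then split on whitespace only.
tokens = line.lower().split()

print(tokens)  # ['stop;', 'wait:', 'go.']
```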

What should I do?

1 answer:

Answer 0 (score: 1)

Use re.split instead of str.split. If you set up a compiled regular expression like this:

splitter = re.compile('[ ;:".]')

then you can split each line with:

splitter.split(line)

This returns the words without that punctuation.
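As a rough sketch of how this could slot into the question's code, the function below could stand in for words(lines). The name split_words is hypothetical. The character class is the one from the answer with two assumptions added: \s so that tabs and newlines also act as separators, and a trailing + so that a run of delimiters counts as one split point. re.split can still return empty strings at the start or end of a line, so those are filtered out.

```python
import re

# Assumption: the answer's delimiter set, widened with \s and a trailing +.
# Add further characters to the class if other punctuation must be removed.
SPLITTER = re.compile(r'[\s;:".]+')


def split_words(lines):
    """Return the lowercase words from an iterable of lines, in order."""
    result = []
    for line in lines:
        # Drop the empty strings re.split yields when a line starts or
        # ends with a delimiter.
        result.extend(w for w in SPLITTER.split(line.lower()) if w)
    return result


# Made-up sample lines for illustration only.
sample = ['He said: "Stop; now."', 'Then nothing.']
print(split_words(sample))
# ['he', 'said', 'stop', 'now', 'then', 'nothing']
```

Only the characters listed in the class are removed; commas, question marks and other punctuation would remain attached unless they are added to it.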