Removing invalid symbols from text in Python

Asked: 2013-11-20 20:05:38

Tags: python string python-3.x

I am trying to remove invalid symbols from a text. I have this code:

def parse_documentation(filename):
    filename=open(filename)
    invalidsymbols=["`","~","!", "@","#","$"]
    for lines in filename:
        print(lines)
        for word in lines:
            print(word)
            for letter in word:
                if invalidsymbols==letter:
                    print(letter)

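For now I am only printing the letter to test it; later I will add code to remove it (del()). I have more invalid symbols than the ones in the list, but there are a lot of them, so I wanted to test with only 5 or 6. The problem is that it prints not only the invalid symbols but every letter in my text. Also, for some reason, it prints extra characters before my text. How can I fix this?

The text I am using is: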

he's a jolly good fellow#
I want pizza!
I'm driving to school$

4 answers:

Answer 0 (score: 3)

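You can use str.translate to remove all the unwanted symbols in one pass. In Python 3 the deletion table is built with str.maketrans (in Python 2 the call was txt.translate(None, "`~!@#$")):

>>> txt = """he's a jolly good fellow#
... I want pizza!
... I'm driving to school$"""
>>> print(txt.translate(str.maketrans("", "", "`~!@#$")))
he's a jolly good fellow
I want pizza
I'm driving to school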

So your code could look something like this:

def parse_documentation(filename, invalid_symbols):
    # Build a table that deletes every invalid symbol (Python 3).
    table = str.maketrans('', '', ''.join(invalid_symbols))
    with open(filename) as in_file:
        for line in in_file:
            safe_line = line.translate(table)
            # ... code that does something with safe_line goes here ...

You would call this function with:

parse_documentation(filename, ["`","~","!", "@","#","$"])
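
To try the same idea without a file on disk, here is a minimal Python 3 sketch that works on any iterable of lines and returns the cleaned lines. The helper name clean_lines is hypothetical and not part of the answer above, and it assumes every invalid symbol is a single character.

```python
def clean_lines(lines, invalid_symbols):
    """Return the lines with every invalid symbol deleted.

    Hypothetical helper for illustration; assumes each symbol is one character.
    """
    # With three arguments, str.maketrans maps every character of the
    # third argument to None, which str.translate treats as "delete".
    table = str.maketrans("", "", "".join(invalid_symbols))
    return [line.translate(table) for line in lines]


text = "he's a jolly good fellow#\nI want pizza!\nI'm driving to school$"
cleaned = clean_lines(text.splitlines(), ["`", "~", "!", "@", "#", "$"])
print("\n".join(cleaned))
```

Because the table is built once and reused, each line is cleaned in a single pass regardless of how many symbols are in the list.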

Answer 1 (score: 0)

def parse_documentation(filename):
    invalidsymbols = ["`", "~", "!", "@", "#", "$"]
    with open(filename, "r") as f:  # open file
        lines = f.readlines()  # read all the lines in the file into a list named "lines"
    for line in lines:  # for each line in lines
        for x in invalidsymbols:  # loop through the list of invalid symbols
            if x in line:  # if the invalid symbol is in the line
                print(line)  # print out the line
                print(x)  # and also print out the invalid symbol you encountered in that line
                print(line.replace(x, ""))  # print out the line with the invalid symbol removed

How about that?

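Answer 2 (score: 0)

JoeC has already answered, but I would add that if an invalid symbol occurs several times in a line, you may be better off doing the following: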

def parse_documentation(filename):
    invalidsymbols = ["`", "~", "!", "@", "#", "$"]
    with open(filename) as filelines:  # the file is closed when the block ends
        for line in filelines:
            print(line)
            for symbol in invalidsymbols:
                if symbol in line:
                    print("Above line contains %s symbol" % symbol)

For replacing the symbols, see JoeC's answer.

Answer 3 (score: 0)
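
If it also matters how many times each symbol occurs, str.count can report that per line. The following is only a sketch: the function name report_symbols is made up for illustration, and the sample is the question's text with a second "!" added so that one symbol repeats.

```python
def report_symbols(lines, invalid_symbols):
    """Return (line_number, symbol, count) for each invalid symbol found.

    Hypothetical helper for illustration only.
    """
    findings = []
    for number, line in enumerate(lines, start=1):
        for symbol in invalid_symbols:
            count = line.count(symbol)  # number of non-overlapping occurrences
            if count:
                findings.append((number, symbol, count))
    return findings


# The question's sample text, with an extra "!" added for the demonstration.
sample = ["he's a jolly good fellow#", "I want pizza!!", "I'm driving to school$"]
for number, symbol, count in report_symbols(sample, ["`", "~", "!", "@", "#", "$"]):
    print("Line %d contains %r %d time(s)" % (number, symbol, count))
```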

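Try the textcleaner library for this task.
Home page and documentation: https://pypi.org/project/textcleaner/
Call the remove_symbols function, which returns the plain text. It uses only regular expressions.
Function description: https://yugantm.github.io/textcleaner/documentation.html#remove_symbols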
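
For comparison, a regular-expression approach can be sketched with the standard re module alone. This is not taken from textcleaner's source, and the function name remove_symbols_sketch is hypothetical; it only illustrates the general technique of deleting a set of single-character symbols with re.sub.

```python
import re


def remove_symbols_sketch(text, invalid_symbols):
    """Delete every listed single-character symbol from text.

    Illustrative only; not the textcleaner implementation.
    """
    if not invalid_symbols:
        return text  # an empty character class "[]" would be an invalid pattern
    # re.escape protects characters such as "$" that are special in a regex.
    pattern = "[" + re.escape("".join(invalid_symbols)) + "]"
    return re.sub(pattern, "", text)


print(remove_symbols_sketch("I'm driving to school$", ["`", "~", "!", "@", "#", "$"]))
```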