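I use the following code to strip all HTML tags from a file and convert it to plain text. In addition, I have to convert XML/HTML character entities to ASCII characters. Here, 21 lines each read through the entire text, which means that converting a huge file costs a lot of resources.
Do you have any ideas for making the code more efficient and faster while reducing resource usage?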
# -*- coding: utf-8 -*-
import re
# This file contains HTML.
file = open('input-file.html', 'r')
temp = file.read()
# Replace Some XML/HTML characters to ASCII ones.
temp = temp.replace ('&lsquo;',"""'""")
temp = temp.replace ('&rsquo;',"""'""")
temp = temp.replace ('&ldquo;',"""\"""")
temp = temp.replace ('&rdquo;',"""\"""")
temp = temp.replace ('&sbquo;',""",""")
temp = temp.replace ('&prime;',"""'""")
temp = temp.replace ('&Prime;',"""\"""")
temp = temp.replace ('&laquo;',"""«""")
temp = temp.replace ('&raquo;',"""»""")
temp = temp.replace ('&lsaquo;',"""‹""")
temp = temp.replace ('&rsaquo;',"""›""")
temp = temp.replace ('&amp;',"""&""")
temp = temp.replace ('&ndash;',""" – """)
temp = temp.replace ('&mdash;',""" — """)
temp = temp.replace ('&reg;',"""®""")
temp = temp.replace ('&copy;',"""©""")
temp = temp.replace ('&trade;',"""™""")
temp = temp.replace ('&para;',"""¶""")
temp = temp.replace ('&bull;',"""•""")
temp = temp.replace ('&middot;',"""·""")
# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)
# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()
Answer 0 (score: 1)
You can use string.translate()
from string import maketrans  # Python 2 only; Python 3 provides str.maketrans instead.
intab = "aeiou"   # example: original characters that need to be replaced
outtab = "12345"  # example: new characters; must be the same length as intab
trantab = maketrans(intab, outtab)  # maketrans() is a helper in the string module that builds a translation table
text = "this is string example....wow!!!"  # your string
print text.translate(trantab)
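Note that in Python 3, str.translate is much slower than in Python 2, especially if you only translate a few characters. This is because it has to handle Unicode characters and therefore uses a dict to perform the translation instead of indexing into a string.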
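For readers on Python 3, the dict-based translation mentioned above can be sketched as follows. This is a minimal illustration, not the answerer's code: the name PUNCTUATION_TABLE, the helper to_ascii_punctuation and the choice of characters are assumptions made for the example. The keys must be single characters, but the values may be strings of any length.

```python
# Python 3 sketch: str.maketrans accepts a dict whose keys are single
# characters and whose values may be strings of any length.
PUNCTUATION_TABLE = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": " - ",  # en dash, replaced by a longer string
})


def to_ascii_punctuation(text):
    """Replace typographic punctuation in a single pass over the text."""
    return text.translate(PUNCTUATION_TABLE)


print(to_ascii_punctuation("\u201cquoted\u201d \u2018text\u2019"))
```

This only helps once the entities have already been decoded into characters; it cannot match multi-character names such as &lsquo; directly.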
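Answer 1 (score: 1)
My first instinct is to use string.translate() together with string.maketrans(). That makes only one pass over the text instead of several. Each call to str.replace() makes its own pass over the whole string, which is what you want to avoid.
An example: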
from string import ascii_lowercase, maketrans, translate
from_str = ascii_lowercase
to_str = from_str[-1]+from_str[0:-1]
foo = 'the quick brown fox jumps over the lazy dog.'
bar = translate(foo, maketrans(from_str, to_str))
print bar # sgd pthbj aqnvm enw itlor nudq sgd kzyx cnf.
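The same single-pass idea can be applied to multi-character entity names, which translate() cannot match. The sketch below is one possible approach rather than something taken from this answer: it compiles all entity names into one regular expression and looks each match up in a dict. ENTITY_MAP, ENTITY_PATTERN and replace_entities are hypothetical names, and the map lists only a few of the entities from the question.

```python
import re

# Hypothetical subset of the entities from the question.
ENTITY_MAP = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
}

# One alternation matching any of the entity names literally.
ENTITY_PATTERN = re.compile("|".join(re.escape(name) for name in ENTITY_MAP))


def replace_entities(text):
    """Replace every known entity in one scan of the text."""
    return ENTITY_PATTERN.sub(lambda match: ENTITY_MAP[match.group(0)], text)


print(replace_entities("&ldquo;fish &amp; chips&rdquo;"))
```

Because the text is scanned once, a replacement is never scanned again, so the result does not depend on the order of the entries the way a chain of str.replace() calls can.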
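Answer 2 (score: 1)
The problem with using string.translate() or string.maketrans() is that I have to map one character to exactly one other character, e.g.
print string.maketrans("abc", "123")
However, I need to map HTML/XML entities such as &lsquo; to the ASCII single quote ('). That means I would have to use the following code:
print string.maketrans("&lsquo;", "'")
which fails with the following error:
ValueError: maketrans arguments must have same length
However, if I use HTMLParser, it converts all HTML/XML entities to the corresponding characters without the above problem. I also added encode('utf-8') to resolve the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)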
# -*- coding: utf-8 -*-
import re
from HTMLParser import HTMLParser
# This file contains HTML.
file = open('input-file.txt', 'r')
temp = file.read()
# Replace all XML/HTML characters to ASCII ones.
temp = HTMLParser().unescape(temp)
# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
# Encode the text to UTF-8 for preventing some errors.
result = result.encode('utf-8')
print(result)
# Write the result to a new file.
file = open("output-file.txt", "w")
file.write(result)
file.close()
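For very large files, a streaming variant may reduce memory use, since the code above still reads the whole file at once. The following is a rough Python 3 sketch under stated assumptions, not the code from this answer: it feeds html.parser.HTMLParser one chunk at a time and writes only the text nodes. TextExtractor, strip_html and the chunk size are hypothetical names and values chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Writes text nodes to a sink and discards tags (hypothetical helper)."""

    def __init__(self, sink):
        # convert_charrefs=True makes the parser decode entities such as
        # &lsquo; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.sink = sink

    def handle_data(self, data):
        self.sink.write(data)


def strip_html(source, sink, chunk_size=64 * 1024):
    """Copy the text content of `source` to `sink`, one chunk at a time."""
    parser = TextExtractor(sink)
    # Read until source.read() returns an empty string (end of file).
    for chunk in iter(lambda: source.read(chunk_size), ""):
        parser.feed(chunk)
    parser.close()  # flush any text the parser is still buffering
```

To use it, open both files in text mode with encoding="utf-8" and call strip_html(src, dst). Opening the output with an explicit encoding also avoids the UnicodeEncodeError without a separate encode() step. Unlike the regular expression, the parser keeps the contents of script and style elements as text, so those may need separate handling.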