How can I change certain words in an HTML file while leaving all tags and markup unchanged? For example: change "doors" to "cars" and "Any" to "Every".
tf = open(html_file)
text = tf.read()
tws = text.split( )
The original text is:
the doors.</p>
<p>“Any man
The result should be:
the cars.</p>
<p>“Every man
With the code above, it is split like this:
the
doors.</p>
<p>“Any
man
This would be better:
the
doors.
</p>
<p>
“Any
man
I think the best approach would be to split it into individual words:
the
doors
.
</p>
<p>
“
Any
man
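The word-level split described above can be sketched with a single regular expression. This is only an illustration under stated assumptions, not a full HTML parser: the names TOKEN_RE, REPLACEMENTS, tokenize and replace_in_html are hypothetical, and the pattern assumes every "<...>" run is a tag. Whitespace is kept as its own token so that joining the tokens reproduces the input exactly.

```python
import re

# Hypothetical token pattern: a tag, a run of word characters,
# a run of whitespace, or a single punctuation character.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|\s+|[^\w\s]")

# Hypothetical case-sensitive mapping taken from the example in the question.
REPLACEMENTS = {"doors": "cars", "Any": "Every"}


def tokenize(html_text):
    """Split text into tags, words, whitespace runs and punctuation."""
    return TOKEN_RE.findall(html_text)


def replace_in_html(html_text):
    """Replace whole-word tokens; tags never equal a bare word, so they pass through."""
    return "".join(REPLACEMENTS.get(tok, tok) for tok in tokenize(html_text))


# Sample modeled on the text in the question.
sample = "the doors.</p>\n<p>Any man"
print([tok for tok in tokenize(sample) if not tok.isspace()])
print(replace_in_html(sample))
```

Because a regular expression does not understand HTML, this sketch would also rewrite words inside comments or script blocks, and a literal "<" in the text could be mistaken for the start of a tag.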
Answer 0 (score: 0)
Use BeautifulSoup to parse the HTML into a tree, traverse the tree, and replace the words in each text node.
Here is the code:
from bs4 import BeautifulSoup
from bs4.element import NavigableString
import re

# Words to replace; lookups are case-sensitive.
replace_dict = {'doors': 'cars', 'any': 'every'}

def replace_words(s):
    # The capturing group keeps the separators in the result, so joining
    # the pieces reproduces the original spacing and punctuation.
    words = re.split(r"([,.;\s]\s*)", s)
    new_words = [replace_dict.get(word, word) for word in words]
    return ''.join(new_words)

def traverse(node):
    # Iterate over a copy, because replace_with() modifies node.contents.
    for section in list(node.contents):
        if isinstance(section, NavigableString):
            print("Find a text string:" + str(section))
            newstr = replace_words(str(section))
            print("New string:" + newstr)
            # Replace this text node itself rather than node.string.
            section.replace_with(newstr)
        else:
            traverse(section)

html = "<html><body><a>Open doors.</a><div>any element should be replaced.</div></body></html>"
soup = BeautifulSoup(html, "html.parser")
traverse(soup)
print(soup)
# The final print shows:
# <html><body><a>Open cars.</a><div>every element should be replaced.</div></body></html>

The replace_words function only works for simple sentences with periods, commas, and semicolons. You could write a more robust tokenizer to split the words, or use a tokenizer from nltk.
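The answer's replace_words splits on only a few punctuation characters and matches words case-sensitively, so "Any" from the question would not be replaced. One possible drop-in alternative, sketched here as an assumption rather than as part of the original answer, uses word boundaries and carries the capitalization of the original word over to the replacement. The names REPLACEMENTS, WORD_RE and match_case are hypothetical.

```python
import re

# Hypothetical mapping; keys are lowercase so lookups can ignore case.
REPLACEMENTS = {"doors": "cars", "any": "every"}

# \b restricts matches to whole words, so "anyone" or "indoors" are left alone.
WORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b",
    re.IGNORECASE,
)


def match_case(original, replacement):
    """Carry simple capitalization of the original word over to the replacement."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def replace_words(text):
    """Replace whole words from REPLACEMENTS, ignoring case when matching."""
    def substitute(match):
        word = match.group(0)
        return match_case(word, REPLACEMENTS[word.lower()])
    return WORD_RE.sub(substitute, text)


print(replace_words("Any man can open the doors."))
```

Since this function takes and returns a plain string, it could be called from traverse on each text node in place of the original replace_words.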