Question

How can I change certain words in an HTML file while leaving all tags and markup unchanged? For example: change "doors" to "cars" and "Any" to "Every".

with open(html_file) as tf:
    text = tf.read()
tws = text.split()  # splits on whitespace only, so tags stay attached to words

The original text is:

the doors.</p>
<p>“Any man

The result should be:

the cars.</p>
<p>“Every man

Currently it is split like this:

the
doors.</p>
<p>“Any
man

This would be better:

the
doors.
</p>
<p>
“Any
man

I think the best approach is to split it into individual words:

the
doors
.
</p>
<p>
“
Any
man
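
The word-level split shown above can be approximated with a single regular expression. This is only a minimal sketch: the names TOKEN_RE and tokenize are made up for illustration, and the pattern assumes well-formed tags with no stray "<" or ">" in the text. It also discards whitespace, so the original file cannot be rebuilt from the tokens alone.

```python
import re

# Try a whole tag first, then a run of word characters,
# then any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"</?[^<>]+>|\w+|[^\w\s]")

def tokenize(html_text):
    """Split text into tags, words and punctuation marks (whitespace is dropped)."""
    return TOKEN_RE.findall(html_text)

tokens = tokenize("the doors.</p>\n<p>“Any man")
print(tokens)
```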

Answer 1

Use BeautifulSoup to parse the HTML into a tree, traverse the tree, and replace the words in each text node.

Here is the code:

from bs4 import BeautifulSoup
from bs4.element import NavigableString
import re

replace_dict = {'doors': 'cars', 'any': 'every'}

def replace_words(s):
    # Split on , . ; and whitespace, keeping the separators so the text can be rejoined.
    words = re.split(r"([,.;\s]\s*)", s)
    new_words = [replace_dict.get(word, word) for word in words]
    return ''.join(new_words)

def traverse(node):
    # Iterate over a copy, because replace_with() modifies node.contents.
    for section in list(node.contents):
        if isinstance(section, NavigableString):
            # Replace the text node itself; tags are left untouched.
            section.replace_with(replace_words(str(section)))
        else:
            traverse(section)

html = "<html><body><a>Open doors.</a><div>any element should be replaced.</div></body></html>"
soup = BeautifulSoup(html, "html.parser")
traverse(soup)
print(soup)

Output:

<html><body><a>Open cars.</a><div>every element should be replaced.</div></body></html>

The replace_words function only works for simple sentences with periods, commas, and semicolons. You could write a more robust tokenizer to split the words, or use one from nltk.
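
As one possible way to make replace_words more robust without extra libraries, the sketch below matches whole words with regex word boundaries and carries over the capitalization of the original word, so that "Any" becomes "Every" as the question asks. The names REPLACEMENTS, WORD_RE and match_case are illustrative and not part of the answer's code; this is a swapped-in technique, not the tokenizer the answer had in mind.

```python
import re

# Keys must be lowercase; matching itself is case-insensitive.
REPLACEMENTS = {"doors": "cars", "any": "every"}

# \b ... \b restricts matches to whole words, so "anybody" is not touched.
WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in REPLACEMENTS) + r")\b",
    re.IGNORECASE,
)

def match_case(original, replacement):
    """Give the replacement the same capitalization style as the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement

def replace_words(s):
    """Replace every whole-word match in s, leaving all other characters as they are."""
    def swap(match):
        word = match.group(0)
        return match_case(word, REPLACEMENTS[word.lower()])
    return WORD_RE.sub(swap, s)

print(replace_words("“Any man opened the doors, not anybody's DOORS."))
```

Because it takes a string and returns a string, this version should be usable in place of the replace_words function called from traverse.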
