This is my HTML:
<html>
<body>
<h2>Pizza</h2>
<p>This is some random paragraph without child tags.</p>
<p>Delicious homebaked pizza.<br><em>$8.99 pp</em></p>
<h2>Eggplant Parmesan</h2>
<p>Try the authentic <i>Italian flavor</i> of baked aubergine.<br><em>$6.99 pp</em></p>
<h2>Italian Ice Cream</h2>
<p>Our dessert specialty.<br><em>$3.99 pp</em></p>
</body>
</html>
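Using BeautifulSoup, I want to take the text displayed by the h2 and p tags, replace it in the tree with a prefixed version, and print it to the screen. For the h2 tags, this works fine: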
from bs4 import BeautifulSoup
with open("/var/www/html/Test/index.html", "r") as f:
    soup = BeautifulSoup(f, "lxml")
f = open("/var/www/html/Test/I18N_index.html", "w+")
for h2 in soup.find_all('h2'):
    i18n_string = "I18N_"+h2.string
    h2.string.replace_with(i18n_string)
    print(h2.string)
f.write(str(soup))
###Output:##############################################
# $ python ./test.py
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
########################################################
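In my I18N_index.html, all three strings are correctly prefixed with 'I18N_'.
However, some of my p tags contain child tags, and for those p.string returns None. As a result, the concatenation no longer works: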
for p in soup.find_all('p'):
    i18n_string = "I18N_"+p.string
    p.string.replace_with(i18n_string)
    print(p.string)
f.write(str(soup))
###Output:##################################################
# $ python ./test.py
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# I18N_This is some random paragraph without child tags.
# Traceback (most recent call last):
# File "./test.py", line 15, in <module>
# i18n_string = "I18N_"+p.string
# TypeError: cannot concatenate 'str' and 'NoneType' objects
############################################################
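I learned about the join function from this thread. It lets me do the concatenation and print the resulting string to the screen, but not replace the string in the soup tree: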
for p in soup.find_all('p'):
    joined = ''.join(p.strings)
    i18n_string = "I18N_"+joined
    #joined.replace_with(i18n_string)
    print (i18n_string)
###Output with 'joined.replace_with(i18n_string)' DISABLED:###
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# I18N_This is some random paragraph without child tags.
# I18N_Delicious homebaked pizza.$8.99 pp
# I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
# I18N_Our dessert specialty$3.99 pp
############################################################
###Output with 'joined.replace_with(i18n_string)' ENABLED:#####
# I18N_Pizza
# I18N_Eggplant Parmesan
# I18N_Italian Ice Cream
# Traceback (most recent call last):
# File "./test.py", line 41, in <module>
# joined.replace_with(i18n_string)
# AttributeError: 'unicode' object has no attribute 'replace_with'
############################################################
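That thread also mentions another solution based on isinstance, but I could not get it to work.
If I understand correctly, join concatenates the strings but returns a "unicode" object rather than a string object, which is why the "replace_with" attribute does not work. How can I solve this? Any help is much appreciated.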
Answer 0 (score: 2)
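The replace_with() method fails not because joined is a unicode object, but because replace_with() is a method that only bs4 objects have. See: BeautifulSoup-replace_with
Incidentally, join() returns a str in Python 3 (see: python3-join); the "unicode" in your traceback is simply the Python 2 text type, which has no replace_with() either.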
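The two points above can be checked with a small sketch. It is only an illustration: the markup is shortened from the question, the variable names are made up for this example, and it uses the built-in "html.parser" instead of "lxml" so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: not the asker's full file).
html = "<p>No child tags.</p><p>Pizza.<br><em>$8.99 pp</em></p>"
soup = BeautifulSoup(html, "html.parser")
plain, nested = soup.find_all("p")

# A tag with exactly one text child: .string returns that text node.
plain_string = plain.string

# A tag with several children (text, <br>, <em>): .string is None.
nested_string = nested.string

# .strings still yields every text node; joining them gives an ordinary str.
joined = "".join(nested.strings)

# replace_with() lives on bs4 tree nodes, not on plain Python strings.
str_has_method = callable(getattr(joined, "replace_with", None))
tag_has_method = callable(getattr(nested, "replace_with", None))

print(repr(plain_string), nested_string, repr(joined))
print(str_has_method, tag_has_method)
```

So the joined text can be read and printed, but any replacement has to be requested on a node that is still part of the tree.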
As for a solution, I simply dropped the .string after p, so that replace_with() is called on the p tag itself:
from bs4 import BeautifulSoup
with open("index.html", "r") as f:
    soup = BeautifulSoup(f, "lxml")
f = open("I18N_index.html", "w+")
for h2 in soup.find_all('h2'):
    i18n_string = "I18N_"+h2.string
    h2.string.replace_with(i18n_string)
    print(h2.string)
for p in soup.find_all('p'):
    joined = ''.join(p.strings)
    i18n_string = "I18N_"+joined
    p.replace_with(i18n_string)
    print (i18n_string)
f.write(str(soup))
Output:
I18N_Pizza
I18N_Eggplant Parmesan
I18N_Italian Ice Cream
I18N_This is some random paragraph without child tags.
I18N_Delicious homebaked pizza.$8.99 pp
I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
I18N_Our dessert specialty.$3.99 pp
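One thing to keep in mind: p.replace_with(i18n_string) swaps the whole p element for bare text, so the p, i and em tags do not appear in the written file. If the markup should survive, one possible alternative, along the lines of the isinstance idea mentioned in the question, is to prefix only the first text node inside each p. The following is a rough sketch, not a definitive implementation: prefix_first_text is a hypothetical helper, the markup is shortened, and "html.parser" stands in for "lxml".

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample markup (assumption: not the asker's full file).
html = (
    "<p>Plain paragraph.</p>"
    "<p>Try the <i>Italian flavor</i>.<br><em>$6.99 pp</em></p>"
)
soup = BeautifulSoup(html, "html.parser")


def prefix_first_text(tag, prefix="I18N_"):
    """Prefix the first non-blank text node under tag; child tags stay as they are."""
    for node in tag.descendants:
        if isinstance(node, NavigableString) and node.strip():
            # Replace just this text node, then stop walking the tree.
            node.replace_with(prefix + str(node))
            return


for p in soup.find_all("p"):
    prefix_first_text(p)

result = str(soup)
print(result)
```

Here each p keeps its children, and only the leading text gets the prefix. Whether that is the right granularity for translation strings depends on how the prefixed text is consumed later.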
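Answer 1 (score: 1)
Using a simplified version of your code (i.e. one that addresses only the p tag problem), it seems you just have to replace p.string with p.text: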
soup = BeautifulSoup([your html], "lxml")
for p in soup.find_all('p'):
    print('before: ',p.text)
    i18n_string = "I18N_"+p.text
    print('after ',i18n_string)
Output:
before: This is some random paragraph without child tags.
after I18N_This is some random paragraph without child tags.
before: Delicious homebaked pizza.$8.99 pp
after I18N_Delicious homebaked pizza.$8.99 pp
before: Try the authentic Italian flavor of baked aubergine.$6.99 pp
after I18N_Try the authentic Italian flavor of baked aubergine.$6.99 pp
before: Our dessert specialty.$3.99 pp
after I18N_Our dessert specialty.$3.99 pp
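The loop above only prints the prefixed text; the tree itself is unchanged. If the result should also be written back, one option is to assign to p.string, which replaces everything inside the tag with a single text node. A minimal sketch, assuming it is acceptable to lose the br and em children, with shortened markup and "html.parser" in place of "lxml":

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: not the asker's full file).
html = "<p>Our dessert specialty.<br><em>$3.99 pp</em></p>"
soup = BeautifulSoup(html, "html.parser")

for p in soup.find_all("p"):
    # Assigning to .string discards the <br> and <em> children and
    # leaves one text node inside the <p>.
    p.string = "I18N_" + p.text

result = str(soup)
print(result)
```

The p element itself is kept here, but its inner markup is flattened into plain text.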