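Hi, I want to scrape the text between tags. Below I've attached part of the source I'm trying to scrape. If you look closely, there are three ul tags. The first ul tag has class="listGroup". I'm trying to extract the text of the second ul tag, the one that comes right after the ul tag with the "listGroup" class. Please share how I can do this.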
<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=FamilyHealth-Book&utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=HealthLetter-Digital&utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>
I've attached the short script I have so far. Please help.

import requests
import pandas
from bs4 import BeautifulSoup

for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    # Prints every ul on the page, not just the one after ul.listGroup
    for each in soup.find_all("ul"):
        print(each)
Answer 0 (score: 1)
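This looks like a natural use case for CSS selectors. Namely:

ul.listGroup + ul li

selects all li tags inside the first ul tag that immediately follows each ul tag with class "listGroup". Replacing "+" with "~" selects all li tags inside every ul tag (2 in this case) that follows, as a later sibling, each ul tag with class "listGroup".

To implement this in your script, replace find_all with select and pass the relevant CSS selector:

import requests
import pandas
from bs4 import BeautifulSoup

for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    # "+" matches only the ul directly after ul.listGroup
    for each in soup.select("ul.listGroup + ul li"):
        print(each.text)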
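To make the difference between the "+" and "~" combinators concrete, here is a minimal sketch that runs offline. The HTML is a trimmed, hypothetical stand-in for the markup in the question (the list items are shortened), so treat it as an illustration rather than the real page.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the question's markup: one ul.listGroup
# followed by two plain <ul> siblings.
html = """
<ul class="listGroup"><li>Book</li></ul>
<ul><li>Osteoporosis</li><li>Kidney stones</li></ul>
<ul><li>Adenoma</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "+" (adjacent sibling): only the <ul> directly after ul.listGroup.
adjacent = [li.get_text(strip=True) for li in soup.select("ul.listGroup + ul li")]

# "~" (general sibling): every later <ul> sibling of ul.listGroup.
following = [li.get_text(strip=True) for li in soup.select("ul.listGroup ~ ul li")]

print(adjacent)
print(following)
```

Both combinators only consider elements that share a parent with ul.listGroup, so on the live page the result depends on the lists actually being siblings there.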
Answer 1 (score: 0)
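You can use the CSS selector ul.listGroup + ul li. This selects all <li> tags of the <ul> tag that sits right after a <ul> tag with class "listGroup":

from bs4 import BeautifulSoup
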
txt = '''<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=FamilyHealth-Book&utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=HealthLetter-Digital&utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>'''

soup = BeautifulSoup(txt, 'html.parser')

# Print the text of each li in the ul directly after ul.listGroup
for li in soup.select('ul.listGroup + ul li'):
    print(li.text)
Prints:
Osteoporosis
Kidney stones
Excessive urination
Abdominal pain
Tiring easily or weakness
Depression or forgetfulness
Bone and joint pain
Frequent complaints of illness with no apparent cause
Nausea, vomiting or loss of appetite
Answer 2 (score: 0)
Maybe you should consider using a regular expression to capture it.
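As a rough sketch of what that could look like, assuming the markup is as regular as the snippet in the question: the HTML below is a trimmed, hypothetical stand-in, and the pattern relies on the target list being a bare <ul> placed directly after the "listGroup" one. It would likely need adjusting if the page nests lists or adds attributes.

```python
import re

# Trimmed, hypothetical stand-in for the question's markup.
html = """<ul class="listGroup" id="ul_example"><li><a href="#">Book</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
</ul>
<ul>
<li>Adenoma</li>
</ul>"""

# Capture the body of the bare <ul> that directly follows ul.listGroup.
block = re.search(
    r'<ul class="listGroup".*?</ul>\s*<ul>(.*?)</ul>',
    html,
    flags=re.DOTALL,
)

# Pull the text of each <li> out of the captured body.
items = re.findall(r"<li>(.*?)</li>", block.group(1), flags=re.DOTALL) if block else []
print(items)
```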