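I am trying to scrape English questions from a website with Python (I obtained permission to do so beforehand); I am using BeautifulSoup.
The English questions are nested between the tags <div class="question_body"> and </div>. So here is the Python code I wrote to extract all of the English questions: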
import requests
import pandas as pd
from bs4 import BeautifulSoup

for p in range(1,10):
    web_page = requests.get('https://www.helpteaching.com/search/index.htm?grade=90&question_type=1&keyword=&entity=7&pageNum={}'.format(p))
    # Parse web_page
    soup = BeautifulSoup(web_page.text, 'html.parser')
    # Create set of results based on HTML tags with desired data
    results = soup.find_all('div', attrs={'class':'question_body'})
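However, the simple code above has a problem: I do not want to scrape anything that belongs to a "group question". The content of group questions (a set of different questions based on the same passage) is also nested between <div class="question_body"> and </div>. The difference between group questions and non-group questions is that, in the HTML source, a group question is preceded by: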
<p class="group_instructions">
This question is a part of a group with common instructions.
<a style="text-decoration:underline;" href="/groups/4913/making-bread">View group »</a>
</p>
For example, here is the HTML source of one of the group questions on the site:
<p class="group_instructions">
This question is a part of a group with common instructions.
<a style="text-decoration:underline;" href="/groups/4913/making-bread">View group »</a>
</p>
<div class="question_body">
<a href="/questions/128621/which-is-not-an-ingredient-the-mother-put-in-the-bread">Which is NOT an ingredient the mother put in the bread?</a>
<ol>
<li class="answer correct">
Sugar
</li>
<li class="answer">
Salt
</li>
<li class="answer">
Yeast
</li>
<li class="answer">
Flour
</li>
</ol>
</div>
</div>
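Note that <p class="group_instructions"> comes before <div class="question_body">.
Non-group questions are not preceded by a block starting with <p class="group_instructions">.
Is there any way to avoid scraping the group questions? I do not need to stick with BeautifulSoup if something else is required.
Thanks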
Answer 0 (score: 0)
If you have to select nodes that do not contain certain tags, I think XPath is easier to use. If you are open to it, here is a solution using lxml.
import requests
from lxml import html

web_page = requests.get('https://www.helpteaching.com/search/index.htm?grade=90&question_type=1&keyword=&entity=7&pageNum=1')
# Parse the response body into an lxml element tree
tree = html.fromstring(web_page.content)
# Keep only questions whose container has no <p> node
# (group questions carry <p class="group_instructions">)
results = tree.xpath('//div[@class="question"][not(.//p)]/div[@class="question_body"]/a/text()')
for result in results:
    print(result)
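If you would rather stay with BeautifulSoup, the same filter can probably be written by checking whether a question body has a preceding <p class="group_instructions"> sibling. The following is only a sketch under some assumptions: the sample markup is modelled on the snippet in the question, the outer <div class="question"> wrapper is taken from the XPath above rather than verified against the live page, and the second question, its URL, and the function name non_group_questions are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question. The <div class="question"> wrapper
# is an assumption, and the second (standalone) question is hypothetical.
SAMPLE_HTML = """
<div class="question">
  <p class="group_instructions">
    This question is a part of a group with common instructions.
    <a href="/groups/4913/making-bread">View group</a>
  </p>
  <div class="question_body">
    <a href="/questions/128621/which-is-not-an-ingredient-the-mother-put-in-the-bread">Which is NOT an ingredient the mother put in the bread?</a>
  </div>
</div>
<div class="question">
  <div class="question_body">
    <a href="/questions/1/example">A standalone question?</a>
  </div>
</div>
"""


def non_group_questions(page_html):
    """Return the link text of every question that is not part of a group."""
    soup = BeautifulSoup(page_html, "html.parser")
    questions = []
    for body in soup.find_all("div", class_="question_body"):
        # A group question has a <p class="group_instructions"> sibling
        # before its question_body inside the same container.
        if body.find_previous_sibling("p", class_="group_instructions"):
            continue
        link = body.find("a")
        if link is not None:
            questions.append(link.get_text(strip=True))
    return questions


print(non_group_questions(SAMPLE_HTML))
```

This depends on each question sitting in its own container element. If all questions were direct siblings on the real page, find_previous_sibling would also match a group_instructions block belonging to an earlier question, so check the actual markup before relying on it.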