Question

I am trying to scrape English questions from a website with Python (I obtained permission to do so beforehand); I am using BeautifulSoup.

The English questions are nested between the tags <div class="question_body"> and </div>. So below is the Python code I wrote to extract all of the English questions:

import requests
import pandas as pd
from bs4 import BeautifulSoup

for p in range(1,10):
    web_page = requests.get('https://www.helpteaching.com/search/index.htm?grade=90&question_type=1&keyword=&entity=7&pageNum={}'.format(p))

    # Parse web_page
    soup = BeautifulSoup(web_page.text, 'html.parser')

    # Create set of results based on HTML tags with desired data
    results = soup.find_all('div', attrs={'class':'question_body'})

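However, the simple code above has a problem: I do not want to scrape anything that belongs to a "group question" (a set of different questions based on the same passage or instructions). The content of group questions is also nested between <div class="question_body"> and </div>. The difference between a group question and a non-group question is that, in the HTML source, a group question is preceded by: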

            <p class="group_instructions">
                This question is a part of a group with common instructions.
                <a style="text-decoration:underline;" href="/groups/4913/making-bread">View group &raquo;</a>
            </p>

For example, here is the HTML source of one of the group questions on the website:

            <p class="group_instructions">
                This question is a part of a group with common instructions.
                <a style="text-decoration:underline;" href="/groups/4913/making-bread">View group &raquo;</a>
            </p>

        <div class="question_body">


        <a href="/questions/128621/which-is-not-an-ingredient-the-mother-put-in-the-bread">Which is NOT an ingredient the mother put in the bread?</a>
            <ol>

                    <li class="answer correct">
                        Sugar               
                    </li>

                    <li class="answer">
                        Salt    
                    </li>

                    <li class="answer">
                        Yeast
                    </li>

                    <li class="answer">
                        Flour    
                    </li>        
            </ol>              
        </div>
    </div>

Note that <p class="group_instructions"> comes before <div class="question_body">. Non-group questions are not preceded by a block starting with <p class="group_instructions">.

Is there any way to avoid scraping the group questions? I do not have to stick with BeautifulSoup if something else is needed.

Thanks

Answer 1

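If you need to select nodes that do not contain certain tags, I think XPath is easier to work with. In case you are open to it, here is a solution using lxml.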

import requests
from lxml import html

web_page = requests.get('https://www.helpteaching.com/search/index.htm?grade=90&question_type=1&keyword=&entity=7&pageNum=1')

# Parse the response body into an lxml element tree
tree = html.fromstring(web_page.text)

# Keep only questions whose <div class="question"> container has no <p>
# descendant, i.e. no group_instructions block, and take the link text.
results = tree.xpath('//div[@class="question"][not(.//p)]/div[@class="question_body"]/a/text()')

for result in results:
    print(result)
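
If you would rather stay with BeautifulSoup, the same exclusion can probably be expressed by checking each question_body for a preceding group_instructions sibling. The following is only a minimal sketch that runs against an inline sample rather than the live site. It assumes the <p class="group_instructions"> block sits in the same parent element as the <div class="question_body">, and the wrapping <div class="question">, the second sample question, and the function name ungrouped_questions are hypothetical, introduced just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the markup quoted in the question.
# The wrapping <div class="question"> is an assumption taken from the
# XPath expression above; the second question is invented for the demo.
SAMPLE_HTML = """
<div class="question">
    <p class="group_instructions">
        This question is a part of a group with common instructions.
    </p>
    <div class="question_body">
        <a href="/questions/128621/which-is-not-an-ingredient-the-mother-put-in-the-bread">Which is NOT an ingredient the mother put in the bread?</a>
    </div>
</div>
<div class="question">
    <div class="question_body">
        <a href="/questions/1/example">Which word is a noun?</a>
    </div>
</div>
"""


def ungrouped_questions(markup):
    """Return the link text of questions with no group_instructions sibling."""
    soup = BeautifulSoup(markup, "html.parser")
    questions = []
    for body in soup.find_all("div", class_="question_body"):
        # Skip bodies that have a <p class="group_instructions"> earlier
        # among their siblings, i.e. inside the same parent element.
        marker = body.find_previous_sibling("p", class_="group_instructions")
        if marker is not None:
            continue
        link = body.find("a")
        if link is not None:
            questions.append(link.get_text(strip=True))
    return questions


print(ungrouped_questions(SAMPLE_HTML))
```

To use it on the real pages, pass web_page.text from the requests call to the function. If the site nests the group_instructions block differently from what is assumed here, the sibling check would need to be adjusted.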
