Question

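I am cataloguing an entire article management system that stores thousands of articles. My script works, but the problem is that BeautifulSoup and requests take a long time to determine whether a page is an actual article or an "article not found" page. I have roughly 4,000 articles, and by my calculation the script would take days to finish.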

for article_url in edit_article_list:
    article_edit_page = s.get(article_url, data=payload).text
    article_edit_soup = BeautifulSoup(article_edit_page, 'lxml')

    # Section: the sub-menu <select> is only present on real article pages
    sub_menu = article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"})
    if sub_menu is None:
        continue
    for thing in sub_menu.find_all("option", {"selected": "selected"}):
        f.write(thing.get_text(strip=True) + "\t")

The first if determines whether the URL is good or bad. edit_article_list is built like this:

for count in range(87418,307725):
    edit_article_list.append(login_url + "AddEditArticle.aspx?ArticleID=" + str(count))

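My script currently requests both the bad and the good URLs and only then scrapes the content. Is there a way to use requests to get just the valid URLs of this pattern while building the URL list?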

Answer 1

To skip articles that do not exist, disallow redirects and check the status code:

for article_url in edit_article_list:
    # Reuse the session s from the question so the request stays logged in
    r = s.get(article_url, data=payload, allow_redirects=False)
    if r.status_code != 200:
        # A redirect (3xx) or an error means there is no article at this ID
        continue
    article_edit_soup = BeautifulSoup(r.text, 'lxml')

    # Section
    sub_menu = article_edit_soup.find("select", {"name": "ctl00$ContentPlaceHolder1$fvArticle$ddlSubMenu"})
    if sub_menu is None:
        continue
    for thing in sub_menu.find_all("option", {"selected": "selected"}):
        f.write(thing.get_text(strip=True) + "\t")
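
The status check can be tried out without touching the server. The sketch below assumes, as this answer does, that the site responds to a missing ArticleID with a redirect rather than a plain 200. FakeResponse and the sample status codes are made up for illustration and only mimic the status_code attribute of a real requests response.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for requests.Response (only status_code is mimicked)."""
    status_code: int


def is_existing_article(response):
    # With allow_redirects=False a redirect stays visible as a 3xx status
    # instead of being followed to a page that answers 200.
    return response.status_code == 200


# Made-up sample: one real article, one redirect, one 404.
responses = {
    87418: FakeResponse(200),
    87419: FakeResponse(302),
    87420: FakeResponse(404),
}

valid_ids = [article_id for article_id, r in responses.items()
             if is_existing_article(r)]
print(valid_ids)
```

If the site instead returns 200 for its "article not found" page, this check alone will not help, and the select lookup in the loop is still needed.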

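I do recommend parsing the site's article list page to get the actual URLs, though. You are currently firing off more than 200,000 requests to find only about 4,000 articles, which is a lot of overhead and traffic and not very efficient!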
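
A minimal sketch of that idea follows. It assumes the system has some list page whose links point at AddEditArticle.aspx?ArticleID=...; the sample HTML, the example.com base URL, and the helper names are hypothetical. It uses only the standard library so that it runs on its own, and the same filtering could be done with BeautifulSoup on a page fetched through the logged-in session.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ARTICLE_MARKER = "AddEditArticle.aspx?ArticleID="  # assumed link pattern


class ArticleLinkParser(HTMLParser):
    """Collects the href of every <a> tag that points at an article edit page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if ARTICLE_MARKER in href:
            # Resolve relative links against the page they came from
            self.urls.append(urljoin(self.base_url, href))


def extract_article_urls(list_page_html, base_url):
    parser = ArticleLinkParser(base_url)
    parser.feed(list_page_html)
    return parser.urls


# Hypothetical list page; in practice this would be the text of a real response.
sample_html = """
<table>
  <tr><td><a href="AddEditArticle.aspx?ArticleID=87418">Edit</a></td></tr>
  <tr><td><a href="AddEditArticle.aspx?ArticleID=90012">Edit</a></td></tr>
  <tr><td><a href="Help.aspx">Help</a></td></tr>
</table>
"""

edit_article_list = extract_article_urls(sample_html, "https://example.com/admin/")
print(edit_article_list)
```

Built this way, edit_article_list contains only URLs the site itself links to, so the loop makes one request per real article instead of one per candidate ID.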

How to extract valid URLs of a similar pattern?
