I've been practicing my web-scraping skills and ran into a problem that involves filtering 10 links by scraping a paragraph tag with id="secret-word". I figured out how to do it for the first link, so I thought the best approach would be to loop through the rest and scrape them all in one go.
Here is the website I'm pulling the links from (they're at the bottom of the page): https://keithgalli.github.io/web-scraping/webpage.html
Here is the code I came up with, but I can't figure out how to grab element 0 (n = 0) without handling it separately from the while loop. Maybe it can't be done?
I pulled out a list of all the links:
new_listc = links_with_text1[19:]
['challenge/file_1.html',
'challenge/file_2.html',
'challenge/file_3.html',
'challenge/file_4.html',
'challenge/file_5.html',
'challenge/file_6.html',
'challenge/file_7.html',
'challenge/file_8.html',
'challenge/file_9.html',
'challenge/file_10.html']
I used requests to connect to each site:
t = [requests.get(f"https://keithgalli.github.io/web-scraping/{url}", timeout=5) for url in new_listc]
I looped over all the links with Beautiful Soup and pulled the secret word from each one to build a list of words. I'm wondering whether there is a cleaner way to do this, and why do I have to handle the first challenge file outside the loop?!
n = 0
tsoup = bs(t[n].content)
test_soup = tsoup.select("p#secret-word")
#print(n)
x = [t.text for t in test_soup]
print(x)

while n in range(0, 9):
    n += 1
    #print(n)
    tsoup = bs(t[n].content)
    test_soup = tsoup.select("p#secret-word")
    x = [t.text for t in test_soup]
    print(x)
    #print(tsoup.prettify())
    if n > 9:
        break
['Make']
['sure']
['to']
['smash']
['that']
['like']
['button']
['and']
['subscribe']
['!!!']
Answer 0 (score: 1)
Honestly, there's really nothing wrong with your approach, and by asking for a "cleaner way" you're asking about people's preferences, IMHO. So, here's mine.
import requests
from bs4 import BeautifulSoup


def get_content(number: int) -> str:
    url = f"https://keithgalli.github.io/web-scraping/challenge/file_{number}.html"
    soup = BeautifulSoup(
        requests.get(url).text, "html.parser"
    ).select_one("p#secret-word")
    return soup.getText(strip=True)


print(" ".join(get_content(number) for number in range(1, 11)))
Output:
Make sure to smash that like button and subscribe !!!
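As a side note (not part of the original answer): since all ten requests go to the same host, reusing a single requests.Session keeps the connection alive between requests. A minimal sketch of that variation:

import requests
from bs4 import BeautifulSoup


def get_content(session: requests.Session, number: int) -> str:
    # Same logic as above, but the shared session reuses the connection to the host.
    url = f"https://keithgalli.github.io/web-scraping/challenge/file_{number}.html"
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    return soup.select_one("p#secret-word").getText(strip=True)


with requests.Session() as session:
    print(" ".join(get_content(session, number) for number in range(1, 11)))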
The answer above assumes there are 10 pages to loop over, but if you need to scrape the main page first, you can try the following:
import requests
from bs4 import BeautifulSoup

the_url = "https://keithgalli.github.io/web-scraping/webpage.html"


def make_soup_first(url: str) -> BeautifulSoup:
    return BeautifulSoup(requests.get(url).text, "html.parser")


def get_follow_links(main_link: str) -> list:
    soup = make_soup_first(main_link)
    return [
        a["href"] for a in soup.find_all(
            lambda t: t.name == "a" and "File" in t.text
        )
    ]


def get_content(follow_link: str) -> str:
    url = f"https://keithgalli.github.io/web-scraping/{follow_link}"
    return make_soup_first(url).select_one("p#secret-word").getText(strip=True)


print(" ".join(get_content(link) for link in get_follow_links(the_url)))
This gives the same output as above:
Make sure to smash that like button and subscribe !!!
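A note on the link-filtering step in get_follow_links: the lambda passed to find_all keeps only the <a> tags whose text contains "File". Since the hrefs in question all start with challenge/ (see the list in the question), a CSS attribute selector would be an equivalent way to pick them out; a small sketch reusing the make_soup_first helper from above:

def get_follow_links(main_link: str) -> list:
    # Assumes every relevant href begins with "challenge/", as in the question's list.
    soup = make_soup_first(main_link)
    return [a["href"] for a in soup.select('a[href^="challenge/"]')]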
Answer 1 (score: 1)
The reason you have to handle the first challenge file outside the loop is that the first line of the loop increments n by 1, so on its first iteration it accesses t[1] instead of t[0]. You could fix this by moving that line to the end of the loop, but a cleaner way is to use a for loop:
for response in t:
    tsoup = bs(response.content)
    test_soup = tsoup.select("p#secret-word")
    for secret_word in test_soup:
        print(secret_word.text)
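For completeness, a sketch of the while-loop fix mentioned above (moving the increment to the end of the body so t[0] is not skipped), under the same assumption that t holds the ten responses:

n = 0
while n < 10:
    tsoup = bs(t[n].content)
    test_soup = tsoup.select("p#secret-word")
    for secret_word in test_soup:
        print(secret_word.text)
    n += 1  # incrementing after processing means t[0] is handled inside the loop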