I've been practicing my web-scraping skills and ran into a problem that involves filtering 10 links by scraping a paragraph tag with id="secret-word". I figured out how to do it for the first link, so I thought the best approach would be to loop through the rest and scrape them all in one go.
Here is the website I'm pulling the links from (they're at the bottom of the page): https://keithgalli.github.io/web-scraping/webpage.html
Here is the code I came up with, but I can't figure out how to grab element 0 (n = 0) without handling it separately from the while loop. Maybe it can't be done?
I pulled out a list of all the links:
new_listc = links_with_text1[19:]
['challenge/file_1.html',
'challenge/file_2.html',
'challenge/file_3.html',
'challenge/file_4.html',
'challenge/file_5.html',
'challenge/file_6.html',
'challenge/file_7.html',
'challenge/file_8.html',
'challenge/file_9.html',
'challenge/file_10.html']
I used requests to connect to each site:
t = [requests.get(f"https://keithgalli.github.io/web-scraping/{url}", timeout=5) for url in new_listc]
I looped over all the links with Beautiful Soup and pulled the secret word from each one to build a list of words. I'm wondering whether there is a cleaner way to do this, and why do I have to handle the first challenge file outside the loop?!
n = 0
tsoup = bs(t[n].content)
test_soup = tsoup.select("p#secret-word")
#print(n)
x = [t.text for t in test_soup]
print(x)

while n in range(0, 9):
    n += 1
    #print(n)
    tsoup = bs(t[n].content)
    test_soup = tsoup.select("p#secret-word")
    x = [t.text for t in test_soup]
    print(x)
    #print(tsoup.prettify())
    if n > 9:
        break
['Make']
['sure']
['to']
['smash']
['that']
['like']
['button']
['and']
['subscribe']
['!!!']
Answer 0 (score: 1)
Honestly, there's really nothing wrong with your approach, and by asking for a "cleaner way" you're asking about people's preferences, IMHO. So, here's mine.
import requests
from bs4 import BeautifulSoup


def get_content(number: int) -> str:
    url = f"https://keithgalli.github.io/web-scraping/challenge/file_{number}.html"
    soup = BeautifulSoup(
        requests.get(url).text, "html.parser"
    ).select_one("p#secret-word")
    return soup.getText(strip=True)


print(" ".join(get_content(number) for number in range(1, 11)))
Output:
Make sure to smash that like button and subscribe !!!
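As a side note (not part of the original answer): since all ten requests go to the same host, reusing a single requests.Session keeps the connection alive between requests. A minimal sketch of that variation:

import requests
from bs4 import BeautifulSoup


def get_content(session: requests.Session, number: int) -> str:
    # Same logic as above, but the shared session reuses the connection to the host.
    url = f"https://keithgalli.github.io/web-scraping/challenge/file_{number}.html"
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    return soup.select_one("p#secret-word").getText(strip=True)


with requests.Session() as session:
    print(" ".join(get_content(session, number) for number in range(1, 11)))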
The answer above assumes there are 10 pages to loop over, but if you need to scrape the main page first, you can try the following:
import requests
from bs4 import BeautifulSoup

the_url = "https://keithgalli.github.io/web-scraping/webpage.html"


def make_soup_first(url: str) -> BeautifulSoup:
    return BeautifulSoup(requests.get(url).text, "html.parser")


def get_follow_links(main_link: str) -> list:
    soup = make_soup_first(main_link)
    return [
        a["href"] for a in soup.find_all(
            lambda t: t.name == "a" and "File" in t.text
        )
    ]


def get_content(follow_link: str) -> str:
    url = f"https://keithgalli.github.io/web-scraping/{follow_link}"
    return make_soup_first(url).select_one("p#secret-word").getText(strip=True)


print(" ".join(get_content(link) for link in get_follow_links(the_url)))
This gives the same output as above:
Make sure to smash that like button and subscribe !!!
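A note on the link-filtering step in get_follow_links: the lambda passed to find_all keeps only the <a> tags whose text contains "File". Since the hrefs in question all start with challenge/ (see the list in the question), a CSS attribute selector would be an equivalent way to pick them out; a small sketch reusing the make_soup_first helper from above:

def get_follow_links(main_link: str) -> list:
    # Assumes every relevant href begins with "challenge/", as in the question's list.
    soup = make_soup_first(main_link)
    return [a["href"] for a in soup.select('a[href^="challenge/"]')]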
Answer 1 (score: 1)
The reason you have to handle the first challenge file outside the loop is that the first line of the loop increments n by 1, so on its first iteration it accesses t[1] instead of t[0]. You could fix this by moving that line to the end of the loop, but a cleaner way is to use a for loop:
for response in t:
    tsoup = bs(response.content)
    test_soup = tsoup.select("p#secret-word")
    for secret_word in test_soup:
        print(secret_word.text)
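For completeness, a sketch of the while-loop fix mentioned above (moving the increment to the end of the body so t[0] is not skipped), under the same assumption that t holds the ten responses:

n = 0
while n < 10:
    tsoup = bs(t[n].content)
    test_soup = tsoup.select("p#secret-word")
    for secret_word in test_soup:
        print(secret_word.text)
    n += 1  # incrementing after processing means t[0] is handled inside the loop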