I'm using Python to extract links from a page:
for link in soup.find_all('a', href=True):
    if 'http' in link['href']:
        links.append(link['href'])
How can I build on this to open each link and extract the text from the <p> tags on the linked pages?
Answer 0 (score: 0)
You could change the way you gather the initial links, perhaps to something like this:
links = [a['href'] for a in soup.find_all('a', href=True) if 'http' in a['href']]
for link in links:
    # create a soup of the current link's HTML (requests and BeautifulSoup, as in the other answer)
    page = BeautifulSoup(requests.get(link).content, 'html.parser')
    # append any http links found on that page so the loop visits them as well
    for a in page.find_all('a', href=True):
        if 'http' in a['href']:
            links.append(a['href'])
It will then continue on to the newly appended links until it runs out of links.
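Note that if pages link back to each other, that loop has no natural stopping point. As a rough sketch (the seen set, the queue list, and the variable names are only illustrative), you could keep track of which URLs have already been fetched and skip repeats:
import requests
from bs4 import BeautifulSoup

seen = set()
queue = list(links)  # links gathered from the original soup
while queue:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    page = BeautifulSoup(requests.get(url).content, 'html.parser')
    # print the paragraph text of this page, as asked in the question
    for p in page.find_all('p'):
        print(p.text)
    # queue up any http links found on this page
    for a in page.find_all('a', href=True):
        if a['href'].startswith('http'):
            queue.append(a['href'])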
Answer 1 (score: 0)
You can use requests to fetch the HTML of the collected links and then parse it with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

# get links
links = []
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        links.append(link['href'])

# visit links and print paragraph text
for link in links:
    response = requests.get(link)
    page = BeautifulSoup(response.content, 'html.parser')
    for p in page.find_all('p'):
        print(p.text)
Or, without iterating over the links twice:
import requests
from bs4 import BeautifulSoup

# get links and print paragraph text in a single pass
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        response = requests.get(link['href'])
        page = BeautifulSoup(response.content, 'html.parser')
        for p in page.find_all('p'):
            print(p.text)
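In practice some of the collected URLs may be slow or unreachable. A small optional variation of the loop above (the 10-second timeout and the try/except handling are illustrative choices, not part of the original answer) skips failed requests instead of stopping on the first error:
import requests
from bs4 import BeautifulSoup

for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        try:
            # a short timeout keeps one dead host from stalling the whole loop
            response = requests.get(link['href'], timeout=10)
            response.raise_for_status()
        except requests.RequestException as exc:
            print('skipping', link['href'], exc)
            continue
        page = BeautifulSoup(response.content, 'html.parser')
        for p in page.find_all('p'):
            print(p.text)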