Question

I'm using Python to extract links from a page:

for link in soup.find_all('a', href=True):
    if 'http' in link['href']:
        links.append(link['href'])

How can I build something that opens each of those links and extracts the text from the "p" tags on the linked page?

Answer 1

You could change the way you collect the original links, perhaps to something like:

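import requests
from bs4 import BeautifulSoup

# start from the links on the first page (`soup` is the page already parsed in the question)
links = [a['href'] for a in soup.find_all('a', href=True) if 'http' in a['href']]

for link in links:
    # create soup of the current link's HTML
    page_soup = BeautifulSoup(requests.get(link).content, 'html.parser')
    for a in page_soup.find_all('a', href=True):
        if 'http' in a['href'] and a['href'] not in links:
            links.append(a['href'])  # new link, visited later by this same loop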

The loop then moves on to the newly added links and keeps going until no new links are left.
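
One way to keep that "continue until done" loop from revisiting pages or running without end is a queue plus a set of URLs already seen. This is only a rough sketch, not the answerer's code: `crawl`, `get_links` and `max_pages` are names made up here for illustration, and `get_links` stands for whatever function you write to fetch a page and return the links found on it (for example with requests and BeautifulSoup).

```python
from collections import deque


def crawl(start_urls, get_links, max_pages=50):
    """Visit pages breadth-first and return the URLs in the order visited.

    get_links is a caller-supplied function (hypothetical): url -> iterable of URLs.
    max_pages is an arbitrary safety cap so the crawl always terminates.
    """
    start = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    queue = deque(start)
    seen = set(start)
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for found in get_links(url):
            if found not in seen:  # only queue links we have not met before
                seen.add(found)
                queue.append(found)

    return visited
```

Passing `get_links` in as a parameter keeps the traversal separate from the fetching and parsing, so the loop can be tried out on a small in-memory link graph before pointing it at a real site.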

Answer 2

You can use requests to fetch the HTML of the links you collected, then parse it with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

# get links (`soup` is the page already parsed in the question)
links = []
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        links.append(link['href'])

# visit links and print the text of each paragraph
for link in links:
    response = requests.get(link)
    page_soup = BeautifulSoup(response.content, 'html.parser')

    for p in page_soup.find_all('p'):
        print(p.text)

Or, without iterating over the links twice:

import requests
from bs4 import BeautifulSoup

# get links and visit each one as soon as it is found
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        response = requests.get(link['href'])
        # use a separate name so the original page's soup is not overwritten
        page_soup = BeautifulSoup(response.content, 'html.parser')

        for p in page_soup.find_all('p'):
            print(p.text)
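
Note that checking `startswith('http')` skips relative links such as `/about` or `page.html`. If you want those too, one option is to resolve every href against the URL of the page it came from using the standard library. A minimal sketch, assuming you know that page's URL; `absolute_http_links` and `base_url` are hypothetical names, not part of requests or BeautifulSoup:

```python
from urllib.parse import urljoin, urlparse


def absolute_http_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep only http(s) URLs.

    Order is preserved and duplicates are dropped.
    """
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)  # relative -> absolute; absolute stays as-is
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

You could feed it the hrefs from the question's loop, e.g. `absolute_http_links(page_url, [a['href'] for a in soup.find_all('a', href=True)])`, where `page_url` is the address you originally downloaded.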

Following links with BeautifulSoup4
