Following links with BeautifulSoup4

Time: 2016-04-20 14:38:31

Tags: python-2.7 beautifulsoup

I am using Python to extract the links from a page:

for link in soup.find_all('a', href=True):
    if 'http' in link['href']:
        links.append(link['href'])

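How can I build something that opens each link and extracts the text from the "p" tags on the linked page?

2 answers:

Answer 0: (score: 0)

You could change how you collect the original links, perhaps to something like: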

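# start from the href strings, not the tag objects, so new URLs can be appended
links = [a['href'] for a in soup.find_all('a', href=True) if 'http' in a['href']]

for link in links:
    # code to create soup of the current link html, stored here as page_soup
    for a in page_soup.find_all('a', href=True):
        if 'http' in a['href'] and a['href'] not in links:
            links.append(a['href'])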

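The loop will then carry on to the newly added links until there are none left.

Answer 1: (score: 0)

You can use requests to fetch the HTML of the collected links, and then parse it with BeautifulSoup.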
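On a real site a loop like this may keep finding new links for a very long time, so it can help to cap the number of pages and remember which URLs were already queued. The following is only a minimal sketch of that idea, not the answerer's code: the function name `crawl`, the `max_pages` limit and the `fetch` callable (anything that takes a URL and returns HTML) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=10):
    """Visit pages breadth-first, following absolute http(s) links.

    `fetch` is a hypothetical callable: it takes a URL and returns HTML.
    """
    queue = [start_url]
    seen = set(queue)   # URLs already queued, so each one is visited once
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        page_soup = BeautifulSoup(fetch(url), 'html.parser')
        visited.append(url)

        for a in page_soup.find_all('a', href=True):
            href = a['href']
            if href.startswith('http') and href not in seen:
                seen.add(href)
                queue.append(href)

    return visited
```

With requests installed, it could be called as `crawl('http://example.com', lambda url: requests.get(url).text)`, where the URL is a placeholder.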

import requests
from bs4 import BeautifulSoup

# get links (soup is the BeautifulSoup object of the starting page)
links = []
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        links.append(link['href'])

# visit links and print the text of each paragraph
for link in links:
    response = requests.get(link)

    page_soup = BeautifulSoup(response.content, 'html.parser')

    for p in page_soup.find_all('p'):
        print(p.text)
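A side note on the link filter: `'http' in href` also matches a relative link such as `/go?to=http://other.example`, while `href.startswith('http')` does not, and both skip plain relative links like `/about`. If relative links should be followed too, one possible approach is to resolve every href against the page URL first. This is a sketch under assumptions: the helper name `absolute_links` is made up here, and the import is written for Python 3.

```python
from urllib.parse import urljoin  # on Python 2.7: from urlparse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url and keep only http(s) URLs."""
    resolved = (urljoin(base_url, href) for href in hrefs)
    return [url for url in resolved
            if url.startswith(('http://', 'https://'))]
```

The `hrefs` argument could be built from the page with `[a['href'] for a in soup.find_all('a', href=True)]`.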

Alternatively, fetch and parse each page inside the first loop, so the links are not iterated over twice:

import requests
from bs4 import BeautifulSoup

# get links, then fetch and parse each one straight away
for link in soup.find_all('a', href=True):
    if link['href'].startswith('http'):
        response = requests.get(link['href'])

        # separate name so the starting page's soup is not overwritten
        page_soup = BeautifulSoup(response.content, 'html.parser')

        for p in page_soup.find_all('p'):
            print(p.text)
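Some of the collected links may time out or return an error page, which would stop either loop above with an exception. The sketch below shows one way to guard against that while returning the paragraph text instead of printing it. The function names `paragraph_texts` and `fetch_paragraphs` and the 10 second timeout are assumptions for illustration, not part of the answer.

```python
import requests
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the stripped text of every <p> tag in an HTML string or bytes."""
    page_soup = BeautifulSoup(html, 'html.parser')
    return [p.get_text().strip() for p in page_soup.find_all('p')]


def fetch_paragraphs(url, timeout=10):
    """Download url and return its paragraph texts, or [] if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # treat 4xx/5xx responses as failures
    except requests.RequestException:
        return []
    return paragraph_texts(response.content)
```

Keeping the parsing in its own function means it can be tried on a literal HTML string without any network access.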