Web scraping: crawling and storing content in a DataFrame

Date: 2018-10-29 12:42:06

Tags: pandas dataframe web-scraping beautifulsoup

The following code reproduces the web-scraping task for three example URLs:

Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)

def scrape_content(url):

    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html,"lxml")

    # Get page title
    title = soup.find("meta",attrs={"property":"og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class":"section-content"}).find_all('p')

    print(title)

    for p in content:
        p = p.get_text(strip=True)
        print(p)

Applying the scrape to each URL:

urls['url'].apply(scrape_content)

Output:

Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report

0    None
1    None
2    None
Name: url, dtype: object
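The trailing `0 None / 1 None / 2 None` rows are the return value of `apply`: `scrape_content` only prints and has no `return` statement, so every call implicitly returns `None`. A minimal sketch of the same effect (the function name here is made up for illustration):

```python
import pandas as pd

def prints_only(x):
    # No return statement: the function implicitly returns None
    print(x)

# The printed text is a side effect; the resulting Series holds only None values
s = pd.Series(['a', 'b']).apply(prints_only)
print(s.isna().all())  # True
```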

Questions:

  1. The code currently only outputs the content of the first paragraph block per page. I would like to get the data for every p within the given selector.
  2. For the final data, I need a data frame containing the URL, title, and content. So I would like to know how to write the scraped information into a data frame.

Thanks for your help.

1 Answer:

Answer (score: 1)

Your problem is in this line:

content = soup.find("div", {"class":"section-content"}).find_all('p')

find_all() is getting all the <p> tags, but only inside the result of .find(), which returns just the first element matching the criteria. So you are getting all the <p> tags within the first div.section-content. It's not clear exactly what the right criteria are for your use case, but if you just want all the <p> tags on the page, you can use:

content = soup.find_all('p')
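The difference between find() and find_all() is easy to see on a small, made-up HTML snippet. If you want the paragraphs from every section-content block (rather than the whole page), a CSS selector via select() is one option:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = """
<div class="section-content"><p>first</p><p>second</p></div>
<div class="section-content"><p>third</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching div, so only its paragraphs are returned
first_div_only = soup.find("div", {"class": "section-content"}).find_all("p")
print([p.get_text() for p in first_div_only])   # ['first', 'second']

# A CSS selector collects <p> tags from every matching div
all_matching = soup.select("div.section-content p")
print([p.get_text() for p in all_matching])     # ['first', 'second', 'third']
```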

Then you can have scrape_content() join the <p> tag text and return it along with the title:

content = '\r'.join([p.get_text(strip=True) for p in content])
return title, content

Outside the function, you can build the data frame:

url_list = urls['url'].tolist()
results = [scrape_content(url) for url in url_list]
title_list = [r[0] for r in results]
content_list = [r[1] for r in results]
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})
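The assembly step can be sketched with stand-in results; the tuples below are placeholders for what scrape_content() would return, not real scraped data. zip(*results) is a compact alternative to the two list comprehensions:

```python
import pandas as pd

# Placeholder (title, content) pairs standing in for scrape_content() output
results = [
    ('Education', 'Every child is born full of creativity.'),
    ('Business', 'Apple products have always been designed for the way we work.'),
]
url_list = ['https://www.apple.com/education/', 'https://www.apple.com/business/']

# zip(*results) transposes the list of tuples into parallel title/content columns
title_list, content_list = zip(*results)
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})
print(df[['url', 'title']])
```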