Storing web scraping results in a DataFrame or dictionary

Time: 2018-08-02 18:19:42

Tags: python dictionary dataframe web-scraping beautifulsoup

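I am taking an online course and am trying to automate the process of capturing the course structure for my personal notes, which I keep in a Markdown file.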

Here is an example chapter:

[Image: example chapter]

Here is a sample of what the HTML looks like:

  <!-- Header of the chapter -->
  <div class="chapter__header">
      <div class="chapter__title-wrapper">
        <span class="chapter__number">
          <span class="chapter-number">1</span>
        </span>
        <h4 class="chapter__title">
          Introduction to Experimental Design
        </h4>
          <span class="chapter__price">
            Free
          </span>
      </div>
      <div class="dc-progress-bar dc-progress-bar--small chapter__progress">
        <span class="dc-progress-bar__text">0%</span>
        <div class="dc-progress-bar__bar chapter__progress-bar">
          <span class="dc-progress-bar__fill" style="width: 0%;"></span>
        </div>
      </div>
  </div>
  <p class="chapter__description">
    An introduction to key parts of experimental design plus some power and sample size calculations.
  </p>
  <!-- !Header of the chapter -->

<!-- Body of the chapter -->
  <ul class="chapter__exercises hidden">
      <li class="chapter__exercise ">
        <a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
          <span class="chapter__exercise-icon exercise-icon ">
            <img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
          </span>
          <h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
          <span class="chapter__exercise-xp">
            50 xp
          </span>
</a>      </li>

So far, I have used BeautifulSoup to extract all the relevant information:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

lesson_outline = soup.find_all(['h4', 'li'])

outline_list = []

for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            outline_list.append(item.text.strip())
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            outline_list.append(lesson_name)
            outline_list.append(lesson_link)
    except KeyError:
        pass

This gives me a list like this:

['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]

My goal is to put everything into a single .md file that looks like this:

# Introduction to Experimental Design

* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* [A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)

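My question is: what is the best way to structure the data so that it is easy to access later when writing the text file? Would it be better to have a DataFrame with chapter, lesson, and lesson_link columns? A DataFrame with a MultiIndex? A nested dictionary? If a dictionary, what should I name the keys? Or is there another option I am missing? Some kind of database?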

Any ideas would be greatly appreciated!

1 Answer:

Answer 0: (score: 0)

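If I understand correctly, you are currently appending every element to the list outline_list in sequence. But you clearly have not one but three different kinds of data: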

  • chapter__title
  • chapter__exercise.name
  • chapter__exercise.link

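Each title can have multiple exercises, and each exercise is always a pair of name and link. Since you also want to preserve this structure in the text file, you can use any data structure that represents this hierarchy. An example: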

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

lesson_outline = soup.find_all(['h4', 'li'])

# Using OrderedDict ensures that the order of the result is the same as in the source
chapters = OrderedDict()   # {chapter: [(lesson_name, lesson_link), ...], ...}

for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            chapter = item.text.strip()
            chapters[chapter] = []
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            chapters[chapter].append((lesson_name, lesson_link))
    except KeyError:
        pass
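
For illustration, once populated with the two lessons shown in the question, the structure built above would look roughly like this. It is only a sketch with hand-written sample data, not actual scraped output. It also uses a plain dict rather than OrderedDict, since a plain dict preserves insertion order on Python 3.7+.

```python
# Hand-written sample data taken from the question's example, not scraped output.
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')

# A plain dict keeps insertion order on Python 3.7+.
chapters = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

# Look up the lessons of one chapter and unpack each (name, link) pair.
for lesson_name, lesson_link in chapters['Introduction to Experimental Design']:
    print(lesson_name, '->', lesson_link)
```

The chapter title serves as the key, so no extra key names are needed, and each lesson stays paired with its link.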

From there, writing the text file should be straightforward:

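with open('outline.md', 'w', encoding='utf-8') as f:  # file name is an assumption
    for chapter, lessons in chapters.items():
        # write chapter title as a Markdown heading
        f.write(f'# {chapter}\n\n')
        for lesson_name, lesson_link in lessons:
            # write lesson as a bullet with a link
            f.write(f'* [{lesson_name}]({lesson_link})\n')
        f.write('\n')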
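
If a tabular layout with the chapter, lesson, and lesson_link columns mentioned in the question turns out to be more convenient, the nested structure can be flattened into one record per lesson. The following is a minimal sketch, assuming hand-written sample data and a hypothetical helper named to_rows; the resulting list of dicts could be passed to pandas.DataFrame if pandas is available.

```python
def to_rows(chapters):
    """Flatten {chapter: [(lesson_name, lesson_link), ...]} into one dict per lesson.

    to_rows is a hypothetical helper name used only for this illustration.
    """
    return [
        {'chapter': chapter, 'lesson': lesson_name, 'lesson_link': lesson_link}
        for chapter, lessons in chapters.items()
        for lesson_name, lesson_link in lessons
    ]


# Hand-written sample data taken from the question's example, not scraped output.
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')
chapters = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

rows = to_rows(chapters)
for row in rows:
    print(row['chapter'], '|', row['lesson'], '|', row['lesson_link'])
```

The nested form is probably the simpler fit for writing the Markdown file, while the flat form may be handier for filtering or sorting lessons.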