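I'm taking online courses and trying to automate capturing the course structure for my personal notes, which I keep in Markdown files.
Here is what the HTML for an example chapter looks like: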
<!-- Header of the chapter -->
<div class="chapter__header">
<div class="chapter__title-wrapper">
<span class="chapter__number">
<span class="chapter-number">1</span>
</span>
<h4 class="chapter__title">
Introduction to Experimental Design
</h4>
<span class="chapter__price">
Free
</span>
</div>
<div class="dc-progress-bar dc-progress-bar--small chapter__progress">
<span class="dc-progress-bar__text">0%</span>
<div class="dc-progress-bar__bar chapter__progress-bar">
<span class="dc-progress-bar__fill" style="width: 0%;"></span>
</div>
</div>
</div>
<p class="chapter__description">
An introduction to key parts of experimental design plus some power and sample size calculations.
</p>
<!-- !Header of the chapter -->
<!-- Body of the chapter -->
<ul class="chapter__exercises hidden">
<li class="chapter__exercise ">
<a class="chapter__exercise-link" href="https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1">
<span class="chapter__exercise-icon exercise-icon ">
<img width="23" height="23" src="https://cdn.datacamp.com/main-app/assets/courses/icon_exercise_video-3b15ea50771db747f7add5f53e535066f57d9f94b4b0ebf1e4ddca0347191bb8.svg" alt="Icon exercise video" />
</span>
<h5 class="chapter__exercise-title" title='Intro to Experimental Design'>Intro to Experimental Design</h5>
<span class="chapter__exercise-xp">
50 xp
</span>
</a> </li>
So far I've used BeautifulSoup to extract all the relevant information:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
lesson_outline = soup.find_all(['h4', 'li'])

outline_list = []
for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            outline_list.append(item.text.strip())
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            outline_list.append(lesson_name)
            outline_list.append(lesson_link)
    except KeyError:
        pass
This gives me a list like this:
['Introduction to Experimental Design', 'Intro to Experimental Design', 'https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1',...]
My goal is to put everything into a .md file that looks like this:
# Introduction to Experimental Design
* [Intro to Experimental Design](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=1)
* [A basic experiment](https://campus.datacamp.com/courses/experimental-design-in-r/introduction-to-experimental-design?ex=2)
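My question is: what is the best way to structure the data so that I can easily access it later when writing the text file? Would it be better to have a DataFrame with chapter, lesson, and lesson_link columns? A DataFrame with a MultiIndex? A nested dictionary? If a dictionary, what should I name the keys? Or is there another option I'm missing, such as some kind of database?
Any ideas would be greatly appreciated!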
Answer 0 (score: 0)
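If I read this correctly, you are currently appending every element, in order, to the list outline_list. But you clearly have not one but three different kinds of data: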
chapter__title
chapter__exercise.name
chapter__exercise.link
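Each title can have several exercises, and each exercise is always a pair of name and link. Since you also want to keep this structure in the text file, you can use any data structure that represents that hierarchy. One example: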
from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict

url = 'https://www.datacamp.com/courses/experimental-design-in-r'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
lesson_outline = soup.find_all(['h4', 'li'])

# Using OrderedDict assures that the order of the result will be the same as in the source
chapters = OrderedDict()  # {chapter: [(lesson_name, lesson_link), ...], ...}
for item in lesson_outline:
    attributes = item.attrs
    try:
        class_type = attributes['class'][0]
        if class_type == 'chapter__title':
            chapter = item.text.strip()
            chapters[chapter] = []
        if class_type == 'chapter__exercise':
            lesson_name = item.find('h5').text
            lesson_link = item.find('a').attrs['href']
            chapters[chapter].append((lesson_name, lesson_link))
    except KeyError:
        pass
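To make the shape of the result concrete, here is a small sketch of what the mapping could look like for the first chapter and how to read from it. The sample values are taken from the question; the names `base`, `first_chapter`, and `lesson_count` are only illustrative. A plain dict is used here on the assumption that you run Python 3.7 or later, where dicts keep insertion order; on older versions OrderedDict is still needed.

```python
# Sample data shaped like the `chapters` mapping:
# {chapter_title: [(lesson_name, lesson_link), ...], ...}
# `base` is just a helper variable to keep the sample short.
base = ('https://campus.datacamp.com/courses/experimental-design-in-r/'
        'introduction-to-experimental-design')

chapters = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', base + '?ex=1'),
        ('A basic experiment', base + '?ex=2'),
    ],
}

# Chapters come back in insertion order (guaranteed for dict in Python 3.7+).
first_chapter = next(iter(chapters))

# Each chapter maps to an ordered list of (name, link) pairs.
lessons = chapters[first_chapter]
lesson_count = len(lessons)

for lesson_name, lesson_link in lessons:
    print(first_chapter, '->', lesson_name, lesson_link)
```

Keeping name and link together in one tuple means they cannot drift out of step, which is the main weakness of the flat outline_list.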
From there, writing the text file should be easy:
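# 'outline.md' is an assumed output filename; use whatever path you like
with open('outline.md', 'w', encoding='utf-8') as f:
    for chapter, lessons in chapters.items():
        # write chapter title
        f.write(f'# {chapter}\n')
        for lesson_name, lesson_link in lessons:
            # write lesson
            f.write(f'* [{lesson_name}]({lesson_link})\n')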
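If you later decide you do want the tabular layout mentioned in the question (chapter, lesson, lesson_link columns), the nested mapping can be flattened after the fact. The sketch below is one possible way to do that with the standard library only; `to_rows` and `sample` are hypothetical names, not part of the code above. The resulting list of dicts should be something you can hand to pandas.DataFrame if you need a DataFrame, though that step is not shown here.

```python
def to_rows(chapters):
    """Flatten {chapter: [(lesson, link), ...]} into one dict per lesson."""
    return [
        {'chapter': chapter, 'lesson': name, 'lesson_link': link}
        for chapter, lessons in chapters.items()
        for name, link in lessons
    ]


# Hypothetical sample data in the same shape as `chapters`.
sample = {
    'Introduction to Experimental Design': [
        ('Intro to Experimental Design', 'https://example.com/?ex=1'),
        ('A basic experiment', 'https://example.com/?ex=2'),
    ],
}

rows = to_rows(sample)
for row in rows:
    print(row['chapter'], '|', row['lesson'], '|', row['lesson_link'])
```

Going from nested to flat is a one-liner, while going the other way needs grouping logic, so the nested form is a reasonable default for writing the Markdown file.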