Question

I am using Beautiful Soup to scrape some content from Wikipedia with the following code:

import requests
from bs4 import BeautifulSoup

s = 'September%2011'
url = 'https://en.wikipedia.org/w/api.php?action=query&titles={0}&prop=revisions&rvprop=content&rvsection=1&format=xml&formatversion=2'.format(s)
r = requests.get(url)
print(r.status_code)
content = r.text
soup = BeautifulSoup(content, "lxml")
# Each <rev> element holds the wikitext of the requested section as a single string
events = [rev.text for rev in soup.find_all("rev")]
print(events)

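The content is the set of events that happened on a particular date. On Wikipedia each event is displayed as a bullet point, but the API returns them as one long list: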

https://en.wikipedia.org/w/api.php?action=query&titles=September%2011&prop=revisions&rvprop=content&rvsection=1&format=xml&formatversion=2

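I want to put the content into a DataFrame with a separate row for each event (that is, a new row wherever "\n*" occurs).

I have looked at some answers about splitting lists, but I don't know how to apply them in this case.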

Answer 1

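Try this:

import pandas as pd

# events[0] is the section's wikitext; the piece before the first "\n*" is skipped
df = pd.DataFrame(events[0].split('\n*')[1:], columns=["Events"])
print(df)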

Events
0   [[1185]] &ndash; [[Isaac II Angelos]] kills [[...
1   [[1226]] &ndash; The first recorded instance o...
2   [[1297]] &ndash; [[Battle of Stirling Bridge]]...
3   [[1390]] &ndash; [[Lithuanian Civil War (1389–...
...
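
For anyone who wants to try the split without calling the API, here is a minimal offline sketch of the same idea. The `sample` string is a made-up stand-in for `events[0]` (the event texts are placeholders, not real Wikipedia content), and the `html.unescape` step for turning `&ndash;` into a dash is an optional extra that is not part of the answer above.

```python
import html

import pandas as pd

# Hypothetical stand-in for events[0]; the real wikitext comes from the API response.
sample = (
    "==Events==\n"
    "*[[1185]] &ndash; first placeholder event\n"
    "*[[1226]] &ndash; second placeholder event\n"
    "*[[1297]] &ndash; third placeholder event"
)

# Split on newline + asterisk; the first piece is the section heading, so drop it.
items = sample.split("\n*")[1:]

# Trim stray whitespace and decode HTML entities such as &ndash;.
items = [html.unescape(item.strip()) for item in items]

df = pd.DataFrame(items, columns=["Events"])
print(df)
```

The slice `[1:]` assumes the text starts with a heading before the first bullet; if the section began directly with `*`, the first event would still carry that leading asterisk and would need separate handling.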
