Question

I'm trying to scrape some data from a website, but I'm new to Python and HTML and could use some help.

Here is the part of the code that works:

from bs4 import BeautifulSoup
import requests
page_link = 'http://www.some-website.com'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(id='yyy')
print(data)

This successfully retrieves the data I want to scrape. When printed, it looks like this:

<div class="generalData" id="yyy">
<div class="generalDataBox">

<div class="rowText">
<label class="some-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>

<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>

... more rows here ...

</div></div>

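What is the best way to get this into a pandas DataFrame? Ideally it would have two columns: one with the label name (i.e. "Title Name" or "Another Title Name" above) and one with the data (i.e. ### and ###2 above).

Thanks!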

Answer 1

First, the extraction part:

html = """<div class="generalData" id="yyy">
<div class="generalDataBox">

<div class="rowText">
<label class="same-class-here" title="some-title-here">Title Name</label>
<span class="" id="">###</span>
</div>

<div class="rowText">
<label class="same-class-here" title="another-title-here">Another Title Name</label>
<span class="" id="">###2</span>
</div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Look up the labels and spans once instead of re-running find_all on every iteration.
labels = soup.find_all('label', class_="same-class-here")
spans = soup.find_all('span')

# The i-th label and the i-th span are assumed to belong to the same row.
titleList = [label.get_text() for label in labels]
hashList = [span.get_text() for span in spans[:len(labels)]]
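
The index-based extraction above relies on the i-th label and the i-th span belonging to the same row. As a minimal sketch of an alternative, assuming every row is a div with class rowText that contains one label and one span (as in the question's markup), the two can be paired row by row instead. The names sample_html and extract_rows are illustrative and do not come from the original post.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's printed output,
# where the label text sits on its own line.
sample_html = """<div class="generalData" id="yyy">
<div class="generalDataBox">
<div class="rowText">
<label class="same-class-here" title="some-title-here">
Title Name
</label>
<span class="" id="">###</span>
</div>
<div class="rowText">
<label class="same-class-here" title="another-title-here">
Another Title Name
</label>
<span class="" id="">###2</span>
</div>
</div></div>"""


def extract_rows(html):
    """Return a list of (title, value) tuples, one per div.rowText."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.find_all("div", class_="rowText"):
        label = row.find("label")
        span = row.find("span")
        # Skip rows missing either element rather than misaligning the columns.
        if label is None or span is None:
            continue
        # strip=True removes the newlines surrounding the label text.
        rows.append((label.get_text(strip=True), span.get_text(strip=True)))
    return rows


pairs = extract_rows(sample_html)
print(pairs)
```

Looking up the label and span inside each row keeps a title and its value together even if the page contains other span elements or a row is incomplete.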

Now that you have extracted what you need (here, the values for the two columns), use pandas to put them into a DataFrame.

import pandas as pd

df = pd.DataFrame()
df['Title'] = titleList
df['Hash'] = hashList

Output:

                Title  Hash
0          Title Name   ###
1  Another Title Name  ###2
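
The same frame can also be built in a single constructor call when the data is held as (title, value) pairs rather than two separate lists. This is a small sketch: the pairs list is hard-coded here as a stand-in for whatever the extraction step produced, and the column names simply reuse 'Title' and 'Hash' from above.

```python
import pandas as pd

# Stand-in for the extracted values; hard-coded to match the example rows.
pairs = [("Title Name", "###"), ("Another Title Name", "###2")]

# Each tuple becomes one row; columns= names the two fields.
df = pd.DataFrame(pairs, columns=["Title", "Hash"])
print(df)
```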

Parsing and extracting data into pandas with BeautifulSoup
