Question

When you open the following URL in a browser

http://www.kianfunds2.com/%D8%A7%D8%B1%D8%B2%D8%B4-%D8%AF%D8%A7%D8%B1%D8%A7%DB%8C%DB%8C-%D9%87%D8%A7-%D9%88-%D8%AA%D8%B9%D8%AF%D8%A7%D8%AF-%D9%88%D8%A7%D8%AD%D8%AF-%D9%87%D8%A7

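you will see a purple icon named "copy". When you select this icon ("copy"), you get a complete table that can be pasted into Excel. How can I get this table as input in Python?

My code is below, and it does not show anything: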

import requests
from bs4 import BeautifulSoup
url = "http://www.kianfunds2.com/" + "ارزش-دارایی-ها-و-تعداد-واحد-ها"
result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")
table = soup.find("a", class_="dt-button buttons-copy buttons-html5")

I don't want to use Selenium because it takes a lot of time. Please use Beautiful Soup.

Answer 1

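To me, it seems unnecessary to use any kind of web scraping here. Since you can download the data as a file anyway, going through the parsing that scraping would require to represent the data is not worth it.

Instead, you can simply download the data and read it into a pandas DataFrame. You need pandas installed. If you have Anaconda installed, you probably already have it on your machine; otherwise you may need to download Anaconda and install pandas: conda install pandas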

More Information on Installing Pandas

With pandas, you can read the data directly from the Excel sheet:

import pandas as pd
df = pd.read_excel("dataset.xlsx")

pandas.read_excel documentation

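If this causes difficulties, you can still convert the Excel sheet to CSV and use pd.read_csv. Note that you will need to use the correct encoding.

If for some reason you want to use BeautifulSoup: you may want to look at how to parse tables. For an ordinary table, you first identify the content you want to scrape. The table on that particular site has the ID "arzeshdarayi". It is also the only table on the page, so you could also select it by its <table> tag.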
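As a rough illustration of the encoding point, the sketch below writes a small CSV with a UTF-8 byte-order mark (which Excel's "CSV UTF-8" export produces) and reads it back. The column names and values are made up for the example, and the encoding of your own file may differ.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for a CSV exported from Excel.
sample = "تاریخ,NAV\n1398/01/01,1000\n1398/01/02,1010\n"

# Write the sample with a UTF-8 byte-order mark (BOM).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8-sig"
) as f:
    f.write(sample)
    csv_path = f.name

# "utf-8-sig" strips the BOM, so the first column name is read cleanly.
df_csv = pd.read_csv(csv_path, encoding="utf-8-sig")
os.remove(csv_path)

print(df_csv)
```

If the column names or text come out garbled, the file was probably saved in a different encoding, and the `encoding` argument needs to match it.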

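# Either look the table up by tag name and id ...
table = soup.find("table", id="arzeshdarayi")
# ... or use a CSS selector (select() returns a list, so take the first match)
table = soup.select("#arzeshdarayi")[0]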
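To make the difference between the two lookups concrete, here is a small self-contained sketch. The HTML string is a hypothetical stand-in for the static page (a header row and an empty body), not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the static HTML: header only, no data rows.
html = """
<table id="arzeshdarayi">
  <thead><tr><th>Date</th><th>NAV</th></tr></thead>
  <tbody></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

table_tag = soup.find("table", id="arzeshdarayi")  # a single Tag, or None
table_list = soup.select("#arzeshdarayi")          # a list of matching Tags
table_one = soup.select_one("#arzeshdarayi")       # a single Tag, or None

# Only the header cells are present; the body has no rows to parse.
headers = [th.get_text(strip=True) for th in table_tag.find_all("th")]
body_rows = table_tag.find("tbody").find_all("tr")

print(headers, len(body_rows))
```

An empty list of body rows is what you would expect to see when the data is filled in later by JavaScript.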

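The table on the site you provided has only a static header; the data is rendered by JavaScript, so BeautifulSoup cannot retrieve it. However, you can use the JSON object that the JavaScript uses and, again, read it as a DataFrame: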

import requests
import pandas as pd

# Request the JSON endpoint that the page's JavaScript uses to fill the table
r = requests.get("http://www.kianfunds2.com/json/gettables.ashx?get=arzeshdarayi")
data = r.json()
df = pd.DataFrame.from_dict(data)
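How the JSON maps onto a DataFrame depends on its layout, which is not shown here. The sketch below uses a made-up payload (the field names `date`, `nav`, `units` and the wrapper key `data` are assumptions, not the endpoint's real schema) to show the two most common shapes.

```python
import json

import pandas as pd

# Hypothetical payload; the real endpoint's field names may differ.
payload = json.loads(
    '[{"date": "1398/01/01", "nav": 1000, "units": 50},'
    ' {"date": "1398/01/02", "nav": 1010, "units": 52}]'
)

# Shape 1: a list of records maps directly to rows.
df_json = pd.DataFrame(payload)

# Shape 2: the records are nested under a key (here the assumed key "data").
wrapped = {"data": payload}
df_wrapped = pd.DataFrame(wrapped["data"])

print(df_json)
```

Inspecting `type(data)` and, for a dict, `data.keys()` on the real response should tell you which shape applies.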

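If you really want to scrape it, you need some kind of browser emulation so that the JavaScript is evaluated before you access the HTML. This answer suggests using Requests-HTML, a very high-level approach to web scraping that combines Requests, BS, and a way to render JavaScript. Your code would look something like this: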

import requests_html as request
session = request.HTMLSession()
url = "http://www.kianfunds2.com/ارزش-دارایی-ها-و-تعداد-واحد-ها"
r = session.get(url)

# Render the page, including its JavaScript.
# Uses Chromium (downloaded on first execution).
r.html.render(sleep=1)

# Find the table by its id and take only the first result
table = r.html.find("#arzeshdarayi")[0]

# Loop through the table rows
for items in table.find("tr"):
    # Collect the text of every header and data cell in the row,
    # dropping the last cell
    data = [item.text for item in items.find("th,td")[:-1]]
    print(data)
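If you would rather end up with a DataFrame than printed lists, one possible follow-up is sketched below. The rows are hypothetical values shaped like the lists the loop prints (first row as headings, cells as strings); the real column names and number formatting may differ.

```python
import pandas as pd

# Hypothetical rows, shaped like the lists produced by the loop:
# the first row holds the headings, the rest hold string cells.
rows = [
    ["Date", "NAV", "Units"],
    ["1398/01/01", "1,000", "50"],
    ["1398/01/02", "1,010", "52"],
]

header, *records = rows
df_rows = pd.DataFrame(records, columns=header)

# Scraped cells are strings: strip thousands separators, then convert.
for col in ["NAV", "Units"]:
    df_rows[col] = pd.to_numeric(df_rows[col].str.replace(",", "", regex=False))

print(df_rows)
```

In the scraping loop itself, you would append each `data` list to `rows` instead of printing it.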

Web scraping with BS4: unable to get the table
