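I'm trying my hand at web scraping again. This time I'm using Python 3.6 to turn the tables at https://www.reuters.com/finance/stocks/financial-highlights/KEPL3.SA into a DataFrame, so that I can build a Piotroski F-score for companies listed on BOVESPA, the Brazilian stock exchange. I searched online and found ready-made, free solutions from Quantopian and Quandl, but they don't seem to cover Brazilian assets, so I intend to at least start developing something similar myself. I'm just getting started with Python and Beautiful Soup, so please excuse my clumsy code.
Here is what I have so far: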
import requests, bs4
res = requests.get("https://www.reuters.com/finance/stocks/financial-highlights/KEPL3.SA")
res.raise_for_status()
rawsoup = bs4.BeautifulSoup(res.text, "lxml")
for row in rawsoup.find_all('tr'):
    cols = row.find_all('td')
    print(cols)
This gives me the following output:
$ python3 reuters_data.py
[]
[]
[<td>P/E Ratio (TTM)</td>, <td class="data">--</td>, <td class="data">15.32</td>, <td class="data">24.24</td>]
[<td>
P/E High - Last 5 Yrs.</td>, <td class="data">67.86</td>, <td class="data">36.54</td>, <td class="data">39.87</td>]
[<td>
P/E Low - Last 5 Yrs.</td>, <td class="data">9.48</td>, <td class="data">8.71</td>, <td class="data">15.24</td>]
[<td colspan="5"></td>]
[<td>
Beta</td>, <td class="data">0.64</td>, <td class="data">1.33</td>, <td class="data">1.01</td>]
[<td colspan="5"></td>]
[<td>
Price to Sales (TTM)</td>, <td class="data">0.43</td>, <td class="data">1.29</td>, <td class="data">2.27</td>]
[<td>
Price to Book (MRQ)</td>, <td class="data">0.58</td>, <td class="data">2.13</td>, <td class="data">2.70</td>]
[<td>
Price to Tangible Book (MRQ)</td>, <td class="data">0.65</td>, <td class="data">2.74</td>, <td class="data">5.41</td>]
[<td>
Price to Cash Flow (TTM)</td>, <td class="data">--</td>, <td class="data">9.83</td>, <td class="data">15.03</td>]
.
.
.
[<td><strong># Net Buyers:</strong></td>, <td class="data"> <span class="changeUp">1</span> </td>]
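(I omitted part of the output in the middle, but it is all there.)
Now I've hit a wall: I don't know how to convert this into a DataFrame properly, so that I can actually do math on the numbers in the table.
Any help is appreciated, and if my source is a poor one or there is a better one, feel free to point it out.
Thanks a lot. Looking forward to your answers.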
Answer 0 (score: 1)
You can use a script like this as a starting point:
import requests, bs4
import pandas as pd
res = requests.get("https://www.reuters.com/finance/stocks/financial-highlights/KEPL3.SA")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.content, features="lxml")
# find all 'table' tags with class "dataTable" in the HTML document
data_table = soup.find_all('table', {"class": "dataTable"})
i = 0
# create an empty dictionary to be used later with pandas
Dict = {}
current_header = ""
# only the second table matters, which is why the index is 1;
# then find every 'tr' tag with class "stripe", as used on the URL you gave
for row in data_table[1].find_all('tr', {"class": "stripe"}):
    # find every 'td' tag inside the 'tr' tags
    for cells in row.find_all('td'):
        # in your case every 4th cell is a row label, so we use it as a dictionary key
        if i % 4 == 0:
            # strip surrounding whitespace from the label and add it to the dictionary
            current_header = str(cells.find(text=True)).strip()
            Dict[current_header] = []
        else:
            data = cells.find(text=True)
            # try to parse the data as a float; otherwise it is a '--'
            # and we use 0 so the value is still a number
            try:
                data = float(data)
            except (TypeError, ValueError):
                data = 0
            # append the value to the list under the current key (header)
            Dict[current_header].append(data)
        i += 1
# now you have data similar to Dict = {'P/E Ratio (TTM)': [0, 15.32, 24.24], ...}
# using pandas we create a data frame from the dictionary
df = pd.DataFrame(Dict)
# head is a method, so it has to be called to print the first rows
print(df.head())
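One thing to watch in the script above: it stores 0 for cells that show '--', and a 0 is hard to tell apart from a real value once you start computing ratios for the F-score. Below is a minimal sketch of an alternative that keeps missing values as NaN and organizes the data with one entry per metric. It uses only the standard library and sample rows copied from the output in the question. The column labels in COLUMNS and the helper names to_number and rows_to_records are hypothetical, since the question's output does not show the table headings.

```python
import math

# Hypothetical column labels: the output in the question shows three
# numeric cells per row, but not their headings.
COLUMNS = ["company", "industry", "sector"]


def to_number(text):
    """Convert a cell's text to float; return NaN for placeholders such as '--'."""
    try:
        return float(text.strip().replace(",", ""))
    except ValueError:
        return math.nan


def rows_to_records(rows):
    """Turn [label, v1, v2, v3] string rows into {label: {column: value}}."""
    records = {}
    for row in rows:
        # Skip empty, spacer, or otherwise malformed rows.
        if len(row) != len(COLUMNS) + 1:
            continue
        label, *values = row
        records[label.strip()] = dict(zip(COLUMNS, map(to_number, values)))
    return records


# Sample rows copied from the output shown in the question.
sample = [
    [],                                        # row with no 'td' cells
    ["P/E Ratio (TTM)", "--", "15.32", "24.24"],
    [""],                                      # spacer row (<td colspan="5"></td>)
    ["\nBeta", "0.64", "1.33", "1.01"],
]
records = rows_to_records(sample)
```

If this shape suits you, pd.DataFrame.from_dict(records, orient="index") should give a frame with one row per metric, and pandas arithmetic will then propagate NaN for the missing cells instead of treating them as zero.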