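Complete beginner here... I'm trying to scrape the constituents table from this Wikipedia page, but what I get is the annual returns table (the first table on the page) rather than the constituents table I need (the second table). Could someone help me see whether there is a way to target a specific table using BeautifulSoup4?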
import bs4 as bs
import pickle
import requests

def save_klci_tickers():
    resp = requests.get('https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI')
    soup = bs.BeautifulSoup(resp.text)
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("klcitickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    print(tickers)
    return tickers

save_klci_tickers()
Answer 0 (score: 1)
For a quick solution, try the pandas library to fetch the tabular data from that page and save it to a CSV file:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI'
df = pd.read_html(url, attrs={"class": "wikitable"})[1] #change the index to get the table you need from that page
new = pd.DataFrame(df, columns=["Constituent Name", "Stock Code", "Sector"])
new.to_csv("wiki_data.csv", index=False)
print(df)
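Picking the table by position (`[1]`) stops working if a table is added to or removed from the page. As a possible alternative, `pd.read_html` also takes a `match` argument that keeps only the tables containing the given text. Below is a minimal sketch, assuming the constituents table contains the text "Stock Code" (as the column names above suggest). It runs on a small inline HTML sample whose rows are made up for illustration, not on the real page, and it needs one of the parsers pandas relies on (`lxml`, or `bs4` with `html5lib`) to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the Wikipedia page: two tables, the second
# being the one we want. All cell values are invented for illustration.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Return</th></tr>
  <tr><td>2001</td><td>1.5</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Constituent Name</th><th>Stock Code</th><th>Sector</th></tr>
  <tr><td>Example Bank</td><td>1111</td><td>Finance</td></tr>
  <tr><td>Example Telecom</td><td>2222</td><td>Telecommunications</td></tr>
</table>
"""

# match keeps only the tables whose text matches this string (or regex),
# so the result does not depend on the table's position in the page.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Stock Code")
constituents = tables[0]
names = constituents["Constituent Name"].tolist()
print(names)
```

With the real page you would pass the URL instead of `StringIO(SAMPLE_HTML)`. If more than one table contains the matching text, `read_html` returns all of them, so it is worth checking `len(tables)`.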
If you still want to use BeautifulSoup, the following should serve the purpose:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI")
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select("table.wikitable")[1].select("tr"):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)
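The loop above prints every row. To get back to the ticker list the question asks for, you could collect a single column instead. This is only a sketch: `column_values` is a hypothetical helper, the inline HTML is a made-up sample rather than the real page, and the position of the stock code column (index 1 here) is an assumption you would need to check against the actual table. It uses the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; all cell values are invented.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Return</th></tr>
  <tr><td>2001</td><td>1.5</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Constituent Name</th><th>Stock Code</th><th>Sector</th></tr>
  <tr><td>Example Bank</td><td>1111</td><td>Finance</td></tr>
  <tr><td>Example Telecom</td><td>2222</td><td>Telecommunications</td></tr>
</table>
"""


def column_values(html, table_index, column_index):
    """Return the text of one column from the table_index-th wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select("table.wikitable")[table_index]
    values = []
    for row in table.select("tr"):
        cells = row.select("td")
        # Header rows hold only <th> cells, so they have no <td> and are skipped.
        if len(cells) > column_index:
            values.append(cells[column_index].get_text(strip=True))
    return values


# Second table (index 1), second column (index 1): both are assumptions.
tickers = column_values(SAMPLE_HTML, table_index=1, column_index=1)
print(tickers)
```

Skipping rows without enough `<td>` cells also avoids the `IndexError` that `row.findAll('td')[0]` would raise on a header-only row.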
If you want to use .find_all() instead of .select(), try the following:
for items in soup.find_all("table", class_="wikitable")[1].find_all("tr"):
    data = [item.get_text(strip=True) for item in items.find_all(["th", "td"])]
    print(data)
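Both variants pick the table by its position on the page. If you would rather not depend on the order of the tables, one option is to look for a table whose header cells include a known column name. This is a rough sketch, not the only way to do it: `find_table_by_header` is a hypothetical helper, the HTML is an invented sample, and "Stock Code" is assumed to be a header of the constituents table.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; all cell values are invented.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Return</th></tr>
  <tr><td>2001</td><td>1.5</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Constituent Name</th><th>Stock Code</th><th>Sector</th></tr>
  <tr><td>Example Bank</td><td>1111</td><td>Finance</td></tr>
</table>
"""


def find_table_by_header(html, header_text):
    """Return the first wikitable with a <th> equal to header_text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table", class_="wikitable"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None


table = find_table_by_header(SAMPLE_HTML, "Stock Code")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

The helper returns `None` when no table matches, so on a real page it is safer to check the result before calling `.find_all()` on it.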