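I have been trying a few ways to loop over TR and TD elements, and working out how to iterate through the rows of this table to pull in the data I am after. I then found an easy way to get at the same data without looping. Here is my code.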
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table')[0]
print(table.prettify())
The only problem is that the data comes out with all of its HTML markup, like this.
<table id="holidayTable">
<tr>
<th class="left light" colspan="3">
Holiday
</th>
<th class="left light">
Markets Closed
</th>
</tr>
<tr>
<td class="bold left" valign="top">
01/01/2018
How can I clean up this data and load it into a data frame? I would like it to look like this.
Thanks for taking the time to have a look!!
Answer 0 (score: 1)
The simple way is:
import pandas as pd
url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"
dfs = pd.read_html(url)
df = dfs[0]
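But this is a nice example for practicing BeautifulSoup, because the tags are very clean. You have already found the table tag; now you only need to iterate over the rows and put them into a data frame.
First, I initialize an empty data frame to hold my results: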
results = pd.DataFrame()
Then I find all the tr tags in the table you stored:
rows = table.find_all('tr')
Next, for each row, I find the data cells tagged td and put their text into a list:
data = row.find_all('td')
row_data = [ x.text for x in data ]
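As a side note on the cell text: the prettified output in the question shows each value on its own line, so it may be worth stripping whitespace when reading the cells. Here is a minimal sketch, assuming a made-up one-row fragment (the second cell, "Sample Holiday", is a placeholder and not real page content) and the built-in "html.parser" so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the prettified output in the question.
# "Sample Holiday" is a placeholder, not a value from the real page.
html = """
<table id="holidayTable">
<tr>
<td class="bold left" valign="top">
01/01/2018
</td>
<td>  Sample Holiday </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find("tr").find_all("td")

raw = [td.text for td in cells]                     # keeps newlines and padding
clean = [td.get_text(strip=True) for td in cells]   # strips surrounding whitespace

print(raw)
print(clean)
```

Whether the live page actually carries that padding is not something I have checked; get_text(strip=True) is harmless either way.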
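I put that list into a temporary data frame and append it to the initial results data frame:
temp_df = pd.DataFrame([row_data])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
results = pd.concat([results, temp_df])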
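Finally, I drop the empty rows and reset the index. I don't know what column names you want, but you can rename the columns on the last line. Alternatively, the column headers are usually the table's th tags, so you can always go back and grab those.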
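If you do want to take the names from the th tags, note that the first header in the question's output has colspan="3", so two th tags cover four columns. A rough sketch of one way to handle that, assuming only the header row shown in the question; the "_1", "_2", "_3" suffix scheme is my own invention for illustration, not something the page defines:

```python
from bs4 import BeautifulSoup

# Header row copied from the question's output (whitespace collapsed).
html = """
<table id="holidayTable">
<tr>
<th class="left light" colspan="3">Holiday</th>
<th class="left light">Markets Closed</th>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="holidayTable")

headers = []
for th in table.find_all("th"):
    label = th.get_text(strip=True)
    span = int(th.get("colspan", 1))  # attribute values come back as strings
    if span == 1:
        headers.append(label)
    else:
        # Hypothetical naming: repeat the label once per spanned column.
        headers.extend(f"{label}_{i}" for i in range(1, span + 1))

print(headers)
```

The resulting list has one name per column, so it could be assigned to results.columns in place of the hard-coded names.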
Full code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table')[0]
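results = pd.DataFrame()
rows = table.find_all('tr')
for row in rows:
    # the header row has only th cells, so it yields an empty list here
    data = row.find_all('td')
    row_data = [x.text for x in data]
    temp_df = pd.DataFrame([row_data])
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    results = pd.concat([results, temp_df])
# drop the all-empty row left by the header and renumber the index
results = results.dropna(how='all').reset_index(drop=True)
results.columns = ['col1', 'col2', 'col3', 'col4']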
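A variation you might prefer: collect the rows in a plain list and build the data frame once at the end, which avoids growing a data frame inside the loop. This is only a sketch, assuming an inline stand-in for the page; every cell other than 01/01/2018 is a placeholder I made up, and "html.parser" is used so that it runs without lxml:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Inline stand-in for the real page; the placeholder cells are invented.
html = """<table id="holidayTable">
<tr><th colspan="3">Holiday</th><th>Markets Closed</th></tr>
<tr><td>01/01/2018</td><td>placeholder A</td><td>placeholder B</td><td>placeholder C</td></tr>
<tr><td>placeholder D</td><td>placeholder E</td><td>placeholder F</td><td>placeholder G</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip the header row, which has th cells only
        rows.append(cells)

# Same placeholder column names as in the full code above.
results = pd.DataFrame(rows, columns=["col1", "col2", "col3", "col4"])
print(results)
```

Because the header row is skipped up front, no dropna or reset_index step is needed afterwards.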
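Answer 1 (score: 0)
To do the same thing in a slightly different way, you can look at the approach below. The [1:] in the script is there to kick out the th values. I tried to get rid of the redundancy: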
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
data = pd.DataFrame()
for row in soup.find(id='holidayTable').find_all('tr')[1:]:
    tds = [td.text for td in row.find_all('td')]
    add_list_to_df = pd.DataFrame([tds])
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    data = pd.concat([data, add_list_to_df])
df = pd.DataFrame({"Header1":data[0],"Header2":data[1],"Header3":data[2],"Header4":data[3]}).to_string(index=False)
print(df)