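Screen-scrape a table, clean it, and load it into a DataFrame

Time: 2018-12-20 22:55:57

Tags: python python-3.x dataframe web-scraping beautifulsoup

I have been trying several ways to loop through the tr and td elements, and I figured out how to loop over the rows of the table to pull out the data I want. Then I found an easier way to get the same data without looping. Here is my code.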

from bs4 import BeautifulSoup
import requests
import pandas as pd            
url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table')[0]
print(table.prettify())

The only problem is that the data still comes with all of its HTML markup, like this.

<table id="holidayTable">
 <tr>
  <th class="left light" colspan="3">
   Holiday
  </th>
  <th class="left light">
   Markets Closed
  </th>
 </tr>
 <tr>
  <td class="bold left" valign="top">
   01/01/2018

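How can I clean up this data and load it into a DataFrame? I would like it to look like the table in the attached screenshot.

[screenshot of the desired output; image not preserved]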

Thanks for taking the time to have a look!

2 Answers:

Answer 0: (score: 1)

The simple way is:

import pandas as pd

url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"

dfs = pd.read_html(url)
df = dfs[0]
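If the page holds more than one table, `read_html` can be pointed at a specific one. The following is only a sketch: the markup is a made-up stand-in for the downloaded page (the holiday names and the second table are invented for illustration), and it assumes `lxml` is installed, as in the question's code.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup standing in for the downloaded page.
html = """
<table id="holidayTable">
  <tr><th>Date</th><th>Holiday</th></tr>
  <tr><td>01/01/2018</td><td>New Year's Day</td></tr>
  <tr><td>12/25/2018</td><td>Christmas Day</td></tr>
</table>
<table id="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
"""

# attrs narrows the search to tables whose attributes match.
tables = pd.read_html(StringIO(html), attrs={"id": "holidayTable"})
df = tables[0]
print(df)
```

Wrapping the string in `StringIO` avoids the deprecation warning that recent pandas versions raise when literal HTML is passed directly.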

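But this is a good example for practicing BeautifulSoup, because the page's tags are very clean. You have already found the table tag, so now you just need to iterate over the rows and put them into a DataFrame.

First, I initialize an empty DataFrame to store my results: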

results = pd.DataFrame()

Then I find all of the tr tags in the table you stored:

rows = table.find_all('tr')

Next, for each row, I find the data cells tagged td and put their text into a list:

data = row.find_all('td')
row_data = [ x.text for x in data ]
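One thing to watch for: if the cells carry surrounding whitespace in the page source, `.text` keeps it. That is an assumption about the real page, since `prettify()` adds its own indentation. A small sketch of the difference, using a hypothetical one-cell table and the built-in `html.parser` so it runs without `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell sample, indented the way the prettified output is.
html = """
<table>
 <tr>
  <td class="bold left" valign="top">
   01/01/2018
  </td>
 </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

raw = cell.text                    # keeps the newlines and indentation
clean = cell.get_text(strip=True)  # strips leading and trailing whitespace

print(repr(raw))
print(repr(clean))
```

Swapping `x.text` for `x.get_text(strip=True)` in the list comprehension gives already-trimmed values.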

I put that list into a temporary DataFrame, which I then append to the initial results DataFrame:

temp_df = pd.DataFrame([row_data])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
results = pd.concat([results, temp_df])

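Finally, I drop the empty rows and reset the index. I don't know what column names you want, but you can rename the columns in the last line. Alternatively, the column headers are usually the table's th tags, so you can always go back and grab those.

Full code: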
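As a rough sketch of grabbing the th tags for column names: the table below is a hypothetical, simplified one with exactly one th per column. In the real page the first th has `colspan="3"` (see the snippet in the question), so the counts would not match, the guard would leave the default numeric columns in place, and you would set the names by hand.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical simplified table: one th per column, one data row.
html = """
<table id="holidayTable">
 <tr><th>Date</th><th>Holiday</th><th>Markets Closed</th></tr>
 <tr><td>01/01/2018</td><td>New Year's Day</td><td>All</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="holidayTable")

# One data row, so all td cells in the table belong to that row.
results = pd.DataFrame([[td.get_text(strip=True) for td in table.find_all("td")]])
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Only rename when there is exactly one header per column.
if len(headers) == results.shape[1]:
    results.columns = headers

print(results)
```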

from bs4 import BeautifulSoup
import requests
import pandas as pd            
url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
table = soup.find_all('table')[0]


results = pd.DataFrame()
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    row_data = [ x.text for x in data ]
    temp_df = pd.DataFrame([row_data])
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    results = pd.concat([results, temp_df])

results = results.dropna(how='all').reset_index(drop = True)
results.columns = ['col1', 'col2', 'col3', 'col4']
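A variation you might prefer, sketched here under assumptions rather than as the one right way: collect each row as a plain list and build the DataFrame once at the end. That skips the temporary frames and the `dropna` step, because the header row (which has no td cells) is filtered out up front. The sample rows are made up, and `records` is just an illustrative name.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample rows; the real page would be fetched with requests as above.
html = """
<table id="holidayTable">
 <tr><th colspan="3">Holiday</th><th>Markets Closed</th></tr>
 <tr><td>01/01/2018</td><td>Mon</td><td>New Year's Day</td><td>All</td></tr>
 <tr><td>12/25/2018</td><td>Tue</td><td>Christmas Day</td><td>All</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # the header row has only th cells, so it yields an empty list
        records.append(cells)

# Build the frame in one step instead of growing it row by row.
results = pd.DataFrame(records, columns=["col1", "col2", "col3", "col4"])
print(results)
```

Growing a DataFrame inside a loop copies the data on every iteration, so building it once from a list tends to scale better for larger tables.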

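Answer 1: (score: 0)

To do the same thing in a slightly different way, you can take a look at the following approach. The [1:] in the script skips the row of th values. I tried to get rid of the redundancy: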

from bs4 import BeautifulSoup
import requests
import pandas as pd 

url = "https://markets.on.nytimes.com/research/markets/holidays/holidays.asp?display=market&exchange=SGO"

res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
data = pd.DataFrame()

for row in soup.find(id='holidayTable').find_all('tr')[1:]:
    tds = [cell.text for cell in row.find_all('td')]
    add_list_to_df = pd.DataFrame([tds])
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    data = pd.concat([data, add_list_to_df])

df = pd.DataFrame({"Header1":data[0],"Header2":data[1],"Header3":data[2],"Header4":data[3]}).to_string(index=False)
print(df)
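Note that `to_string` returns plain text, so the `df` printed above is a `str`, not a DataFrame. If you still need the frame for further work, keep it separate from the printable version. A small sketch with a made-up row and illustrative variable names:

```python
import pandas as pd

# Hypothetical single row, using the same header names as above.
frame = pd.DataFrame(
    [["01/01/2018", "New Year's Day"]],
    columns=["Header1", "Header2"],
)

text = frame.to_string(index=False)  # formatted text, index column omitted

print(type(frame).__name__)
print(type(text).__name__)
print(text)
```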