How do I clean up the data in this web scraping script?

Time: 2017-10-02 14:50:55

Tags: python css python-3.x web-scraping beautifulsoup

So here is my code:

import requests
from bs4 import BeautifulSoup
import lxml

r = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(r.text, "lxml")

tables = soup.find_all('table')
print(tables)

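I had to make a POST request because it is an ASP page and I needed to fetch the right data: all the tables for the College of Business in a specific term. The problem is the output: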

<tr class="tableback2"><td>Overall assessment of instructor</td><td align="right">0.0%</td><td align="right">56.8%</td><td align="right">27.0%</td><td align="right">13.5%</td><td align="right">2.7%</td><td align="right">0.0%</td> </tr>
</table>, <table align="center" border="0" cellpadding="0" cellspacing="0" width="75%">
<tr class="boldtxt"><td>Term: 1175 - Summer 2017</td></tr><tr class="boldtxt"><td>Instructor Name: Austin, Lathan Craig</td><td colspan="6"> Department: MARKETING</td></tr>
<tr class="boldtxt"><td>Course: TRA   4721  </td><td colspan="2">Section: RVBB-1</td><td colspan="4">Title: Global Logistics</td></tr>
<tr class="boldtxt"><td>Enrolled: 56</td><td colspan="2">Ref#: 55703 -1</td><td colspan="4"> Completed Forms: 46</td></tr>

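I would like BeautifulSoup to parse the text and return it neatly, with each column separated. After that I want to put it into a DataFrame, or maybe save it to a CSV file... but I don't know how to get rid of all the HTML tags and attributes. I tried the code below, which deletes the listed attributes, but it does nothing for td and tr: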

for tag in soup():
    for attribute in ["class", "id", "name", "style", "td", "tr"]:
        del tag[attribute]

Then I tried a package called bleach, but when I passed 'tables' to it, it complained that the input must be text. So I obviously can't put my tables into it. This is ideally what I would like to see with my output.

So I really don't know how to format this properly. Any help is greatly appreciated.

1 answer:

Answer 0: (score: 0)

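Give this a try. I think it is what you are expecting. By the way, the page contains more than one table; if you want a different one, change the index, as in soup.select('table')[n]. Thanks.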

import requests
from bs4 import BeautifulSoup

res = requests.post('https://opir.fiu.edu/instructor_evals/instr_eval_result.asp', data={'Term': '1175', 'Coll': 'CBADM'})
soup = BeautifulSoup(res.text, "lxml")

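# Take the first table on the page; use soup.select('table')[n] for another one
table = soup.select('table')[0]
# Build one list of cell texts per row, dropping non-breaking spaces
rows = [[cell.text.replace("\xa0", "") for cell in row.select("td")]
        for row in table.select("tr")]

for row in rows:
    print(' '.join(row))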

Partial result:

Term: 1175 - Summer 2017
Instructor Name: Elias, Desiree   Department: SCHACCOUNT
Course: ACG   2021   Section: RVCC-1 Title: ACC Decisions
Enrolled: 118 Ref#: 51914 -1  Completed Forms: 36
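For the DataFrame/CSV part of the question, here is a minimal sketch of one way to go from per-row cell lists like the ones built above to a CSV. It uses only the standard library. The sample cells are copied from the partial result (the exact whitespace inside each cell is a guess), and the helper names cells_to_record and records_to_csv are made up for illustration. It also assumes every cell in this header table has a "Label: value" shape, which may not hold for the ratings tables.

```python
import csv
import io

# One list of cell strings per <tr>, shaped like the output of the code above.
# Values are taken from the partial result; inner whitespace is an assumption.
rows = [
    ["Term: 1175 - Summer 2017"],
    ["Instructor Name: Elias, Desiree", " Department: SCHACCOUNT"],
    ["Course: ACG   2021  ", "Section: RVCC-1", "Title: ACC Decisions"],
    ["Enrolled: 118", "Ref#: 51914 -1", " Completed Forms: 36"],
]


def cells_to_record(rows):
    """Turn 'Label: value' cells into one flat dict (hypothetical helper)."""
    record = {}
    for row in rows:
        for cell in row:
            # Split on the first colon only, so values may contain colons.
            label, sep, value = cell.partition(":")
            if sep:  # ignore cells that do not look like 'Label: value'
                # Collapse runs of whitespace such as 'ACG   2021'.
                record[label.strip()] = " ".join(value.split())
    return record


def records_to_csv(records):
    """Serialize dicts to CSV text; the first record's keys become the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


record = cells_to_record(rows)
csv_text = records_to_csv([record])
print(csv_text)
```

If pandas is installed, pandas.DataFrame([record]) should give a one-row DataFrame with the same columns, and you could collect one record per header table to get one row per course section. Note that csv.DictWriter raises an error if a later record has keys that are missing from the header, so tables with a different layout would need their own handling.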