在Python中清理HTML Parse

时间:2014-10-21 19:09:18

标签: python html parsing web-scraping beautifulsoup

下面我的代码从(http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY)中删除tr,align ='center'标记内的td元素,用逗号分隔每个元素,并将结果写入文本文件:

import bs4
import requests 

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')

soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")

for tr in soup.find_all('tr', align='center'):    
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.strip())

    acct.write(", ".join(stack))

然而,当写入文本文件时,有很多空白行(我想消除它),并且每行不以正确的元素开头。

以下是我的.txt文件与我当前代码的相似之处:

以下是我希望它的样子: enter image description here

如何更改代码以删除所有空白行并让每行以“OPEN”开头,依此类推?

1 个答案:

答案 0 :(得分:1)

问题是您在td.text中有换行符。将其替换为空字符串,并在末尾添加换行符。还可以替换制表符,以匹配您想要的输出:

for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())

    acct.write(", ".join(stack) + '\n')

产地:

STATUS, CRN, SUBJECT, SECT, COURSE, CREDIT, INSTR., BLDG/RM, DAY/TIME, FROM / TO, 
OPEN, 41552, ACCY 2001, 10, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 251, TR09:35AM - 10:50AM, 01/12/15 - 04/27/15, 
OPEN, 40002, ACCY 2001, 11, Intro Financial Accounting, 3.00, Rozenbaum, O, DUQUES 353, TR11:10AM - 12:25PM, 01/12/15 - 04/27/15, 
...