BeautifulSoup: remove the br in a table and add the full link

Time: 2018-03-22 11:22:06

Tags: python beautifulsoup

I am finally getting closer to extracting the table from a particular website, but my problem is that I can't seem to figure out how to

  1. show the full link to the downloadable file
  2. remove the br in specific rows

The HTML code is below:

    <table border="1" cellpadding="5" cellspacing="0">
    <tr class="bg">
    <td><strong>Reference</strong></td>
    <td style="width:100px"><strong>Description</strong></td>
    <td><strong>Download Documents</strong></td>
    <td style="width:50px"><strong>Closing Date</strong></td>
    <td style="width:50px"><strong>Contact Details</strong></td>
    <td><strong>Briefing</strong></td>
    <!--<td><strong>PUBLISHED</strong></td>-->
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td>
    <td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td>
    <td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td><td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td><td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me</td>
    <td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me</td>
    <td>next week</td>
    </tr>
    </table>
    

    I want to remove the br in the Contact Details cells, and show the full link instead of "Documents click here".

    Note that this is a sample table, reconstructed from the original project.
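    Both issues can be handled at the cell level in BeautifulSoup. A minimal sketch against the sample HTML above (standalone snippets, not the full scraper; the base URL is illustrative):

    ```python
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    # A <br/> shows up as a line break in .text; get_text() with a
    # separator joins the text pieces with a space instead.
    cell = BeautifulSoup('<td>me<br />\n  you</td>', "html.parser").td
    contact = cell.get_text(" ", strip=True)   # "me you"

    # The full link lives in the <a> tag's href attribute; urljoin
    # turns the relative path into an absolute URL.
    link_cell = BeautifulSoup(
        '<td><a href="/downloads/linktofile.zip">Documents click here </a></td>',
        "html.parser")
    full_url = urljoin("https://example.com", link_cell.a["href"])
    ```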

    My Python code runs fine; the problem is just how the content after the new link gets added to the whole output.csv.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import csv
    import requests
    import os
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    from bs4 import Tag 
    
    testwebsite = 'https://example.com'
    
    uClient = uReq(testwebsite)
    page_html = uClient.read()
    uClient.close()
    
    page_soup = soup(page_html, "html.parser")
    
    testwebsitetendersaved=""
    # Table is formatted very badly: a span tag with tables nested within tables
    testwebsite_container = page_soup.find("span", id="MainContent2_ctl00_lblContent").findAll("table")[1]
    
    for record in testwebsite_container.findAll('tr'):
        testwebsitetender=""
        for data in record.findAll('td'):
            testwebsitetender=testwebsitetender+","+data.text
        testwebsitetendersaved = testwebsitetendersaved + "\n" + testwebsitetender[1:]
    
    
    header="Tender Number, Description, Documents Link, Closing Date, Contact Details, Briefing"+"\n"
    file = open(os.path.expanduser("output.csv"), "wb")
    file.write(bytes(header, encoding="ascii",errors='ignore'))
    file.write(bytes(testwebsitetendersaved, encoding="ascii",errors='ignore'))
    
    print(testwebsitetendersaved)
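    As a side note, building CSV rows by joining cell text with commas breaks as soon as a cell itself contains a comma or newline (as the Contact Details cells do). The stdlib csv module quotes such fields automatically; a minimal sketch with a hypothetical scraped row:

    ```python
    import csv
    import io

    # Hypothetical scraped row; "me, you" would corrupt a hand-built CSV line.
    rows = [["123456", "text 123", "https://example.com/downloads/linktofile.zip",
             "2 weeks", "me, you", "next week"]]

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Tender Number", "Description", "Documents Link",
                     "Closing Date", "Contact Details", "Briefing"])
    writer.writerows(rows)   # csv quotes the comma-containing field automatically
    csv_text = buf.getvalue()
    ```

    In the real script, `buf` would be replaced by `open("output.csv", "w", newline="")`.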
    

1 Answer:

Answer 0 (score: 0)

I hope this is what you want.

testwebsitetendersaved=""
# Table is formatted very badly: a span tag with tables nested within tables
testwebsite_container = page_soup.find("span", id="MainContent2_ctl00_lblContent").findAll("table")[1]

header="Tender Number, Description, Documents Link, Closing Date, Contact Details, Briefing"+"\n"
file = open(os.path.expanduser("output.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors='ignore'))

skiptrcnt=1  # skip the first tr block (the header row)
for i, record in enumerate(testwebsite_container.findAll('tr')):
    if skiptrcnt>i:
        continue
    doclinkurl=""  # stays empty when a row has no download link
    tnum = record('td')[0].text
    desc = record('td')[1].text
    doclink = record('td')[2].text
    alink = record('td')[2].find("a")
    if alink:
        doclinkurl=testwebsite+alink['href']
    closingdate = record('td')[3].text
    detail = record('td')[4].text
    detail = detail.replace('\n', '')
    brief = record('td')[5].text
    brief = brief.replace('\n', '')
    print(tnum, desc, doclink, doclinkurl, closingdate, detail, brief)
    testwebsitetendersaved="{},{},{},{},{},{},{}\n".format(tnum, desc, doclink, doclinkurl, closingdate, detail, brief)
    file.write(bytes(testwebsitetendersaved, encoding="ascii", errors='ignore'))
file.close()

My output is:

123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/downloads/linktofile.zip 2 weeks me next week
123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me next week
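One caveat not covered in the answer: `testwebsite + alink['href']` only works because the base URL has no trailing slash and the hrefs are root-relative. `urljoin` from the stdlib normalizes both cases; a small sketch:

```python
from urllib.parse import urljoin

base = "https://example.com"
# Plain concatenation would produce "example.com//downloads/..." if the
# base ended with "/"; urljoin handles either form the same way.
url1 = urljoin(base, "/downloads/linktofile.zip")
url2 = urljoin(base + "/", "/downloads/linktofile.zip")
```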