BeautifulSoup: remove the br in a table and add the full link

Time: 2018-03-22 11:22:06

Tags: python beautifulsoup

I am finally getting closer to extracting the table from a particular website, but my problem is that I can't seem to figure out how to

  1. show the full link to the downloadable file
  2. remove the br in specific rows

The HTML code is below:

    <table border="1" cellpadding="5" cellspacing="0">
    <tr class="bg">
    <td><strong>Reference</strong></td>
    <td style="width:100px"><strong>Description</strong></td>
    <td><strong>Download Documents</strong></td>
    <td style="width:50px"><strong>Closing Date</strong></td>
    <td style="width:50px"><strong>Contact Details</strong></td>
    <td><strong>Briefing</strong></td>
    <!--<td><strong>PUBLISHED</strong></td>-->
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td>
    <td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td>
    <td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td><td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me<br />
      you</td><td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me</td>
    <td>next week</td>
    </tr>
    <tr>
    <td>123456</td>
    <td>text 123</td>
    <td><a href="/downloads/linktofile.zip" target="_blank">Documents click here </a></td>
    <td>2 weeks</td>
    <td>me</td>
    <td>next week</td>
    </tr>
    </table>
    

    I want to remove the br in the Contact Details cells, and show the full link instead of "Documents click here".

    Note that this is a sample table, reconstructed from the original project.
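    Both issues can be handled at the cell level in BeautifulSoup. A minimal sketch against the sample HTML above (standalone snippets, not the full scraper; the base URL is illustrative):

    ```python
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    # A <br/> shows up as a line break in .text; get_text() with a
    # separator joins the text pieces with a space instead.
    cell = BeautifulSoup('<td>me<br />\n  you</td>', "html.parser").td
    contact = cell.get_text(" ", strip=True)   # "me you"

    # The full link lives in the <a> tag's href attribute; urljoin
    # turns the relative path into an absolute URL.
    link_cell = BeautifulSoup(
        '<td><a href="/downloads/linktofile.zip">Documents click here </a></td>',
        "html.parser")
    full_url = urljoin("https://example.com", link_cell.a["href"])
    ```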

    My Python code runs fine; the problem is just how the content after the new link gets added to the whole output.csv.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import csv
    import requests
    import os
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    from bs4 import Tag 
    
    testwebsite = 'https://example.com'
    
    uClient = uReq(testwebsite)
    page_html = uClient.read()
    uClient.close()
    
    page_soup = soup(page_html, "html.parser")
    
    testwebsitetendersaved=""
    # Table is formatted very badly: a span tag with tables nested within tables
    testwebsite_container = page_soup.find("span", id="MainContent2_ctl00_lblContent").findAll("table")[1]
    
    for record in testwebsite_container.findAll('tr'):
        testwebsitetender=""
        for data in record.findAll('td'):
            testwebsitetender=testwebsitetender+","+data.text
        testwebsitetendersaved = testwebsitetendersaved + "\n" + testwebsitetender[1:]
    
    
    header="Tender Number, Description, Documents Link, Closing Date, Contact Details, Briefing"+"\n"
    file = open(os.path.expanduser("output.csv"), "wb")
    file.write(bytes(header, encoding="ascii",errors='ignore'))
    file.write(bytes(testwebsitetendersaved, encoding="ascii",errors='ignore'))
    
    print(testwebsitetendersaved)
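    As a side note, building CSV rows by joining cell text with commas breaks as soon as a cell itself contains a comma or newline (as the Contact Details cells do). The stdlib csv module quotes such fields automatically; a minimal sketch with a hypothetical scraped row:

    ```python
    import csv
    import io

    # Hypothetical scraped row; "me, you" would corrupt a hand-built CSV line.
    rows = [["123456", "text 123", "https://example.com/downloads/linktofile.zip",
             "2 weeks", "me, you", "next week"]]

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Tender Number", "Description", "Documents Link",
                     "Closing Date", "Contact Details", "Briefing"])
    writer.writerows(rows)   # csv quotes the comma-containing field automatically
    csv_text = buf.getvalue()
    ```

    In the real script, `buf` would be replaced by `open("output.csv", "w", newline="")`.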
    

1 Answer:

Answer 0 (score: 0)

I hope this is what you want.

testwebsitetendersaved=""
# Table is formatted very badly: a span tag with tables nested within tables
testwebsite_container = page_soup.find("span", id="MainContent2_ctl00_lblContent").findAll("table")[1]

header="Tender Number, Description, Documents Link, Closing Date, Contact Details, Briefing"+"\n"
file = open(os.path.expanduser("output.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors='ignore'))

skiptrcnt=1  # skip the first tr block (the header row)
for i, record in enumerate(testwebsite_container.findAll('tr')):
    if skiptrcnt>i:
        continue
    doclinkurl=""  # stays empty when a row has no download link
    tnum = record('td')[0].text
    desc = record('td')[1].text
    doclink = record('td')[2].text
    alink = record('td')[2].find("a")
    if alink:
        doclinkurl=testwebsite+alink['href']
    closingdate = record('td')[3].text
    detail = record('td')[4].text
    detail = detail.replace('\n', '')
    brief = record('td')[5].text
    brief = brief.replace('\n', '')
    print(tnum, desc, doclink, doclinkurl, closingdate, detail, brief)
    testwebsitetendersaved="{},{},{},{},{},{},{}\n".format(tnum, desc, doclink, doclinkurl, closingdate, detail, brief)
    file.write(bytes(testwebsitetendersaved, encoding="ascii", errors='ignore'))
file.close()

My output is:

123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me  you next week
123456 text 123 Documents click here  https://example.com/downloads/downloads/linktofile.zip 2 weeks me next week
123456 text 123 Documents click here  https://example.com/downloads/linktofile.zip 2 weeks me next week
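One caveat not covered in the answer: `testwebsite + alink['href']` only works because the base URL has no trailing slash and the hrefs are root-relative. `urljoin` from the stdlib normalizes both cases; a small sketch:

```python
from urllib.parse import urljoin

base = "https://example.com"
# Plain concatenation would produce "example.com//downloads/..." if the
# base ended with "/"; urljoin handles either form the same way.
url1 = urljoin(base, "/downloads/linktofile.zip")
url2 = urljoin(base + "/", "/downloads/linktofile.zip")
```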