How do I correctly format the appended output I need?

Date: 2019-05-09 15:03:58

Tags: python-2.x

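I am writing new code but am having trouble getting the output I need. The code reads an HTML file and looks for <a> tags. It only outputs the URLs, so I add extra markup to turn each one into a complete link. I am trying to insert the URL into the string twice: once as the href and once as the link text.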

####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')
links = []
for link in soup2.findAll('a', attrs={'href':   re.compile("^https://")}):
    links.append('<a href="'+link.get('href')+'">'"{link}"'</a><br>')

time.sleep(.1)

with  open("page-2.html", 'w') as html:
    html.write('{links}\n'.format(links=links))

2 answers:

Answer 0 (score: 0)

This gets me close to what I want, but not quite. I would rather see the link written as "https://whatever.com/text/text/" than as "whatever.com/text/text".

####### Parse for <a> tags and save ############
with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')
links = []
for link in soup2.findAll('a', attrs={'href':   re.compile("^https://")}):
    links.append('{0}</a><br>'.format(link,link))

with  open("page-2.html", 'w') as html:
    html.write('{links}\n'.format(links=links))
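
A possible explanation for the result above, offered as a sketch rather than a confirmed diagnosis: passing the Tag object itself to format() inserts the tag's full markup, so the page keeps showing the original link text. Formatting the href attribute instead puts the URL in both places. The sample markup is hypothetical, and 'html.parser' is swapped in for 'lxml' only so the snippet has one fewer dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line page standing in for page1.html.
sample = '<a href="https://whatever.com/text/text/">whatever.com/text/text</a>'
link = BeautifulSoup(sample, 'html.parser').find('a')

# Formatting the Tag inserts its whole markup, including the old link text,
# and the extra '</a>' leaves a stray closing tag behind.
from_tag = '{0}</a><br>'.format(link)

# Formatting the href attribute uses the URL as both target and visible text.
href = link.get('href')
from_href = '<a href="{0}">{0}</a><br>'.format(href)

print(from_tag)
print(from_href)
```

Reusing the positional field {0} twice avoids passing the same argument two times.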

Answer 1 (score: 0)

This should give you the HTML output file you want:

import re
from bs4 import BeautifulSoup

# Parse the source page.
with open("page1.html", 'r') as htmlb:
    soup2 = BeautifulSoup(htmlb, 'lxml')

# Write one anchor per https link, using the URL as both the href and the link text.
with open("page2.html", 'w') as h:
    for link in soup2.find_all('a', href=re.compile("^https://")):
        href = link.get('href')
        h.write("<a href=\"{0}\">{0}</a><br>".format(href))
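
If you would rather keep the list-based approach from the question, one thing to watch for is that formatting the list itself writes its Python representation (brackets, quotes, and commas) instead of the markup. A minimal sketch of the difference, where the hrefs list is a hypothetical stand-in for the values returned by link.get('href'):

```python
# Hypothetical hrefs standing in for the values returned by link.get('href').
hrefs = ["https://whatever.com/text/text/", "https://example.com/"]

# Use the same positional field twice so the URL is both target and text.
links = ['<a href="{0}">{0}</a><br>'.format(href) for href in hrefs]

# Formatting the list object writes its repr: brackets, quotes, and commas.
as_repr = '{links}\n'.format(links=links)

# Joining the items writes one anchor per line instead.
as_lines = '\n'.join(links) + '\n'

print(as_repr)
print(as_lines)
```

Writing as_lines to the output file should then produce plain anchors with no list punctuation around them.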