The long URLs in myfile.txt need to be shortened. This is the content of myfile.txt:
26-04-2018 | Publication 2018, 88936 , https://search.publications.com/pgm-2018-88936.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=0&sorttype=1&sortorder=4
19-04-2018 | Publication 2018, 8168 , https://search.publications.com/pgm-2018-8168.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=1&sorttype=1&sortorder=4
26-03-2018 | Publication 2018, 611724 , https://search.publications.com/pgm-2018-611724.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=2&sorttype=1&sortorder=4
01-02-2017 | Publication 2017, 1452026 , https://search.publications.com/pgm-2017-1452026.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=3&sorttype=1&sortorder=4
I have the following code in Python 2.7:
import re
with open('myfile.txt', 'r+') as myfile:
    data = myfile.read()
    url = re.findall(r'[^https.+?]', data)
    urlshort = re.findall(r'[^https.+html?]', data)
    for url in data:
        myfile.write(url.replace(url, urlshort, data))
myfile.close()
The output is:
Traceback (most recent call last):
  File "/pyscripts/data.py", line 9, in <module>
    myfile.write(url.replace(url, urlshort, data))
TypeError: an integer is required
How can I do this in the file?
Answer 0 (score: 1)
Use re.sub with the pattern (https.*html).* and replace each match with its first capture group:
import re
s = """
26-04-2018 | Publication 2018, 88936 , https://search.publications.com/pgm-2018-88936.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=0&sorttype=1&sortorder=4
19-04-2018 | Publication 2018, 8168 , https://search.publications.com/pgm-2018-8168.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=1&sorttype=1&sortorder=4
26-03-2018 | Publication 2018, 611724 , https://search.publications.com/pgm-2018-611724.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=2&sorttype=1&sortorder=4
01-02-2017 | Publication 2017, 1452026 , https://search.publications.com/pgm-2017-1452026.html?search=%3fzkt%3dextended%26pst%3dPublication%26vrt%3d%26zkd%3dInTitle%26dpr%3dAll%26spd%3d20180529%26epd%3d20180529%26sdt%3dDatePublication%26pubId%3d%26pnr%3d1%26rpp%3d10&resultInx=3&sorttype=1&sortorder=4
"""
print(re.sub(r'(https.*html).*', r'\1', s))
Output:
26-04-2018 | Publication 2018, 88936 , https://search.publications.com/pgm-2018-88936.html
19-04-2018 | Publication 2018, 8168 , https://search.publications.com/pgm-2018-8168.html
26-03-2018 | Publication 2018, 611724 , https://search.publications.com/pgm-2018-611724.html
01-02-2017 | Publication 2017, 1452026 , https://search.publications.com/pgm-2017-1452026.html
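This way you can write the entire result of re.sub back to your file, instead of trying to replace the URLs one at a time as your current code does. The TypeError in your code comes from str.replace, whose optional third argument is a maximum replacement count and must be an integer, not a string.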
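As a minimal sketch of that last step, the file could be read once, passed through re.sub, and then rewritten. The helper names shorten_urls and shorten_file are hypothetical and only used for illustration; the pattern is the one from the answer above, and it assumes each line holds one URL whose query string does not itself contain "html".

```python
import re

# Hypothetical helper: applies the pattern from the answer above.
# Assumes one URL per line and no "html" inside the query string,
# because the greedy ".*" runs to the last "html" on the line.
def shorten_urls(text):
    return re.sub(r'(https.*html).*', r'\1', text)

# Hypothetical helper: reads the whole file, then reopens it in write
# mode so the shortened text replaces the old content instead of being
# appended after it.
def shorten_file(path):
    with open(path, 'r') as f:
        data = f.read()
    with open(path, 'w') as f:
        f.write(shorten_urls(data))
```

Calling shorten_file('myfile.txt') would then rewrite the file in place. Opening the file a second time in 'w' mode avoids the seek and truncate bookkeeping that 'r+' would need, and the with blocks close the file, so no explicit close() call is required.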