Question

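I am trying to extract URLs from a text file that contains a website's source code. I want to get the links inside the href attributes, and I wrote some code borrowed from Stack Overflow, but it doesn't work.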

with open(sourcecode.txt) as f:
    urls = f.readlines()

urls = ([s.strip('\n') for s in urls ]) 

print(url)

Answer 1

You can use a regular expression for this.

import re

with open('sourcecode.txt') as f:
    text = f.read()

href_regex = r'href=[\'"]?([^\'" >]+)'
urls = re.findall(href_regex, text)

print(urls)

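You may run into an error like 'sourcecode' is not defined. That happens because the argument you pass to open() must be a string: without quotes, Python treats sourcecode as a variable name. Quote the filename as in the code above.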
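To see what this pattern captures, here is a minimal sketch that runs the same regex on an inline string instead of a file. The sample markup is made up for illustration and only stands in for the contents of sourcecode.txt.

```python
import re

# Hypothetical sample markup, standing in for the contents of sourcecode.txt.
sample = '''<a href="https://example.com/page">one</a>
<a href='/relative/path'>two</a>
<a href=unquoted.html>three</a>'''

# Same pattern as above: an optional opening quote, then everything up to
# the next quote, space, or '>'.
href_regex = r'href=[\'"]?([^\'" >]+)'
urls = re.findall(href_regex, sample)

print(urls)
```

Note that the pattern returns whatever sits inside href, so relative paths come back alongside absolute URLs.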

Answer 2

With a regular expression you can extract every URL from the text file without looping over it line by line:

import re

with open('/home/username/Downloads/Stack_Overflow.html') as f:
    text = f.read()

# Match double-quoted http:// or https:// URLs; the single group captures the URL.
links = re.findall(r'"(https?://.*?)"', text)
for url in links:
    print(url)
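One detail worth knowing here: re.findall returns tuples when the pattern contains more than one capture group, and plain strings when it contains exactly one. The sketch below compares a two-group pattern such as '"((http)s?://.*?)"' with a single-group variant; the sample string is invented for illustration.

```python
import re

# Hypothetical sample text standing in for the HTML file's contents.
sample = '<a href="https://example.com/a">a</a> <img src="http://example.org/b.png">'

# Two groups: findall returns one tuple per match, (full URL, 'http').
two_groups = re.findall(r'"((http)s?://.*?)"', sample)

# One group: findall returns the URLs as plain strings.
one_group = re.findall(r'"(https?://.*?)"', sample)

print(two_groups)
print(one_group)
```

With the two-group form you need to index each result (the first element is the URL); with the one-group form the list can be used directly.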
