I asked a few questions here, and someone gave me this code. But I need help, because it only returns one result from my sites.txt.
Crawler.py
import urllib.request
import re
regex = "<title>(.+?)</title>"
pattern = re.compile(regex)
txtfl = open('websites.txt')
webpgsinfile = txtfl.readlines()
urls = webpgsinfile
htmlfile = urllib.request.urlopen(urls[i])
htmltext = htmlfile.read().decode('utf8')
titles = re.findall(pattern,htmltext)
if len(titles) > 0:
    print(titles[0])
i+=1
sites.txt
http://youtube.com
http://bigsolutions.com.br
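Answer 0 (score: 0)
I'm still a Python 2 programmer, so please forgive any mistakes. Also note that this code is untested; it is only meant to give you an idea of what you need to do.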
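import urllib.request
import re

regex = "<title>(.+?)</title>"
pattern = re.compile(regex)

# Read one URL per line, dropping the trailing newline and any blank lines
urls = [line.strip() for line in open('websites.txt') if line.strip()]
titles = []
for url in urls:
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read().decode('utf8')
    # findall returns a list with every match found on this page
    titles.append(re.findall(pattern, htmltext))
print(titles)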
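This creates the titles array you want, then iterates over your URLs and appends the titles found on each page to it. Because re.findall returns a list, each entry in titles is itself a list of matches. I don't see how the original code ran at all, but it appears to be missing a loop.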
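To see what that loop collects without touching the network, here is a minimal sketch that runs the same pattern over two made-up HTML strings. The `fake_file` contents, the `pages` dict, and the `example` URLs are hypothetical stand-ins for websites.txt and the downloaded pages; they are not from the question. It also shows that `readlines()` keeps the trailing newline on every URL, which is why the lines are stripped first.

```python
import io
import re

pattern = re.compile("<title>(.+?)</title>")

# Hypothetical stand-in for websites.txt (one URL per line)
fake_file = io.StringIO("http://a.example\nhttp://b.example\n")
raw_urls = fake_file.readlines()            # each entry still ends with "\n"
urls = [u.strip() for u in raw_urls if u.strip()]

# Hypothetical stand-in for the HTML that urlopen(...).read().decode() would return
pages = {
    "http://a.example": "<html><head><title>Page A</title></head></html>",
    "http://b.example": "<html><head></head><body>no title here</body></html>",
}

titles = []
for url in urls:
    # findall returns a list of matches, so titles becomes a list of lists
    titles.append(pattern.findall(pages[url]))

# Keep only the first match per page, or '' when the page had no title
first_titles = [found[0] if found else "" for found in titles]

print(raw_urls)
print(titles)
print(first_titles)
```

If only one title per page is wanted, building `first_titles` this way avoids printing nested lists.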
Answer 1 (score: 0)
import re
from urllib.request import urlopen

def get_page(url, encoding='utf-8'):
    return urlopen(url).read().decode(encoding, errors='ignore')

def get_title(txt, reg=re.compile('<title>(.*)</title>', re.IGNORECASE | re.DOTALL)):
    match = reg.search(txt)
    if match is None:
        return ''
    else:
        return match.group(1).strip()

def main():
    with open('websites.txt') as inf:
        urls = [line.strip() for line in inf]
    titles = [get_title(get_page(url)) for url in urls if url]
    print(titles)

if __name__=="__main__":
    main()
Result
["LimeCD - Lime's Code Library", 'YouTube', 'Big Solutions - Aqui nós pensamos grande!']
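The two answers use slightly different patterns, and a small offline sketch may help show what the difference means in practice. The HTML strings below are made up for illustration, and the `lazy` pattern is a variant that appears in neither answer. The sketch suggests that the `re.IGNORECASE | re.DOTALL` flags let the pattern match an uppercase, multi-line `<TITLE>` tag, while the greedy `(.*)` can run past the first closing tag when a page contains more than one title element.

```python
import re

# Pattern from the first answer: case-sensitive, "." does not match newlines
simple = re.compile("<title>(.+?)</title>")
# Pattern from the second answer: case-insensitive, "." also matches newlines
flagged = re.compile("<title>(.*)</title>", re.IGNORECASE | re.DOTALL)
# Hypothetical variant (not in either answer): same flags, but non-greedy
lazy = re.compile("<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# Made-up page with an uppercase title spread over several lines
html_upper = "<HTML><HEAD><TITLE>\n  My Page\n</TITLE></HEAD></HTML>"
simple_match = simple.search(html_upper)                   # no match at all
flagged_title = flagged.search(html_upper).group(1).strip()

# Made-up page with a second title element (for example inside an SVG)
html_two = "<title>One</title><svg><title>Two</title></svg>"
greedy_title = flagged.search(html_two).group(1)           # runs to the last </title>
lazy_title = lazy.search(html_two).group(1)                # stops at the first </title>

print(simple_match, repr(flagged_title))
print(repr(greedy_title), repr(lazy_title))
```

For anything beyond quick scripts, an HTML parser is usually a more robust choice than a regular expression, but the regex approach matches what both answers do.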