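I have several links in a file. I want to go through each link's web page (its source code), take line 443 from that page (which contains specific details like those shown below), and write it to another file together with the corresponding link.
Input file: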
http://abc/app/application_144733409001
http://abc/app/application_144733409001
http://abc/app/application_144733409000
http://abc/app/application_144733409003
http://abc/app/application_144733409005
http://abc/app/application_144733409008
http://abc/app/application_144733409009
http://abc/app/application_144733409006
Expected output file:
http://abc/app/application_144733409001 31098 MB-seconds,3 vcore-seconds
http://abc/app/application_144733409001 31098 MB-seconds,2 vcore-seconds
http://abc/app/application_144733409000 31098 MB-seconds,3 vcore-seconds
http://abc/app/application_144733409003 31098 MB-seconds,5 vcore-seconds
http://abc/app/application_144733409005 31798 MB-seconds,7 vcore-seconds
http://abc/app/application_144733409008 31018 MB-seconds,3 vcore-seconds
http://abc/app/application_144733409009 31097 MB-seconds,3 vcore-seconds
http://abc/app/application_144733409006 31094 MB-seconds,3 vcore-seconds
Code:
import urllib

# Read the URLs, one per line, stripping surrounding whitespace
Lines = [line.strip() for line in open('input.txt', 'r').readlines()]
with open('/home/try/intermediate.txt', 'w') as out_file:
    for line in Lines:
        # Python 2: urllib.urlopen fetches the page source
        page = urllib.urlopen(line).read()
        #print page
I don't know how to proceed. Please help me. Thanks in advance.
Answer 0: (score: 1)
Use re to check each line for the matching string:
https://regex101.com/r/nU3xW1/1
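import re

for line in Lines:
    remoteLine = urllib.urlopen(line)
    for l in remoteLine:
        # re.search finds the pattern anywhere in the line (re.match only
        # matches at the start); \s* allows an optional space after the comma
        matchObj = re.search(r'(\d+) MB-seconds,\s*(\d+) vcore-seconds', l)
        if matchObj:
            print "matchObj.group() : ", matchObj.group()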
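
To also write the result next to each link, as the question asks, the loop can be wrapped up roughly as follows. This is only a sketch, written for Python 3 (so `urllib.request.urlopen` instead of Python 2's `urllib.urlopen`). The helper names `fetch_lines`, `find_usage`, `write_report` and `main` are made up for illustration, and UTF-8 page encoding is an assumption. Passing `line_no=443` checks only that line, as in the question; leaving it out scans the whole page.

```python
import re
import urllib.request

# Pattern based on the expected output in the question;
# \s* tolerates an optional space after the comma.
USAGE_RE = re.compile(r'(\d+) MB-seconds,\s*(\d+) vcore-seconds')


def fetch_lines(url):
    """Download a page and return its source as a list of text lines.

    Assumes the page is UTF-8; undecodable bytes are replaced.
    """
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace').splitlines()


def find_usage(page_lines, line_no=None):
    """Return 'N MB-seconds,M vcore-seconds' found in the page, or None.

    If line_no is given (1-based, e.g. 443), only that line is checked;
    otherwise every line is scanned and the first match wins.
    """
    if line_no is not None:
        candidates = page_lines[line_no - 1:line_no]
    else:
        candidates = page_lines
    for text in candidates:
        match = USAGE_RE.search(text)
        if match:
            return '%s MB-seconds,%s vcore-seconds' % match.groups()
    return None


def write_report(urls, out_file, fetch=fetch_lines, line_no=None):
    """Write one 'url usage' line for every URL whose page has a match."""
    for url in urls:
        usage = find_usage(fetch(url), line_no)
        if usage is not None:
            out_file.write('%s %s\n' % (url, usage))


def main(input_path, output_path):
    """Read URLs (one per line) from input_path and write the report."""
    with open(input_path) as in_file:
        urls = [line.strip() for line in in_file if line.strip()]
    with open(output_path, 'w') as out_file:
        write_report(urls, out_file)
```

With the paths from the question this would be called as `main('input.txt', '/home/try/intermediate.txt')`. URLs whose page has no matching line are skipped silently here; whether to log them or write a placeholder instead is a design choice left open.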