I have a peculiar problem that I thought I had solved, until I used a while loop to control the flow of the program.
Synopsis:
I have a flat file (CSV or text) containing some URLs I want to scrape. Using BeautifulSoup (which works), I append a new tag to each page's HTML and then save each scraped page under a new filename.
What I need:
I'm fairly certain this comes down to my not grasping the fundamentals; I'm still trying to wrap my head around them. My code is below.
What's wrong:
Using Python 3, the code actually runs; I stepped through it line by line in Jupyter with a series of print statements to watch what the while loop returned as it ran.
The problem is that only one file gets saved: the URL at the end of the file is the only one whose page survives. The files for the other URLs get overwritten.
How do I get each line to be scraped and saved under its own unique name before moving on to the next line? Am I misusing these constructs?
URL:
https://www.imgacademy.com/media/headline/img-academy-u19-girls-win-fysa-state-cup-u19-championship
Code:
import csv
import requests
from bs4 import BeautifulSoup as BS

filename = 'urls.csv'
with open(filename, 'r+') as file:
    while True:
        line = file.readline()
        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'
        headers = {'User-Agent':user_agent}
        response = requests.get(line, headers)
        print(response)
        soup = BS(response.content, 'html.parser')
        html = soup
        title = soup.find('title')
        meta = soup.new_tag('meta')
        meta['name'] = "robots"
        meta['content'] = "noindex, nofollow"
        title.insert_after(meta)
        with open('{}'".txt".format("line"), 'w', encoding='utf-8') as f:
            outf.write(str(html))
        if (line) == 0:
            break
Answer 0 (score: 0)

The filename expression is the culprit: '{}'".txt".format("line") passes the quoted word "line", not the variable, so every pass through the loop writes the same file, line.txt, and only the last URL survives. Two smaller problems: outf.write should be f.write to match the handle you opened, and if (line) == 0 never breaks the loop, because readline() returns the empty string '' at end of file, not 0.
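You can confirm the filename problem in a REPL; Python concatenates adjacent string literals before .format runs, and the quoted "line" is just the word itself:

>>> '{}' ".txt".format("line")
'line.txt'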
Iterating over the file directly hands you each line in turn, and building the output name from the line itself gives every URL its own file:

filename = 'urls.csv'
with open(filename, 'r+') as file:
    for index, line in enumerate(file):
        line = line.strip()  # drop the trailing newline before using the URL
        print(line)
        user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'
        headers = {'User-Agent': user_agent}
        print(headers)
        response = requests.get(line, headers=headers)  # headers must be passed by keyword
        print(response)
        soup = BS(response.content, 'html.parser')
        html = soup
        title = soup.find('title')
        meta = soup.new_tag('meta')
        meta['name'] = "robots"
        meta['content'] = "noindex, nofollow"
        title.insert_after(meta)
        # slice off the shared 42-character prefix so only the page slug names the file;
        # this assumes every URL starts with https://www.imgacademy.com/media/headline/
        with open('{}.html'.format(line[42:]), 'w', encoding='utf-8') as f:
            f.write(str(html))
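If the fixed-offset slice feels brittle (it silently breaks the moment a URL doesn't share that exact 42-character prefix), here is a minimal alternative sketch, not from the original post: it derives each filename from the last path segment of the URL with urllib.parse.urlsplit, so it works for any host. The scrape_urls name, the blank-line guard, and the one-URL-per-line assumption are mine.

import requests
from urllib.parse import urlsplit
from bs4 import BeautifulSoup as BS

def scrape_urls(filename):
    """Scrape every URL listed in the file, tag it noindex, and save it."""
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0'
    headers = {'User-Agent': user_agent}
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            url = line.strip()
            if not url:  # skip blank lines
                continue
            response = requests.get(url, headers=headers)
            soup = BS(response.content, 'html.parser')
            # insert <meta name="robots" content="noindex, nofollow"> after <title>
            meta = soup.new_tag('meta')
            meta['name'] = 'robots'
            meta['content'] = 'noindex, nofollow'
            soup.find('title').insert_after(meta)
            # the last path segment of the URL becomes the filename
            slug = urlsplit(url).path.rstrip('/').rsplit('/', 1)[-1]
            with open('{}.html'.format(slug), 'w', encoding='utf-8') as f:
                f.write(str(soup))

scrape_urls('urls.csv')

For the example URL above, this writes img-academy-u19-girls-win-fysa-state-cup-u19-championship.html.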