My for loop has a problem: I can't get it to go back to the first for statement:
def output(query,page,max_page):
    """
    Parameters:
        query: a string
        max_page: maximum pages to be crawled per day, integer
    Returns:
        List of news dictionaries in a list: [[{...},{...}..],[{...},]]
    """
    news_dicts_all = []
    news_dicts = []
    # best to concatenate urls here
    date_range = get_dates()
    for date in get_dates():
        s_date = date.replace(".","")
        while page < max_page:
            url = "https://search.naver.com/search.naver?where=news&query=" + query + "&sort=0&ds=" + date + "&de=" + date + "&nso=so%3Ar%2Cp%3Afrom" + s_date + "to" + s_date + "%2Ca%3A&start=" + str(page)
            header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
            req = requests.get(url,headers=header)
            cont = req.content
            soup = BeautifulSoup(cont, 'html.parser')
            for urls in soup.select("._sp_each_url"):
                try:
                    if urls["href"].startswith("https://news.naver.com"):
                        news_detail = get_news(urls["href"])
                        adict = dict()
                        adict["title"] = news_detail[0]
                        adict["date"] = news_detail[1]
                        adict["company"] = news_detail[3]
                        adict["text"] = news_detail[2]
                        news_dicts.append(adict)
                except Exception as e:
                    continue
            page += 10
        news_dicts_all.append(news_dicts)
    return news_dicts_all
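I have stepped through the code, and it seems that page += 10 sends execution back to the while part, but once page reaches max_page the loop over dates does not continue the way I expect.
What I essentially want is for the code to go back to for date in get_dates() after page reaches max_page, but I don't know how to do that.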
Answer 0 (score: 1)
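You never reset page, so by the time the for loop moves on to the next date, page < max_page is already false (page has reached or passed max_page) and the while loop is skipped entirely.
You will need to do something like renaming the page parameter to start_page and then setting page = start_page at the start of each iteration of the for loop.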
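
The suggestion above can be sketched as follows. This is a minimal, offline illustration rather than the asker's actual crawler: crawl, fetch_page, fake_fetch and step are hypothetical names introduced here, and fetch_page stands in for the requests/BeautifulSoup part of the original function. It also builds a fresh list for each date, on the assumption that the docstring's "list of lists" means one inner list per date.

```python
def crawl(dates, start_page, max_page, fetch_page, step=10):
    """Collect results per date, restarting pagination for every date."""
    results_all = []
    for date in dates:
        page = start_page   # reset the counter for each new date
        results = []        # fresh list per date (assumed intent)
        while page < max_page:
            results.extend(fetch_page(date, page))
            page += step
        results_all.append(results)
    return results_all


# Hypothetical stand-in for the real request/parse step, so the sketch runs offline.
def fake_fetch(date, page):
    return [f"{date}:{page}"]


if __name__ == "__main__":
    # Every date is crawled from start_page again: pages 1, 11 and 21 for each one.
    print(crawl(["2019.01.01", "2019.01.02"], 1, 30, fake_fetch))
```

Because page is assigned inside the for loop, each date starts again from start_page instead of inheriting the value left over from the previous date.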