When I run this code in Jupyter and on a virtual machine, it runs fine. But when I run it on AWS, it always fails with "list index out of range". How can I fix this? Thanks!
Code:
from datetime import datetime, timedelta
from time import strptime
import requests
from lxml import html
import re
import time
import os
import sys
from pandas import DataFrame
import numpy as np
import pandas as pd
import sqlalchemy as sa
from sqlalchemy import create_engine
from sqlalchemy.sql import text as sa_text
import pymysql
date_list = []
for i in range(0, 2):
    duration = datetime.today() - timedelta(days=i)
    forma = duration.strftime("%m-%d")
    date_list.append(forma)
print(date_list)

def curl_topic_url_hot():
    url = 'https://www.xxxx.com/topiclist.php?f=397&p=1'
    headers = {'User-Agent': 'aaaaaaaaaaaaaaa'}
    response = requests.get(url, headers=headers)
    tree = html.fromstring(response.text)
    output = tree.xpath("//div[@class='pagination']/a[7]")
    maxPage = int(output[0].text)
    print('There are', maxPage, 'pages.')
    return [maxPage]

topic_url_hot = curl_topic_url_hot()
AWS log:
['02-12', '02-11']
Traceback (most recent call last):
  File "/home/hadoop/ellen_crawl/test0211_mobile.py", line 167, in <module>
    topic_url_hot = curl_topic_url_hot()
  File "/home/hadoop/ellen_crawl/test0211_mobile.py", line 48, in curl_topic_url_hot
    maxPage = int(output[0].text)
IndexError: list index out of range
When I run this code in Jupyter, it prints:
['02-12', '02-11']
There are 818 pages.
Answer 0 (score: 3)
You can use
if len(output) > 0:
    maxPage = int(output[0].text)
or
try:
    maxPage = int(output[0].text)
except IndexError:
    pass  # do sth. with the error message
Either way, your original code does not produce the result you think it does.
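Answer 1 (score: 3)
You can get rid of the error either by testing first and only indexing into the result when it is non-empty, or by catching the error with try/except:
if len(output) > 0:
    maxPage = int(output[0].text)
or
try:
    maxPage = int(output[0].text)
except IndexError as e:
    pass  # log it or do smth with it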
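Your actual problem lies elsewhere:
Your request does not return what you think it does. Perhaps AWS does not support something you need, so the request is blocked and returns nothing? Perhaps there is a typo in your URL?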
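A minimal sketch of how the lookup could report what AWS actually received instead of failing with a bare IndexError. The helper name max_page_from_nodes is hypothetical, and the SimpleNamespace objects only stand in for the lxml elements returned by tree.xpath(...) so the example runs offline; in the crawler you would pass output, response.status_code and response.text.

```python
from types import SimpleNamespace


def max_page_from_nodes(nodes, status_code, body):
    """Return the page count from XPath results, or raise with diagnostics.

    nodes       -- the list returned by tree.xpath(...)
    status_code -- response.status_code from requests
    body        -- response.text from requests
    """
    if status_code != 200:
        # The server answered, but not with the page we expected.
        raise RuntimeError(f"unexpected HTTP status {status_code}: {body[:200]!r}")
    if not nodes:
        # The page loaded, but the pagination link is not where we looked.
        raise RuntimeError(f"pagination link not found; body starts with {body[:200]!r}")
    return int(nodes[0].text)


# Stand-ins for lxml elements (assumption, for an offline demonstration only).
ok_nodes = [SimpleNamespace(text="818")]
print(max_page_from_nodes(ok_nodes, 200, "<html>...</html>"))

try:
    max_page_from_nodes([], 403, "Access denied")
except RuntimeError as err:
    print(err)
```

Raising with the status code and the start of the body makes the AWS log show why the XPath matched nothing, which a silent pass would hide.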
Some ideas: inspect tree, and check response for an error code.
Answer 2 (score: -1)
When your AWS instance requests this site, it gets back an error HTML page; please check it. https://www.xxxx.com/topiclist.php?f=397&p=1
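One way to check the returned HTML is to write the body to a file on the AWS machine and compare it with what the same request returns locally. This is only a sketch: the helper dump_body and the file name topiclist_debug.html are hypothetical, and a literal string stands in for response.text so the example runs without network access.

```python
import tempfile
from pathlib import Path


def dump_body(body, directory, name="topiclist_debug.html"):
    """Write the response body to disk and return the path of the file."""
    path = Path(directory) / name
    path.write_text(body, encoding="utf-8")
    return path


# In the crawler this would be dump_body(response.text, ".").
with tempfile.TemporaryDirectory() as tmp:
    saved = dump_body("<html><body>Access denied</body></html>", tmp)
    print(saved.name, saved.stat().st_size, "bytes")
```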