I want to scrape data from a website that is organized by province. To speed things up, I tried running two or more Python scripts at the same time, one per province; the only difference between the scripts is the set of URLs they scrape. They both work fine for the first 30 seconds to a minute, but then each script raises the error below, and they always fail at the same moment:
Traceback (most recent call last):
  File "EOLGrades-A.py", line 157, in <module>
    dealEachCollege(url_2,tableName2,cNameList[i])
  File "EOLGrades-A.py", line 58, in dealEachCollege
    insertData(getData(sp),tableName,collegeName)
  File "EOLGrades-A.py", line 33, in getData
    if FAtd[x+j].text == '--' or FAtd[x+j].text ==' ':
IndexError: list index out of range

Traceback (most recent call last):
  File "EOLGrades-B.py", line 157, in <module>
    dealEachCollege(url_2,tableName2,cNameList[i])
  File "EOLGrades-B.py", line 58, in dealEachCollege
    insertData(getData(sp),tableName,collegeName)
  File "EOLGrades-B.py", line 33, in getData
    if FAtd[x+j].text == '--' or FAtd[x+j].text ==' ':
IndexError: list index out of range
My getData method:
def getData(soup, count):
    FAtr = soup.find_all(name='tr')
    FAtd = soup.find_all(name='td')
    m = [([0] * 6) for i in range(count)]   # count rows x 6 columns
    x = 0
    if len(FAtd) <= 1:
        print("no data")
        return ['0']
    else:
        for i in range(count):
            for j in range(6):
                if FAtd[x+j].text == '--' or FAtd[x+j].text == ' ':
                    m[i][j] = None
                else:
                    content = FAtd[x+j].text.strip()
                    m[i][j] = content
            x = x + 6
        return m
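For what it's worth, the IndexError means that on that particular response FAtd ends up with fewer than count * 6 cells. A defensive variant could look roughly like this (just a sketch; getDataSafe is a made-up name, and it keeps the same ['0'] "no data" signal as above):

from bs4 import BeautifulSoup   # same parser objects as in my script

def getDataSafe(soup, count):
    FAtd = soup.find_all(name='td')
    expected = count * 6                      # 6 columns per data row
    if len(FAtd) < expected:
        # the page came back with fewer cells than the row count implies
        print("expected %d <td> cells, got %d" % (expected, len(FAtd)))
        return ['0']
    m = [([0] * 6) for _ in range(count)]
    x = 0
    for i in range(count):
        for j in range(6):
            text = FAtd[x + j].text.strip()
            m[i][j] = None if text in ('--', '') else text
        x = x + 6
    return m

That would avoid the crash, but it wouldn't explain why the mismatch only shows up when two scripts run at once.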
The count argument comes from my getCount method:
def getCount(soup):
    FAtr = soup.find_all(name='tr')
    count = len(FAtr) - 1   # number of data rows, excluding the header row
    return count
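Since count is derived from the <tr> rows while getData indexes into the <td> cells, one thing I could print on the page that crashes is whether the two counts still agree (sketch only; checkCounts is a made-up helper that takes the same soup object):

def checkCounts(soup):
    # compare the row count (from <tr>) with the cell count (from <td>)
    rows = len(soup.find_all(name='tr')) - 1
    cells = len(soup.find_all(name='td'))
    print("rows: %d, cells: %d, expected cells: %d" % (rows, cells, rows * 6))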
My dealEachCollege method:
def dealEachCollege(URL, tableName, collegeName):
    page = s.get(URL, headers=headers)
    page.encoding = 'utf-8'
    sp = BeautifulSoup(page.text, "html.parser")
    count = getCount(sp)
    insertData(getData(sp, count), tableName, collegeName, count)
    page.close()
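Because both scripts fail at the same moment, I suspect the site starts returning a shorter or different page once it sees the parallel requests. One thing I could do is check the response before parsing it, along these lines (sketch; fetchPage is a made-up helper, the 500-character threshold is arbitrary, and s / headers are the session and headers already used above):

from bs4 import BeautifulSoup

def fetchPage(URL):
    page = s.get(URL, headers=headers, timeout=10)
    page.encoding = 'utf-8'
    print(URL, page.status_code, len(page.text))   # quick diagnostics
    if page.status_code != 200 or len(page.text) < 500:
        return None                                 # treat as "no data"
    return BeautifulSoup(page.text, "html.parser")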
And my insertData method:
def insertData(dataList, tableName, collegeName, count):
    try:
        m = dataList
        if m[0] == '0':
            return
        for i in range(count):
            cursor.execute("INSERT INTO " + tableName + " VALUES (%s,%s,%s,%s,%s,%s,%s)",
                           (collegeName, m[i][0], m[i][1], m[i][2], m[i][3], m[i][4], m[i][5]))
        conn.commit()
        print("Successfully inserted into %s." % tableName)
    except pymysql.Error as e:
        print("Mysql Error %d: %s" % (e.args[0], e.args[1]))
When I run only one script, the error never appears. Can anyone tell me how to fix this, or suggest another way to run two Python crawlers at the same time? Thanks a lot!
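To make the second part of the question concrete: by "another way" I mean something like a single script that scrapes both URL sets in parallel, e.g. with multiprocessing, instead of launching EOLGrades-A.py and EOLGrades-B.py separately (rough sketch only; scrape_province is a hypothetical wrapper around my dealEachCollege loop, and the URL lists are placeholders):

from multiprocessing import Pool

def scrape_province(url_list):
    # hypothetical wrapper: loop over one province's URLs and call
    # dealEachCollege(url, tableName, collegeName) for each of them
    for url in url_list:
        pass

if __name__ == '__main__':
    urls_a = []   # the URL set currently in EOLGrades-A.py
    urls_b = []   # the URL set currently in EOLGrades-B.py
    with Pool(processes=2) as pool:
        pool.map(scrape_province, [urls_a, urls_b])

Would that run into the same problem, or is there a better pattern for this?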