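Hi everyone, I'm currently trying to fetch some data from URLs and then predict which category each article should belong to. This is what I have so far, but it raises an error: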
info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
html, category = [], []
for i in info.index:
    response = requests.get(info.iloc[i,0])
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>','',
                        str(soup.findAll(['p','h1','\href="/avtorji/'])))])
    category.append(info.iloc[0,i])
data = pd.DataFrame()
data['html'] = html
data['category'] = category
The error is:
IndexError: single positional indexer is out-of-bounds
Can anyone help me?
Answer 0: (score: 1)
You can avoid the iloc call by using iterrows instead. I think you would have to use loc rather than iloc, because you are working with the index, but using iloc and loc inside loops is generally not very efficient anyway. You can try the following code (with a wait time inserted):
import re
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
html, category = [], []
for i, row in info.iterrows():
    url = row.iloc[0]
    time.sleep(2.5)  # wait 2.5 seconds between requests
    response = requests.get(url)  # you can use row[columnname] here as well (I only use iloc because I don't know the column names)
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>','',
                        str(soup.findAll(['p','h1','\href="/avtorji/'])))])
    # the following iloc was probably raising the error, because it accesses the ith column in the first row of your df
    # category.append(info.iloc[0,i])
    category.append(row.iloc[0])  # not sure which field you wanted to access here; you should also replace it with row['name']
data = pd.DataFrame()
data['html'] = html
data['category'] = category
If you really only need the URL in the loop, replace:
for i, row in info.iterrows():
    url = row.iloc[0]
with a loop that iterates over the URL column directly.
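To illustrate why the commented-out line probably failed, here is a minimal sketch that reproduces the problem without any network access. The small DataFrame is a hypothetical stand-in for labeled_urls.tsv, and the two-column layout (URL, then label) is an assumption, since the real file is not shown.

```python
import pandas as pd

# Hypothetical stand-in for labeled_urls.tsv (assumed layout: URL, label).
info = pd.DataFrame([
    ["https://example.com/a", "sport"],
    ["https://example.com/b", "politics"],
    ["https://example.com/c", "culture"],
])

failed_at = None
for i in info.index:
    try:
        # Row 0, column i: valid only while i is an existing column position.
        info.iloc[0, i]
    except IndexError:
        # There are more rows than columns, so i eventually runs past the last column.
        failed_at = i
        break

print(failed_at)
```

With three rows and two columns, the lookup succeeds for i = 0 and i = 1 and fails at i = 2, which is the same pattern as a file with many rows and only a few columns.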
Answer 1: (score: 1)
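The error is most likely caused by passing an index to iloc: loc expects index values and column names, while iloc expects the numeric positions of rows and columns. In addition, you have swapped the row and column positions in category.append(info.iloc[0,i]). So you should at least do: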
for i in range(len(info)):
    response = requests.get(info.iloc[i,0])
    ...
    category.append(info.iloc[i,0])
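However, since you are just iterating over the first column of the data frame, the code above is not Pythonic. It is better to use the column directly: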
for url in info.loc[:, 0]:
    response = requests.get(url)
    ...
    category.append(url)
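To make the loc/iloc distinction concrete, here is a small sketch on made-up data. The index labels 10 and 20 are chosen deliberately so that labels and positions differ. The assumption that the category sits in the second column (label 1) is mine and is not confirmed by the question.

```python
import pandas as pd

# Hypothetical data; the row labels are deliberately not 0, 1.
info = pd.DataFrame(
    [["https://example.com/a", "sport"],
     ["https://example.com/b", "politics"]],
    index=[10, 20],
)

by_label = info.loc[10, 0]      # row labelled 10, column labelled 0
by_position = info.iloc[0, 0]   # first row, first column, whatever the labels are

# Iterating over the columns directly avoids positional indexing altogether.
# Assumption: column 1 holds the category.
urls, categories = [], []
for url, label in zip(info[0], info[1]):
    urls.append(url)
    categories.append(label)
```

Here both lookups return the same cell, but info.loc[0, 0] would raise a KeyError because no row is labelled 0. Pairing the columns with zip keeps each URL aligned with its label without any index arithmetic.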