Question

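Hi everyone, I'm currently trying to fetch some data from URLs and then predict which category each article should belong to. This is what I have so far, but it raises an error: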

    info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
    html, category = [], []
    for i in info.index:
        response = requests.get(info.iloc[i,0])
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>','', 
                      str(soup.findAll(['p','h1','\href="/avtorji/'])))])
        category.append(info.iloc[0,i])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

The error is this:

IndexError: single positional indexer is out-of-bounds

Can someone help me with this?

Answer 1

You can avoid the iloc call and use iterrows instead. I also think you would have to use loc rather than iloc, because you are operating on the index, but using iloc and loc in loops is generally not very efficient. You can try the following code (with a wait time inserted):

import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

info = pd.read_csv('labeled_urls.tsv',sep='\t',header=None)
html, category = [], []
for i, row in info.iterrows():
    url = row.iloc[0]
    time.sleep(2.5)  # wait 2.5 seconds between requests
    response = requests.get(url)  # you can use row[columnname] instead here as well (I only use iloc because I don't know the column names)
    soup = BeautifulSoup(response.text, 'html.parser')
    html.append([re.sub(r'<.*?>','', 
                  str(soup.findAll(['p','h1','\href="/avtorji/'])))])
    # the following iloc was probably raising the error, because it accesses the ith column in the first row of your df
    # category.append(info.iloc[0,i])
    category.append(row.iloc[0])  # not sure which field you wanted to access here; you should also replace it with row['name']

data = pd.DataFrame()
data['html'] = html
data['category'] = category

If you really only need the URL inside the loop, replace:

for i, row in info.iterrows():
    url = row.iloc[0]

with a loop that iterates directly over the URL column.
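
Iterating over the columns directly, rather than over row positions, could look roughly like the sketch below. It rests on assumptions that are not confirmed by the question: that the label sits in column 1 of the file, and that a hypothetical `fetch` callable returns the HTML for a URL. A stub replaces the real HTTP request so the example runs offline.

```python
import re

import pandas as pd


def build_dataset(info, fetch, url_col=0, label_col=1):
    """Return a DataFrame with the stripped text and label for each URL.

    `fetch` is a hypothetical callable mapping a URL to its HTML string.
    `label_col=1` is an assumption about the layout of the input file.
    """
    rows = []
    # zip walks both columns in lockstep, so no positional indexing is needed
    for url, label in zip(info[url_col], info[label_col]):
        text = re.sub(r'<.*?>', '', fetch(url))  # crude tag stripping, as in the question
        rows.append({'html': text, 'category': label})
    return pd.DataFrame(rows, columns=['html', 'category'])


# Offline demo: a dict lookup stands in for the network call
info = pd.DataFrame({0: ['http://example.com/a', 'http://example.com/b'],
                     1: ['sport', 'news']})
fake_pages = {'http://example.com/a': '<h1>Goal</h1><p>Match report</p>',
              'http://example.com/b': '<p>Election <b>results</b></p>'}
data = build_dataset(info, fake_pages.get)
```

With real pages, something like `lambda url: requests.get(url).text` could be passed as `fetch`, ideally with a pause between requests as in the code above.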

Answer 2

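The error is most likely caused by passing an index value to iloc: loc expects index values and column names, while iloc expects the numerical positions of rows and columns. In addition, you have swapped the row and column positions for category in category.append(info.iloc[0,i]). So you should at least do: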

for i in range(len(info)):
    response = requests.get(info.iloc[i,0])
    ...
    category.append(info.iloc[i,0])  # row i, column 0; if the label is stored in another column, use that column's position instead

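But since you are iterating over the values of the first column of a dataframe, the code above is not Pythonic. It is better to use the column directly: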

for url in info.loc[:, 0]:
    response = requests.get(url)
    ...
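    category.append(url)  # mirrors the snippet above and appends the URL itself; use the label column if the category is stored separately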
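
The difference between loc and iloc, and the way the original error arises, can be reproduced on a toy frame. This is only a sketch: the three-row, two-column layout and the example URLs are assumptions standing in for labeled_urls.tsv.

```python
import pandas as pd

# Toy stand-in for labeled_urls.tsv (layout is an assumption)
info = pd.DataFrame({0: ['http://example.com/a',
                         'http://example.com/b',
                         'http://example.com/c'],
                     1: ['sport', 'news', 'sport']})
info.index = [10, 20, 30]  # non-default labels make the difference visible

by_position = info.iloc[0, 1]  # first row, second column, by position
by_label = info.loc[10, 1]     # row labelled 10, column named 1

# The question's pattern is info.iloc[0, i] with i running over the rows.
# With 3 rows but only 2 columns, i == 2 asks for a column that does not exist.
try:
    info.iloc[0, 2]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(by_position, by_label, error_message)
```

Both lookups return the same cell here, but only because position 0 and label 10 refer to the same row; the failing call shows why a row counter must not be used as a column position.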

Get data from URL and put it into a DataFrame
