Question

I need to collect all the links from each website in the following list (taken from a DataFrame column converted to a list):

urls = df['URLs'].tolist()

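The links found for each URL should be saved in a new column (Links) in a copy of the original dataset.

To get this information from one of these websites, I am using: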

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://www.farmaciairisdiana.it/blog/')  # for example

# parse only the <a> tags and print every href found
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

This code works fine (I tested it on a few cases).

How can I iterate over each URL and save the results in a new column?

Answer 1

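You can iterate over the list urls and save every link to a result list. Then create a new DataFrame from that list, or add the list as a new column.

For example: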

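import httplib2
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()

all_links = []
for url in urls:  # `urls` is your list from the question
    status, response = http.request(url)

    # parse only the <a> tags; find_all(..., href=True) keeps those with an href
    soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))
    for link in soup.find_all('a', href=True):
        all_links.append(link['href'])

new_df = pd.DataFrame({'Links': all_links})
print(new_df)

# or, only if all_links has exactly one entry per row of df:
# df['Links'] = all_links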
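
The commented-out assignment above only works when `all_links` has exactly one entry per row of `df`. A minimal sketch of the difference, using made-up URLs and links (assumptions for illustration, not real responses):

```python
import pandas as pd

# Hypothetical data standing in for the DataFrame from the question.
df = pd.DataFrame({'URLs': ['https://a.example/', 'https://b.example/']})

# A flat list of every link found (3 items) does not line up with 2 rows.
flat_links = ['/x', '/y', '/z']
try:
    df['Links'] = flat_links
    flat_ok = True
except ValueError:
    flat_ok = False  # pandas rejects the length mismatch

# One list of links per URL (2 items) lines up row by row.
links_per_url = [['/x', '/y'], ['/z']]
df_copy = df.copy()
df_copy['Links'] = links_per_url
```

Collecting one list per URL is therefore the shape to aim for when the result should become a column next to `URLs`.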

Edit: to create a new DataFrame that keeps each URL next to its list of links, you can use this example:

import httplib2
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()

all_links = []
for url in urls:  # `urls` is your list from the question
    status, response = http.request(url)

    # one row per URL: the URL itself plus the list of links found on it
    links = []
    all_links.append({'URLs': url, 'Links': links})

    soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))
    for link in soup.find_all('a', href=True):
        links.append(link['href'])

new_df = pd.DataFrame(all_links)
print(new_df)
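
Since the question asks for a Links column in a copy of the original dataset, the same idea can also be sketched as two small functions. The names `extract_links`, `add_links_column` and the `fetch` parameter are hypothetical helpers introduced here, not part of any library; `fetch` is assumed to be any callable that takes a URL and returns its HTML.

```python
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer


def extract_links(html):
    """Return the href of every <a> tag in an HTML string or bytes object."""
    soup = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('a'))
    return [a['href'] for a in soup.find_all('a', href=True)]


def add_links_column(df, fetch):
    """Return a copy of df with a 'Links' column built from df['URLs'].

    `fetch` is a hypothetical callable (url -> HTML); pass in whatever
    performs the request, e.g. a small wrapper around httplib2.
    """
    out = df.copy()  # leave the original DataFrame untouched
    out['Links'] = [extract_links(fetch(url)) for url in out['URLs']]
    return out
```

With httplib2 as in the question, `fetch` could be `lambda url: http.request(url)[1]`. If one row per link is preferred over a list per row, calling `.explode('Links')` on the result should give that layout.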

Web scraping over a list of URLs using BeautifulSoup
