I am trying to scrape a set of web pages. When I scrape a single page directly, I can access the HTML. But when I iterate over a pandas DataFrame to scrape a set of pages, even a DataFrame with only one row, the HTML comes back truncated and I cannot extract the data I need.
Iterating over a DataFrame with one row:

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
first_names = pd.Series(['Robert'], index = [0])
last_names = pd.Series(['McCoy'], index = [0])
names = pd.DataFrame(columns = ['first_name', 'last_name'])
names['first_name'] = first_names
names['last_name'] = last_names
freq = []
for first_name, last_name in names.iterrows():
    url = "https://zbmath.org/authors/?q={}+{}".format(first_name,
                                                       last_name)
    r = requests.get(url)
    html = BeautifulSoup(r.text)
    html = str(html)
    frequency = re.findall('Joint\sPublications">(.*?)</a>', html)
    freq.append(frequency)
print(freq)
[[]]
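A quick diagnostic sketch (not part of the original post) shows what the two loop targets actually receive from iterrows(), which explains the empty result above:

```python
import pandas as pd

# iterrows() yields (index, row) pairs, so the two loop targets receive
# the row index and the entire row, not the two column values.
names = pd.DataFrame({'first_name': ['Robert'], 'last_name': ['McCoy']})

for first_name, last_name in names.iterrows():
    print(first_name)       # 0 -- the row index, not 'Robert'
    print(type(last_name))  # <class 'pandas.core.series.Series'>
```

The formatted URL therefore contains the index and a stringified Series rather than the two names, so zbMATH returns a page with no matches and re.findall finds nothing.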
Accessing the page directly, with the same code, is not blocked:

url = "https://zbmath.org/authors/?q=robert+mccoy"
r = requests.get(url)
html = BeautifulSoup(r.text)
html=str(html)
frequency = re.findall('Joint\sPublications">(.*?)</a>', html)
freq.append(frequency)
print(freq)
[[],['10','8','6','5','3','3','2','2','2','2','2 ','1','1','1','1','1','1','1','1','1','1','1','1', '1','1']]
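For reference, the regular expression itself behaves as expected when the target markup is present. Here is a minimal, self-contained check against a hypothetical fragment (the attribute layout is illustrative, not actual zbMATH markup):

```python
import re

# Hypothetical fragment mimicking the markup the pattern targets
html = '<a href="/authors/..." title="Joint Publications">10</a>'
frequency = re.findall(r'Joint\sPublications">(.*?)</a>', html)
print(frequency)  # ['10']
```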
How can I loop over multiple pages without being blocked?
Answer 0 (score: 0)
iterrows() returns an (index, row) tuple, where row is a Series of the column values, so the fix is to unpack it slightly differently:
for _, (first_name, last_name) in names.iterrows():
    url = "https://zbmath.org/authors/?q={}+{}".format(first_name,
                                                       last_name)
    r = requests.get(url)
    html = BeautifulSoup(r.text, "html.parser")  # explicit parser avoids a warning
    html = str(html)
    # raw string so \s is not treated as an invalid string escape
    frequency = re.findall(r'Joint\sPublications">(.*?)</a>', html)
    freq.append(frequency)
print(freq)
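As an alternative sketch (my suggestion, not from the answer above): itertuples(index=False) yields plain named tuples, which avoids the index-unpacking pitfall entirely and is generally faster than iterrows():

```python
import pandas as pd

names = pd.DataFrame({'first_name': ['Robert'], 'last_name': ['McCoy']})

for row in names.itertuples(index=False):
    # Attribute access by column name; no index to discard
    url = "https://zbmath.org/authors/?q={}+{}".format(row.first_name,
                                                       row.last_name)
    print(url)  # https://zbmath.org/authors/?q=Robert+McCoy
```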