Writing scraped headings from a web page into a pandas DataFrame

Date: 2018-05-08 08:01:09

Tags: python beautifulsoup

I wrote this code to download the h1, h2, and h3 headings from a list of urls and write them into a pandas DataFrame, but it fails with an unpacking error saying 3 values were expected.

import requests
from bs4 import BeautifulSoup

def url_corrector(url):
    if not str(url).startswith('http'):
        return "https://" + str(url)
    else:
        return str(url)

def header_agg(url):
    h1_list = []
    h2_list = []
    h3_list = []
    p = requests.get(url_corrector(url), proxies=proxy_data, verify=False)
    soup = BeautifulSoup(p.text, 'lxml')
    for tag in soup.find_all('h1'):
        h1_list.append(tag.text)

    for tag in soup.find_all('h2'):
        h2_list.append(tag.text)

    for tag in soup.find_all('h3'):
        h3_list.append(tag.text)
    return h1_list, h2_list, h3_list

headers_frame = url_list.copy()
headers_frame['H1'],headers_frame['H2'],headers_frame['H3'] = headers_frame.url.map(lambda x: header_agg(x))

Any help on how to do this? I'm getting this error:

ValueError: too many values to unpack (expected 3)
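For context, `map` over a Series returns one object per row, and unpacking that Series into three names only works if it has exactly three rows. A minimal sketch (with hypothetical data, no network access) that reproduces the error mechanism:

```python
import pandas as pd

# Hypothetical url_list with 4 rows; the lambda stands in for header_agg
# and returns a (h1_list, h2_list, h3_list) tuple per url.
url_list = pd.DataFrame({'url': ['site1.com', 'site2.com', 'site3.com', 'site4.com']})
mapped = url_list.url.map(lambda u: (['h1s'], ['h2s'], ['h3s']))

# Unpacking tries to split the 4-row Series itself into exactly 3 values:
try:
    a, b, c = mapped
except ValueError as e:
    print(e)  # too many values to unpack (expected 3)
```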

3 Answers:

Answer 0 (score: 1)

Let's assume url_list is a dict with the following structure:

url_list = {'url': [<url1>, <url2>, <url3>, <url4>, ..., <urln>]}

The call headers_frame.url.map(lambda x: header_agg(x)) will return a list of n elements of the following form:

[<url1(h1_list, h2_list, h3_list)>, <url2(h1_list, h2_list, h3_list)>, ..., <urln(h1_list, h2_list, h3_list)>]
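To illustrate that shape: a sequence of n 3-tuples can be transposed into 3 n-length columns with zip(*...). This is a sketch with a hypothetical stub in place of header_agg, not part of the original answer:

```python
import pandas as pd

# Hypothetical stand-in for header_agg (the real one makes network requests):
# returns a (h1_list, h2_list, h3_list) tuple for each url.
def fake_header_agg(url):
    return ([url + '-h1'], [url + '-h2'], [url + '-h3'])

headers_frame = pd.DataFrame({'url': ['a', 'b']})

# zip(*...) transposes the n 3-tuples into 3 n-length tuples,
# so the 3-way assignment lines up with the 3 columns.
headers_frame['H1'], headers_frame['H2'], headers_frame['H3'] = zip(
    *headers_frame.url.map(fake_header_agg)
)
print(headers_frame['H1'].tolist())  # [['a-h1'], ['b-h1']]
```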

To produce the desired output, you may have to rewrite the last statement as a loop:

headers_frame.update({'H1': [], 'H2': [], 'H3': []})
for url in headers_frame['url']:
    headers = header_agg(url)
    headers_frame['H1'].extend(headers[0])
    headers_frame['H2'].extend(headers[1])
    headers_frame['H3'].extend(headers[2])
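A runnable version of that loop, using the dict assumption above and a hypothetical stub in place of the real header_agg:

```python
# Stub standing in for header_agg (the real one does network requests);
# it returns a (h1_list, h2_list, h3_list) tuple per url.
def fake_header_agg(url):
    return ([url + '-h1'], [url + '-h2'], [url + '-h3'])

headers_frame = {'url': ['a', 'b']}
headers_frame.update({'H1': [], 'H2': [], 'H3': []})
for url in headers_frame['url']:
    headers = fake_header_agg(url)
    headers_frame['H1'].extend(headers[0])
    headers_frame['H2'].extend(headers[1])
    headers_frame['H3'].extend(headers[2])

print(headers_frame['H1'])  # ['a-h1', 'b-h1']
```

Note that extend flattens each page's headings into one shared list per level, so row alignment with the urls is lost if pages have differing numbers of headings.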

Answer 1 (score: 0)

You have to return a single entity. Just change the return to:

return [h1_list, h2_list, h3_list]

Answer 2 (score: 0)

This seems to have solved the problem, though I'm still not sure why the original didn't work:

import numpy as np
import pandas as pd

headers_frame = url_list.copy()
H1 = []
H2 = []
H3 = []
for url in headers_frame.url:
    k = header_agg(url)
    H1.append(k[0])
    H2.append(k[1])
    H3.append(k[2])
pd.DataFrame(np.column_stack([headers_frame.url,H1,H2,H3]))
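One side note: building the frame this way leaves the default integer column labels 0..3. A sketch (with hypothetical data standing in for the scraped lists) that names the columns directly by constructing from a dict:

```python
import pandas as pd

# Hypothetical scraped results standing in for urls, H1, H2, H3
urls = ['site1.com', 'site2.com']
H1 = [['Main title'], ['Other title']]
H2 = [[], ['Subheading']]
H3 = [[], []]

# Constructing from a dict labels the columns, unlike
# pd.DataFrame(np.column_stack([...])), which yields labels 0..3
df = pd.DataFrame({'url': urls, 'H1': H1, 'H2': H2, 'H3': H3})
print(df.columns.tolist())  # ['url', 'H1', 'H2', 'H3']
```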