Create a dataframe by looping through multiple read_html links

Date: 2019-01-06 14:36:01

Tags: python pandas dataframe

I am new to Python and am trying to scrape a table from multiple pages of a website.

After reading several websites and watching videos, I managed to write code that scrapes a single page and saves it to Excel. The URL handles pagination by simply changing the page=x value at the end of the URL. I have tried, but I cannot loop through multiple pages and build a single dataframe.
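As a minimal sketch of that pagination pattern (the page range 1 to 3 here is just an assumed example), the per-page URLs can be built by appending the page number to the base URL:

# Build one URL per page by appending the page number (range 1..3 is an assumed example)
urlbase = "https://www.olx.in/coimbatore/?&page="
page_urls = [urlbase + str(page) for page in range(1, 4)]
# -> ['https://www.olx.in/coimbatore/?&page=1', ..., 'https://www.olx.in/coimbatore/?&page=3']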

Scraping a single page

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

# Fetch the first page and locate the offers table
urlbase = "https://www.olx.in/coimbatore/?&page=1"
res = requests.get(urlbase)
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find('table', id="offers_table")

# read_html returns a list of dataframes; the offers table is the first one
df = pd.read_html(str(table), header=1)

# Give the unnamed columns meaningful names
df[0].rename(index=str, columns={"Unnamed: 0": "Full Desc", "Unnamed: 2": "Detail",
                                 "Unnamed: 3": "Price", "Unnamed: 4": "Time"}, inplace=True)

# Drop rows with fewer than 3 non-NA values and write the result to Excel
df[0].dropna(thresh=3).to_excel('new.xlsx', sheet_name='Page_2',
                                columns=['Detail', 'Price', 'Time'], index=False)

Scraping multiple pages

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

urlbase = "https://www.olx.in/coimbatore/?&page="

# Fetch pages 1 to 3 by appending the page number to the base URL
for x in range(1, 4):
    res = requests.get(urlbase + str(x))

I would then like to build one dataframe by combining the dataframes created from each page. I don't know how to create multiple dataframes inside the loop and combine them.

1 Answer:

Answer 0 (score: 1)

You're almost there; you can use:

# urlbase and the imports are the same as in the question above
frames = []
for x in range(1, 4):  # pages 1 to 3
    res = requests.get(urlbase + str(x))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id="offers_table")
    df = pd.read_html(str(table), header=1)
    df[0].rename(index=str, columns={"Unnamed: 0": "Full Desc", "Unnamed: 2": "Detail",
                                     "Unnamed: 3": "Price", "Unnamed: 4": "Time"}, inplace=True)
    # Keep each cleaned page as its own dataframe
    frames.append(df[0].dropna(thresh=3))

# Combine the per-page dataframes into one and write it to Excel
result = pd.concat(frames)
result.to_excel('new.xlsx', sheet_name='Page_2',
                columns=['Detail', 'Price', 'Time'], index=False)
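
If some pages do not contain the offers table, the loop can be made a bit more defensive. The sketch below is an assumed variation, not part of the answer above: it skips pages where the table is missing and passes ignore_index=True to pd.concat so rows are renumbered across pages.

# Assumed variation of the answer's loop (not from the original post):
# skip pages without an offers table and renumber rows when concatenating.
frames = []
for x in range(1, 4):
    res = requests.get(urlbase + str(x))
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id="offers_table")
    if table is None:
        continue  # this page had no offers table; skip it
    frames.append(pd.read_html(str(table), header=1)[0].dropna(thresh=3))

result = pd.concat(frames, ignore_index=True)  # index runs 0..n-1 across all pages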