Extracting specific columns from a given web page

Time: 2016-12-14 08:17:58

Tags: pandas beautifulsoup bs4

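I am trying to use Python to read web pages and save the data in CSV format so that it can be imported as a pandas DataFrame.

The following code was written to extract links from every page; what I am trying to do instead is read certain column fields.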

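import urllib2
from bs4 import BeautifulSoup

for i in range(10):
    # Workshop pages are numbered 000, 001, ...
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(page, 'html.parser')
        # Print the text of the first 9 'col-xs-8' divs on the page
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            print i, anchor.text
    except Exception:
        # Skip pages that fail to download or parse
        pass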

Can I save these 9 columns as a pandas DataFrame?

df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']

1 answer:

Answer 0 (score: 1)

This returns the correct result for the first 10 pages, but 100 pages take a long time. Any suggestions for making it faster?

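import urllib2
from bs4 import BeautifulSoup

finallist = []  # one list of field values per workshop page
for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(page, 'html.parser')
        # Collect the text of the first 9 'col-xs-8' divs as one row
        mylist = [anchor.text for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]]
        finallist.append(mylist)
    except Exception:
        # Skip pages that fail to download or parse
        pass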

import pandas as pd
df=pd.DataFrame(finallist)

df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']

df['Date'] = pd.to_datetime(df['Date'],infer_datetime_format=True)
df['participants'] = df['participants'].astype(int)
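
On the speed question: most of the time in a loop like this is probably spent waiting on the network, so one possible approach is to download several pages at once with a thread pool. The sketch below is only an illustration under that assumption. It is written for Python 3 (where urllib2 became urllib.request), and the names fetch, fetch_all, BASE_URL, the 10-second timeout and max_workers=8 are hypothetical choices, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Same URL prefix as in the loop above
BASE_URL = 'https://pythonexpress.in/workshop/'


def fetch(url):
    """Download one page and return its bytes, or None if the request fails."""
    try:
        return urlopen(url, timeout=10).read()
    except OSError:
        # URLError and HTTPError both derive from OSError in Python 3
        return None


def fetch_all(page_numbers, fetcher=fetch, max_workers=8):
    """Fetch several workshop pages concurrently; results keep the input order."""
    urls = [BASE_URL + str(i).zfill(3) for i in page_numbers]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetcher in worker threads and yields results in order
        return list(pool.map(fetcher, urls))
```

Calling fetch_all(range(100)) would return a list of page contents, with None for pages that could not be downloaded. Each remaining page can then be passed to BeautifulSoup and parsed exactly as in the loop above. How much time this saves depends on the server and the connection, and a small worker count is a reasonable way to avoid overloading the site.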