我正在尝试使用python读取网页并以csv格式保存数据以导入为pandas dataframe。
我有以下代码从所有页面中提取链接,而不是尝试读取某些列字段。
for i in range(10):
url='https://pythonexpress.in/workshop/'+str(i).zfill(3)
import urllib2
from bs4 import BeautifulSoup
try:
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
for anchor in soup.find_all('div', {'class':'col-xs-8'})[:9]:
print i, anchor.text
except:
pass
我可以将这9列保存为pandas dataframe吗?
df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']
答案 0 :(得分:1)
这会返回前10页的正确结果 - 但是100页需要花费很多时间。有什么建议让它更快?
import urllib2
from bs4 import BeautifulSoup
finallist=list()
for i in range(10):
url='https://pythonexpress.in/workshop/'+str(i).zfill(3)
try:
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
mylist=list()
for anchor in soup.find_all('div', {'class':'col-xs-8'})[:9]:
mylist.append(anchor.text)
finallist.append(mylist)
except:
pass
import pandas as pd
df=pd.DataFrame(finallist)
df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']
df['Date'] = pd.to_datetime(df['Date'],infer_datetime_format=True)
df['participants'] = df['participants'].astype(int)