Here is my scraper code, which extracts the URLs and the corresponding comments from that particular page:
import scraperwiki
import lxml.html
from BeautifulSoup import BeautifulSoup
import urllib2
import re

for num in range(1,2):
    html_page = urllib2.urlopen("https://success.salesforce.com/ideaSearch?keywords=error&pageNo="+str(num))
    soup = BeautifulSoup(html_page)
    for i in range(0,10):
        for link in soup.findAll('a',{'id':'search:ForumLayout:searchForm:itemObj2:'+str(i)+':idea:recentIdeasComponent:profileIdeaTitle'}):
            pageurl = link.get('href')
            html = scraperwiki.scrape(pageurl)
            root = lxml.html.fromstring(html)
            for j in range(0,300):
                for table in root.cssselect("span[id='ideaView:ForumLayout:ideaViewForm:cmtComp:ideaComments:cmtLoop:"+str(j)+":commentBodyOutput'] table"):
                    divx = table.cssselect("div[class='htmlDetailElementDiv']")
                    if len(divx)==1:
                        data = {
                            'URL' : pageurl,
                            'Comment' : divx[0].text_content()
                        }
                        print data
                        scraperwiki.sqlite.save(unique_keys=['URL'], data=data)
                        scraperwiki.sqlite.save(unique_keys=['Comment'], data=data)
When the data is saved to the scraperwiki datastore, only the last comment from each URL ends up in the table. I want all of the comments for each URL saved: the URL in one column and every comment from that URL in the second column, not just the last comment, which is what this code currently produces.
Answer 0 (score: 0)
As I can see from your code, data is assigned inside the innermost for loop, and it gets a new value on every iteration. So by the time the loops finish and the save step runs, data only holds the last comment. I think you could use:
for i in range(0,10):
    for link in soup.findAll('a',{'id':'search:ForumLayout:searchForm:itemObj2:'+str(i)+':idea:recentIdeasComponent:profileIdeaTitle'}):
        pageurl = link.get('href')
        html = scraperwiki.scrape(pageurl)
        root = lxml.html.fromstring(html)
        # collect every comment for this URL before saving
        data = {'URL': pageurl, 'Comment': []}
        for j in range(0,300):
            for table in root.cssselect("span[id='ideaView:ForumLayout:ideaViewForm:cmtComp:ideaComments:cmtLoop:"+str(j)+":commentBodyOutput'] table"):
                divx = table.cssselect("div[class='htmlDetailElementDiv']")
                if len(divx)==1:
                    data['Comment'].append(divx[0].text_content())  # text_content is a method, so it must be called
        # save once per URL, after all of its comments have been collected
        scraperwiki.sqlite.save(unique_keys=['URL'], data=data)
        scraperwiki.sqlite.save(unique_keys=['Comment'], data=data)
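One caveat, offered as a suggestion rather than something I have tested against your datastore: depending on the scraperwiki library version, a Python list may not be storable directly in a sqlite column, so you may need to flatten the Comment list into a single string before saving. A minimal sketch of that, assuming the same data dict built above and an arbitrary newline separator:

# sketch: flatten the accumulated comments into one string so the value
# fits in a single sqlite column (the "\n" separator is an arbitrary choice)
data['Comment'] = "\n".join(data['Comment'])
scraperwiki.sqlite.save(unique_keys=['URL'], data=data)

Alternatively, you could save each comment as its own row and use the combination of URL and comment as the unique key, which keeps the table easier to query later.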