Tracking word frequency in eBay search results

Date: 2015-12-28 01:20:51

Tags: python-3.x xml-parsing beautifulsoup ebay

Using Python 3.5, what I want to do is go to the results page of an eBay search by generating the link, save the page source as an XML document, and iterate through every listing, of which there could be 1,000 or more. Then I want to create a dictionary containing every word that appears in each listing's title (titles only) and its corresponding frequency. So, for example, if I search 'honda civic' and thirty of the results are 'honda civic ignition switch', I'd like my results to come out as results = {'honda': 70, 'civic': 60, 'ignition': 30, 'switch': 30, 'jdm': 15, 'interior': 5} and so on.
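For the counting step on its own, collections.Counter handles this directly; a minimal sketch, with made-up titles standing in for real scraped data:

from collections import Counter

titles = ['honda civic ignition switch', 'honda civic jdm interior']  # made-up sample data

wordfreq = Counter()
for title in titles:
    wordfreq.update(title.lower().split())   # add one count per word in the title

print(wordfreq)   # Counter({'honda': 2, 'civic': 2, 'ignition': 1, ...})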

Here's the link I'm using: http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html?_from=R40&LH_ItemCondition=4&LH_Complete=1&LH_Sold=1&_mPrRngCbx=1&_udlo=100&_udhi=700&_nkw=honda+%281990%2C+1991%2C+1992%2C+1993%2C+1994%2C+1995%2C+1996%2C+1997%2C+1998%2C+1999%2C+2000%2C+2001%2C+2002%2C+2003%2C+2004%2C+2005%29&_sop=16

The problem I'm running into is that I only get the first 50 results, not the X,000 results I might get with different search options. What would be a better way to approach this?

And my code:

import requests
from bs4 import BeautifulSoup
from collections import Counter

# Neither url nor myquery was defined in the snippet as posted;
# url is the search URL quoted above, and myquery is a placeholder name.
url = 'http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html?_from=R40&LH_ItemCondition=4&LH_Complete=1&LH_Sold=1&_mPrRngCbx=1&_udlo=100&_udhi=700&_nkw=honda+%281990%2C+1991%2C+1992%2C+1993%2C+1994%2C+1995%2C+1996%2C+1997%2C+1998%2C+1999%2C+2000%2C+2001%2C+2002%2C+2003%2C+2004%2C+2005%29&_sop=16'
myquery = 'honda'

r = requests.get(url)
myfile = 'c:/users/' + myquery
fw = open(myfile + '.xml', 'w')

soup = BeautifulSoup(r.content, 'lxml')
for item in soup.find_all('ul', {'class': 'ListViewInner'}):
    fw.write(str(item))              # save the raw listing markup
fw.close()
print('...complete')

fr = open(myfile + '.xml', 'r')
wordfreq = Counter()
for line in fr:                      # the original reused the name i for both loops
    for word in line.split():
        wordfreq[word] += 1          # Counter treats missing keys as 0
fr.close()

fw2 = open(myfile + '_2.xml', 'w')
fw2.write(str(wordfreq))
fw2.close()
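Note that this counts every token in the saved markup, HTML tags included, rather than just the title words. A sketch of counting title text only, assuming listing titles sit in <h3> tags with class "lvtitle" (that selector is a guess and may need checking against the actual page source):

for title_tag in soup.find_all('h3', {'class': 'lvtitle'}):   # assumed selector
    wordfreq.update(title_tag.get_text().lower().split())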

1 Answer:

Answer 0 (score: 0)

You are getting the first 50 results because eBay displays 50 results per page. The solution is to parse the pages one by one. For this search, you can use a different URL:

http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html?_from=R40&LH_ItemCondition=4&LH_Complete=1&LH_Sold=1&_mPrRngCbx=1&_udlo=100&_udhi=700&_sop=16&_nkw=honda+%281990%2C+1991%2C+1992%2C+1993%2C+1994%2C+1995%2C+1996%2C+1997%2C+1998%2C+1999%2C+2000%2C+2001%2C+2002%2C+2003%2C+2004%2C+2005%29&_pgn=1&_skc=50&rt=nc

Notice the parameter _pgn=1 in the URL? It is the number of the page currently displayed. If you request a page number beyond the number of pages the search has, an error message is shown in an element with the class "sm-md".
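As an aside, rather than splicing the page number into the URL by hand, the query can be assembled with the params argument of requests.get, which URL-encodes the values for you; a trimmed sketch (only a few of this search's parameters are shown):

import requests

base = 'http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html'
params = {'_nkw': 'honda', '_sop': 16, '_pgn': 2, '_skc': 50, 'rt': 'nc'}   # subset of the full query
r = requests.get(base, params=params)   # requests builds and encodes the query string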

So you can do something like this:

import requests
from bs4 import BeautifulSoup

# Search URL with the page number left as a {page} placeholder
base_url = ('http://www.ebay.com/sch/Car-Truck-Parts-/6030/i.html?_from=R40'
            '&LH_ItemCondition=4&LH_Complete=1&LH_Sold=1&_mPrRngCbx=1'
            '&_udlo=100&_udhi=700&_sop=16'
            '&_nkw=honda+%281990%2C+1991%2C+1992%2C+1993%2C+1994%2C+1995%2C'
            '+1996%2C+1997%2C+1998%2C+1999%2C+2000%2C+2001%2C+2002%2C'
            '+2003%2C+2004%2C+2005%29'
            '&_pgn={page}&_skc=50&rt=nc')

page = 1
has_page = True

myfile = 'c:/users/' + myquery          # myquery as in the question
fw = open(myfile + '.xml', 'w')

while has_page:
    r = requests.get(base_url.format(page=page))   # rebuild the URL each pass
    soup = BeautifulSoup(r.content, 'lxml')
    # past the last page, eBay shows an error message with class "sm-md"
    error_msg = soup.find_all('p', {'class': 'sm-md'})
    if len(error_msg) > 0:
        has_page = False
        continue
    for item in soup.find_all('ul', {'class': 'ListViewInner'}):
        fw.write(str(item))
    page += 1                            # advance only after saving this page

fw.close()

I only tested entering page numbers and printing the ul elements, and it worked well.
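Putting the pagination loop together with the word counting the question asks for, a sketch (the h3/"lvtitle" title selector is again an assumption about eBay's markup at the time):

from collections import Counter

import requests
from bs4 import BeautifulSoup

wordfreq = Counter()
page = 1
while True:
    r = requests.get(base_url.format(page=page))    # base_url as defined above
    soup = BeautifulSoup(r.content, 'lxml')
    if soup.find('p', {'class': 'sm-md'}):          # error message: past the last page
        break
    for title_tag in soup.find_all('h3', {'class': 'lvtitle'}):   # assumed title markup
        wordfreq.update(title_tag.get_text().lower().split())
    page += 1

print(wordfreq.most_common(20))   # the twenty most frequent title words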