Question

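I am trying to use BeautifulSoup to get the PID and price data from Craigslist. I wrote a separate script that produces the file CLallsites.txt. In this code, I am trying to read each site from that txt file and get the PIDs of all entries on the first 10 pages. My code is: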

  from bs4 import BeautifulSoup       
  from urllib2 import urlopen 
  readfile = open("CLallsites.txt")
  product = "mcy"
  while 1:
    u = ""
    count = 0
    line = readfile.readline()
    commaposition = line.find(',')
    site = line[0:commaposition]
    location = line[commaposition+1:]
    site_filename = location + '.txt'
    f = open(site_filename, "a")
    while (count < 10):
       sitenow = site + "\\" + product + "\\" + str(u)
       html = urlopen(str(sitenow))                      
       soup = BeautifulSoup(html)                
       postings = soup('p',{"class":"row"})
       for post in postings:
            y = post['data-pid']
            print y
       count = count +1
       index = count*100
       u = "index" + str(index) + ".html"
    if not line:
        break
    pass

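My CLallsites.txt looks like this:

craigslist site, location (Stack Overflow does not allow posts containing Craigslist links, so I cannot show the actual text. I can try to attach the text file if that helps.)

When I run the code, I get the following error: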

  Traceback (most recent call last):
    File "reading.py", line 16, in <module>
      html = urlopen(str(sitenow))
    File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
      return _opener.open(url, data, timeout)
    File "/usr/lib/python2.7/urllib2.py", line 400, in open
      response = self._open(req, data)
    File "/usr/lib/python2.7/urllib2.py", line 418, in _open
      '_open', req)
    File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
      result = func(*args)
    File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
      return self.do_open(httplib.HTTPConnection, req)
    File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
      raise URLError(err)
  urllib2.URLError:

Any ideas about what I am doing wrong?

Answer 1

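I don't know what sitenow contains, but it looks like an invalid URL. Note that URLs use forward slashes, not backslashes (so the statement should be something like sitenow = site + "/" + product + "/" + str(u)).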
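
As a rough sketch of that fix, the URL building could be pulled out into small helpers so the result can be printed and checked before anything is fetched. The helper names parse_site_line and build_page_urls, and the http://example.org placeholder, are made up for illustration and are not from the question. It is written for Python 3 and assumes each line of CLallsites.txt has the form "site,location"; stripping the trailing newline is an extra precaution beyond the slash fix.

```python
def parse_site_line(line):
    """Split a 'site,location' line into its parts, dropping the trailing newline."""
    site, _, location = line.strip().partition(",")
    return site, location


def build_page_urls(site, product, pages=10, step=100):
    """Return the listing URLs for the first `pages` result pages of one site."""
    # Forward slashes, not backslashes, separate the parts of a URL.
    base = site.rstrip("/") + "/" + product + "/"
    urls = [base]  # the first page has no index suffix
    for page in range(1, pages):
        urls.append(base + "index" + str(page * step) + ".html")
    return urls


# Placeholder line; a real line would come from CLallsites.txt.
site, location = parse_site_line("http://example.org,somewhere\n")
for url in build_page_urls(site, "mcy", pages=3):
    print(url)
```

Printing the URLs first makes it easy to see whether they are valid before they are passed to urlopen.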
