Writing to CSV - set vs list - UnicodeEncodeError

Time: 2016-04-18 22:09:55

Tags: python csv unicode beautifulsoup

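I'm building a simple scraper to learn Python. After writing the csvWriter function below, I ran into a problem. It seems the data can't be encoded when it is written to the CSV file (I assume this is because of the price information I'm scraping).

Also, I'd like to know whether I'm right in thinking that in this case it's best to go from set -> list, so that I can zip the information and present it the way I want before writing.

Also, any general advice on how I'm approaching this?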

from bs4 import BeautifulSoup
import requests
import time
import csv

response = requests.get('http://website.com/subdomain/logqueryhere')
baseurl = 'http://website.com'

soup = BeautifulSoup(response.text)
hotelInfo = soup.find_all("div", {'class': "hotel-wrap"})

#retrieveLinks: A function to generate a list of hotel URLs to be passed to the price checker.
def retrieveLinks():
    for hotel in hotelInfo:
        urllist = []
        hotelLink  = hotel.find('a', attrs={'class': ''})
        urllist.append(hotelLink['href'])
        scraper(urllist)

hotelnameset = set()
hotelurlset = set()
hotelpriceset = set()

# Scraper: A function to scrape from the lists generated above with retrieveLinks
def scraper(inputlist):
    global hotelnameset
    global hotelurlset
    global hotelpriceset
    #Use a set here to avoid any dupes.
    for url in inputlist:
        fullurl = baseurl + url
        hotelurlset.add(str(fullurl))
        hotelresponse = requests.get(fullurl)
        hotelsoup = BeautifulSoup(hotelresponse.text)
        hoteltitle = hotelsoup.find('div', attrs={'class': 'vcard'})
        hotelhighprice = hotelsoup.find('div', attrs={'class': 'pricing'}).text
        hotelpriceset.add(hotelhighprice)
        for H1 in hoteltitle:
            hotelName = hoteltitle.find('h1').text
            hotelnameset.add(str(hotelName))
            time.sleep(2)
    csvWriter()


#csvWriter: A function to write the above mentioned sets/lists to a CSV file.
def csvWriter():
    global hotelnameset
    global hotelurlset
    global hotelpriceset
    csvname = list(hotelnameset)
    csvurl = list(hotelurlset)
    csvprice = list(hotelpriceset)
    #let's zip the values we needed (until we learn a better way to do it)
    zipped = zip(csvname, csvurl, csvprice)
    c = csv.writer(open("hoteldata.csv", 'wb'))
    for row in zipped:
        c.writerow(row)

retrieveLinks()

The error is as follows:

± |Add_CSV_Writer U:2 ✗| → python main.py 
Traceback (most recent call last):
  File "main.py", line 62, in <module>
    retrieveLinks()
  File "main.py", line 18, in retrieveLinks
    scraper(urllist)
  File "main.py", line 44, in scraper
    csvWriter()
  File "main.py", line 60, in csvWriter
    c.writerow(row)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

1 Answer:

Answer 0 (score: 1)

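Posting your actual error would really help! In any case, in Python 2.x the csv writer does not automatically encode unicode for you. You basically have to either write your own wrapper, use unicodecsv (https://pypi.python.org/pypi/unicodecsv/0.9.0), or use one of the unicode CSV implementations available on the web: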

import unicodecsv
def csvWriter():
    global hotelnameset
    global hotelurlset
    global hotelpriceset
    csvname = list(hotelnameset)
    csvurl = list(hotelurlset)
    csvprice = list(hotelpriceset)
    #let's zip the values we needed (until we learn a better way to do it)
    zipped = zip(csvname, csvurl, csvprice)
    with open('hoteldata.csv', 'wb') as f_out:
        c = unicodecsv.writer(f_out, encoding='utf-8')
        for row in zipped:
            c.writerow(row)
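
On the set vs list part of the question: three separate sets lose the pairing between name, URL and price, because sets are unordered, so zipping them afterwards can match a hotel with the wrong price (and two hotels with the same price collapse into one entry). A minimal sketch of one alternative, assuming Python 3 rather than the Python 2 used above: keep each hotel as a single (name, url, price) tuple and let the file object do the UTF-8 encoding. The function name write_hotel_csv, the sample records and the output file name are hypothetical, made up for illustration.

```python
import csv
import os
import tempfile


def write_hotel_csv(path, records):
    """Write (name, url, price) tuples to a UTF-8 encoded CSV file.

    Returns the number of data rows written (duplicates are dropped).
    """
    # set() removes exact duplicate tuples while keeping each hotel's
    # name, URL and price together; sorted() gives a stable row order.
    unique_rows = sorted(set(records))
    # In Python 3 the csv module works on text, so the encoding is
    # chosen when opening the file; newline="" is what the csv docs
    # recommend to avoid extra blank lines on some platforms.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "url", "price"])
        writer.writerows(unique_rows)
    return len(unique_rows)


# Hypothetical sample data, including a euro sign and one duplicate.
records = [
    ("Hotel B", "http://website.com/b", "\u20ac95"),
    ("Hotel A", "http://website.com/a", "\u20ac120"),
    ("Hotel A", "http://website.com/a", "\u20ac120"),
]

out_path = os.path.join(tempfile.gettempdir(), "hoteldata_sketch.csv")
written = write_hotel_csv(out_path, records)
```

In the scraper this would mean appending one tuple per hotel inside the loop and calling the writer once at the end, instead of adding to three global sets and rewriting the file after every hotel.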