Python scraping encoding issues

Time: 2018-02-06 18:27:19

Tags: python unicode encoding web-scraping decoding

I am trying to scrape a website using BeautifulSoup. It mostly works, but I have two problems:

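  1. After fetching the data from the website, I print it to the screen and write it to a CSV file. The site has a price field in which the rupee symbol precedes the actual amount (sample structure of the price field: ₹10000). When I print the amount to the console, it prints fine. When I try to write it to the Excel sheet, I get the error `UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 28`. The other fields print to the console and write to Excel without trouble; the problem shows up in only two fields, one containing the currency symbol and the other containing a * symbol.

  2. I have a loop that fetches every page of a particular search on the site. The search returns about 344 pages of results, but the loop stops at around page 43 with only HTTP Error 500 as the error message.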

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as Soup

    filename = "data.csv"
    f = open(filename, "w")   # no encoding given, so the platform default is used
    headers = ("phone_name, phone_price, phone_rating, number_of_ratings, "
               "memory, display, camera, battery, processor, Warrenty, security, OS\n")
    f.write(headers)

    for i in range(2):      # number of pages to fetch (pages 1 and 2)
        my_url = ('https://www.flipkart.com/search?as=off&as-show=on'
                  '&otracker=start&page={}&q=cell+phones&viewType=list').format(i+1)
        print(my_url)

        uClient = uReq(my_url)
        page_html = uClient.read()
        page_soup = Soup(page_html, "html.parser")
        containers = page_soup.findAll("a", {"class": "_1UoZlX"})

        for container in containers:
            phone_name = container.find("div", {"class": "_3wU53n"}).text

            try:
                phone_price = container.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
            except AttributeError:   # find() returned None: no price element
                phone_price = 'No Data'
    
Thanks a lot for your help!

1 answer:

Answer 0 (score: 0)

When writing a .CSV file for Excel, the `utf-8-sig` encoding should be used to properly support any Unicode character. If only `utf8` is used, Excel will assume the localized ANSI encoding on Windows and display the characters incorrectly.

    #!python3
    import csv
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as Soup

    filename = "data.csv"
    # utf-8-sig writes a BOM so Excel detects UTF-8; newline='' is what the csv module expects
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        w = csv.writer(f)
        headers = 'phone_name phone_price phone_rating number_of_ratings memory display camera battery processor Warrenty security OS'
        w.writerow(headers.split())

        for i in range(2):      # number of pages to fetch (pages 1 and 2)
            my_url = 'https://www.flipkart.com/search?as=off&as-show=on&otracker=start&page={}&q=cell+phones&viewType=list'.format(i+1)
            print(my_url)
            uClient = uReq(my_url)
            page_html = uClient.read()
            page_soup = Soup(page_html, "html.parser")
            containers = page_soup.findAll("a", {"class": "_1UoZlX"})

            for container in containers:
                phone_name = container.find("div", {"class": "_3wU53n"}).text
                try:
                    phone_price = container.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
                except AttributeError:   # find() returned None: no price element
                    phone_price = 'No Data'
                w.writerow([phone_name, phone_price])
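To see why the script in the question raises `UnicodeEncodeError` and what `utf-8-sig` changes, here is a minimal sketch. It assumes the default encoding on the failing machine was `cp1252`, a common Windows ANSI code page; the question does not say which code page was actually in use. The sample price is the one given in the question.

```python
import codecs

price = "₹10000"  # U+20B9 is the Indian rupee sign

# Assumption: cp1252 stands in for the Windows default ("charmap") codec.
try:
    price.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False  # the rupee sign has no cp1252 mapping

utf8_bytes = price.encode("utf-8")
sig_bytes = price.encode("utf-8-sig")

# utf-8-sig is plain UTF-8 with the byte order mark EF BB BF prepended;
# Excel uses that signature to recognize the file as UTF-8.
has_bom = sig_bytes.startswith(codecs.BOM_UTF8)
same_payload = sig_bytes[len(codecs.BOM_UTF8):] == utf8_bytes

print(cp1252_ok, has_bom, same_payload)  # prints: False True True
```

The only difference between the two encodings is the three-byte signature at the start of the file, which is why the fix comes down to the `encoding` argument of `open`.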

Output:

    phone_name,phone_price,phone_rating,number_of_ratings,memory,display,camera,battery,processor,Warrenty,security,OS
    "Asus Zenfone 3 Laser (Gold, 32 GB)","₹9,999"
    "Intex Aqua Style III (Champagne/Champ, 16 GB)","₹3,999"
    "iVooMi i1s (Platinum Gold, 32 GB)","₹7,499"
    "Xolo ERA 3X (Posh Black, 16 GB)","₹6,999"
    "iVooMi Me1 (Sunshine Gold, 8 GB)","₹3,599"
    "Panasonic Eluga A4 (Mocha Gold, 32 GB)","₹9,790"
    Samsung Metro 313 Dual Sim,"₹2,025"
    "Samsung Galaxy J3 Pro (Gold, 16 GB)","₹6,990"
    Samsung Guru Music 2,"₹1,625"
    "Panasonic Eluga A4 (Marine Blue, 32 GB)","₹9,640"
    "Asus Zenfone 4 Selfie (Black, 32 GB)","₹9,999"
    Swipe Elite 3- 4G with VoLTE,"₹3,999"
    "Asus Zenfone Max (Black, 16 GB)","₹7,486"
    Swipe Elite 3- 4G with VoLTE,"₹3,999"
    "Swipe Elite Power (Space Grey, 16 GB)","₹5,499"
    "Celkon Diamond Mega (Grey, 16 GB)","₹5,499"
    "Asus Zenfone Max (Black, 32 GB)","₹7,999"
    "Swipe Elite Power (Champagne Gold, 16 GB)","₹5,499"
    "Asus Zenfone 4 Selfie (Gold, 32 GB)","₹9,999"
    "Karbonn Aura (Champagne, 8 GB)","₹3,199"
    "Infinix Note 4 (Ice Blue, 32 GB)","₹8,999"
    "Infinix Note 4 (Milan Black, 32 GB)","₹8,999"
    "Moto G5s Plus (Blush Gold, 64 GB)","₹15,990"
    "Moto G5s Plus (Lunar Grey, 64 GB)","₹15,940"
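One thing to keep in mind if a file written with `utf-8-sig` is later read back in Python: decoding it as plain `utf-8` leaves the byte order mark in the first cell as U+FEFF, while decoding with `utf-8-sig` strips it. A small sketch, using an in-memory buffer instead of `data.csv` so it runs anywhere; `read_rows` is a hypothetical helper and the two rows are taken from the output above.

```python
import csv
import io

# Build a small CSV in memory with the same encoding the script uses on disk.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
csv.writer(text).writerows([["phone_name", "phone_price"],
                            ["Samsung Guru Music 2", "₹1,625"]])
text.flush()
data = raw.getvalue()


def read_rows(data, encoding):
    """Decode the bytes with the given encoding and parse them as CSV."""
    return list(csv.reader(io.StringIO(data.decode(encoding), newline="")))


plain = read_rows(data, "utf-8")      # BOM survives as U+FEFF in the first cell
clean = read_rows(data, "utf-8-sig")  # BOM is stripped while decoding

print(repr(plain[0][0]))
print(repr(clean[0][0]))
```

With a real file, the equivalent is passing `encoding='utf-8-sig'` to `open` when reading as well as when writing.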

In Excel:

[screenshot not preserved]
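The second problem, the loop stopping around page 43 with HTTP Error 500, is not covered above. A 500 is a server-side error, so its cause cannot be determined from the client side. If the failure is transient or related to request rate (an assumption; nothing in the question confirms it), retrying after a pause may help. A minimal sketch follows; `fetch_with_retry` and its parameters `opener`, `retries` and `delay` are hypothetical names, not part of `urllib`.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_with_retry(url, opener=urlopen, retries=3, delay=1.0):
    """Return the response body, retrying when the server answers 5xx.

    `opener` is injectable so the function can be exercised without a
    network connection; by default it is urllib's urlopen.
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except HTTPError as err:
            # Re-raise client errors (4xx) and the error from the last attempt.
            if err.code < 500 or attempt == retries:
                raise
            time.sleep(delay * attempt)  # wait a little longer each time
```

In the scraping loop, the pair `uClient = uReq(my_url)` and `page_html = uClient.read()` could then be replaced by `page_html = fetch_with_retry(my_url)`. If the server keeps returning 500 for the same page, the error is still raised after the last attempt, so a persistent failure remains visible.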