How do I handle problematic encodings when web scraping?

Asked: 2016-08-23 21:46:30

Tags: python python-2.7 unicode encoding web-scraping

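I am trying to scrape and combine the contents of multiple tables spread across separate webpages. I have just read quite a lot about encoding, Unicode, and everything linked from there, but I cannot work out whether I am missing something or whether the encoding on the webpage itself is the problem. In the first link, you can see that the "Brand Name" column for October 31, 2014 shows "Pear's Gourmet", but many other strings come out with odd-looking apostrophes, for example in "Children's Medical Ventures, LLC" (instead of a plain "Children's..."). I can see the odd apostrophes in IPython, and they come out the same way in the CSV file.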
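One common way such odd apostrophes arise is that text stored as UTF-8 gets decoded with a single-byte code page such as Windows-1252. The sketch below reproduces that effect on a sample string. It assumes the original character is the typographic apostrophe U+2019, which the question does not confirm, and the variable names are illustrative only.

```python
# Sample string with a typographic apostrophe (U+2019); an assumption,
# since the question does not show the exact character involved.
original = u"Children\u2019s Medical Ventures, LLC"

# Encoded as UTF-8, U+2019 becomes the three bytes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Windows-1252 turns each byte into its own
# character: E2 -> U+00E2, 80 -> U+20AC, 99 -> U+2122.
garbled = utf8_bytes.decode("cp1252")

print(repr(original))
print(repr(garbled))
```

If the scraped strings contain this three-character sequence where an apostrophe belongs, a wrong decoding step somewhere between the server and the output is a plausible cause.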

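My questions are:

  1. Am I making a mistake with the encoding that causes the wrong apostrophes?
  2. If not, how can I replace the wrong characters with apostrophes?
  3. I have tried to make the code below reproducible.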
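For question 2, one possible approach is to reverse a suspected "UTF-8 read as Windows-1252" mix-up and then map typographic apostrophes to plain ones. This is only a sketch: `fix_mojibake` and `normalize_apostrophes` are hypothetical helper names, and the assumption that the damage is this particular mix-up is not confirmed by the question.

```python
def fix_mojibake(text):
    """Undo a UTF-8-read-as-cp1252 mix-up; return text unchanged if it does not apply."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was not garbled in this particular way; leave it alone.
        return text


def normalize_apostrophes(text):
    """Replace typographic single quotes (U+2018, U+2019) with a plain apostrophe."""
    return text.replace(u"\u2018", u"'").replace(u"\u2019", u"'")


# Hypothetical garbled sample: U+00E2 U+20AC U+2122 where an apostrophe belongs.
sample = u"Children\u00e2\u20ac\u2122s Medical Ventures, LLC"
cleaned = normalize_apostrophes(fix_mojibake(sample))
print(cleaned)
```

In the script that follows, helpers like these could be applied to each cell string before it is appended to the lists. The third-party ftfy library targets the same class of problem more thoroughly.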

    #Import libraries
    import sys
    #import IPython
    print(sys.version_info[0:30])
          #python 2.7.11
    #print(IPython.version_info)
          #IPython 4.0.1
    import pandas as pd
    from bs4 import BeautifulSoup
    #from lxml import html
    import requests
    import os
    cwd = os.getcwd()
    
    #Generate dataframe and lists
    df = pd.DataFrame()
    A=[]
    B=[]
    C=[]
    D=[]
    E=[]
    F=[]
    
    #Scrape the number of separate webpages that contain tables for a given year
    pstr1 = "http://www.fda.gov/Safety/Recalls/ArchiveRecalls/"    
    #for i in range(2006,2017):
    for i in range(2014,2015):  
        a = ["/default.htm","/default.htm?Page="]
        pagename = pstr1 + str(i) + a[0]
        print pagename
        r = requests.get(pagename)
        r.raise_for_status()
        #print(r.encoding)
        r.encoding = 'utf-8'
        page = BeautifulSoup(r.text)
        nPages = page.select('.pagination-clean a') 
    
        #Scrape the data from each table and combine it into a dataframe
        for j in range(len(nPages)):
            pagename = pstr1 + str(i) + a[1] + str(j+1)
            print pagename
            r = requests.get(pagename)
            r.encoding = 'utf-8'
            soup = BeautifulSoup(r.text)
            T1=soup.find('table')
    
            for row in T1.findAll("tr"):
                cells = row.findAll('td')
    
                if len(cells)!=0: #ignore heading 
                    A.append(cells[0].find(string=True))
                    B.append(cells[1].find(string=True))
                    C.append(cells[2].find(string=True))
                    D.append(cells[3].find(string=True))
                    E.append(cells[4].find(string=True))
                    F.append(cells[5].find(string=True))
    
                    #Examine the problematic characters
                    try:
                        cells[1].find(string=True).decode('utf-8')
                        #print "string is UTF-8, length %d bytes" % len(cells[1].find(string=True))
                    except UnicodeError:
                        print "string is not UTF-8"
                        #print(cells[1].find(string=True))
    
    df=pd.DataFrame(A, columns=['Date'])
    df['Brand_Name']=B
    df['Product_Description']=C
    df['Reason_Problem']=D
    df['Company']=E
    df['Details_Photo']=F
    df.to_csv(cwd+'/Table1.csv', encoding='utf-8')
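
If the strings turn out to be correct in memory and only look wrong once the CSV is opened in a spreadsheet program, the file itself may be fine and the viewer may simply be guessing a legacy code page. A minimal sketch, assuming Python 3, the standard `csv` module instead of pandas, and made-up sample rows, writes the file with a UTF-8 byte-order mark so that such programs can detect the encoding:

```python
import csv
import os
import tempfile

# Made-up sample rows; the second one contains a typographic apostrophe (U+2019).
rows = [
    ["Brand_Name"],
    ["Pear's Gourmet"],
    ["Children\u2019s Medical Ventures, LLC"],
]

path = os.path.join(tempfile.mkdtemp(), "Table1.csv")

# "utf-8-sig" writes the byte-order mark EF BB BF before the UTF-8 data.
with open(path, "w", encoding="utf-8-sig", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Inspect the raw bytes to confirm what actually landed on disk.
with open(path, "rb") as handle:
    raw = handle.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")
apostrophe_intact = "\u2019".encode("utf-8") in raw
print(has_bom, apostrophe_intact)
```

With pandas, the analogous change would be passing `encoding='utf-8-sig'` to `to_csv`.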
    

0 answers:

No answers