Question

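I am trying to scrape and combine the contents of multiple tables on separate webpages. I have just read a lot about encoding and Unicode and everything linked to them, but I cannot figure out whether I am missing something or whether there is a problem with the encoding on the webpage. In the first link, you can see that the "Brand Name" column for October 31, 2014 shows "Pear's Gourmet", but many other strings come out with odd apostrophes, for example "Children’s Medical Ventures, LLC" (rather than "Children's..."). I can see the odd apostrophes in IPython, but they only come out wrong in the CSV file.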

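My questions are:

Did I make a mistake with the encoding that causes the wrong apostrophes?
If not, how can I replace the wrong characters with apostrophes?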
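
For the second question, one possible approach is sketched here. It assumes the odd characters are typographic quotation marks such as U+2019 (right single quotation mark) and that the text is already decoded (`unicode` in Python 2, `str` in Python 3). The helper name `normalize_quotes` and the mapping `QUOTE_MAP` are hypothetical, chosen only for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: map typographic quotes to their ASCII equivalents.
# Assumes the input is decoded text, not raw bytes.

QUOTE_MAP = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark (the usual "curly apostrophe")
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes(u"Children\u2019s Medical Ventures, LLC"))
```

Assuming every cell in a column holds a text string, something like `df['Brand_Name'].map(normalize_quotes)` could apply the same replacement to a whole column. This only changes how the text looks; it does not explain why the CSV displays incorrectly.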

I have tried to produce reproducible code below.

#Import libraries
import sys
#import IPython
print(sys.version_info[0:30])
      #python 2.7.11
#print(IPython.version_info)
      #IPython 4.0.1
import pandas as pd
from bs4 import BeautifulSoup
#from lxml import html
import requests
import os
cwd = os.getcwd()

#Generate dataframe and lists
df = pd.DataFrame()
A=[]
B=[]
C=[]
D=[]
E=[]
F=[]

#Scrape the number of separate webpages that contain tables for a given year
pstr1 = "http://www.fda.gov/Safety/Recalls/ArchiveRecalls/"    
#for i in range(2006,2017):
for i in range(2014,2015):  
    a = ["/default.htm","/default.htm?Page="]
    pagename = pstr1 + str(i) + a[0]
    print pagename
    r = requests.get(pagename)
    r.raise_for_status()
    #print(r.encoding)
    r.encoding = 'utf-8'
    page = BeautifulSoup(r.text)
    nPages = page.select('.pagination-clean a') 

    #Scrape the data from each table and combine it into a dataframe
    for j in range(len(nPages)):
        pagename = pstr1 + str(i) + a[1] + str(j+1)
        print pagename
        r = requests.get(pagename)
        r.encoding = 'utf-8'
        soup = BeautifulSoup(r.text)
        T1=soup.find('table')

        for row in T1.findAll("tr"):
            cells = row.findAll('td')

            if len(cells)!=0: #ignore heading 
                A.append(cells[0].find(string=True))
                B.append(cells[1].find(string=True))
                C.append(cells[2].find(string=True))
                D.append(cells[3].find(string=True))
                E.append(cells[4].find(string=True))
                F.append(cells[5].find(string=True))

                #Examine the problematic characters
                try:
                    cells[1].find(string=True).decode('utf-8')
                    #print "string is UTF-8, length %d bytes" % len(cells[1].find(string=True))
                except UnicodeError:
                    print "string is not UTF-8"
                    #print(cells[1].find(string=True))

df=pd.DataFrame(A, columns=['Date'])
df['Brand_Name']=B
df['Product_Description']=C
df['Reason_Problem']=D
df['Company']=E
df['Details_Photo']=F
df.to_csv(cwd+'/Table1.csv', encoding='utf-8')
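
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the CSV may already be valid UTF-8, and the program used to open it may be decoding it with a legacy code page such as Windows-1252. The sketch below lists the non-ASCII characters in a sample string, shows what U+2019 looks like under that mismatch, and shows that the `utf-8-sig` codec prepends a byte order mark. The sample string `brand` is made up for illustration.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Sample value containing a curly apostrophe (U+2019); illustrative only.
brand = u"Children\u2019s Medical Ventures, LLC"

# List every non-ASCII character with its code point and Unicode name.
for ch in brand:
    if ord(ch) > 127:
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# The same text as UTF-8 bytes: the apostrophe becomes three bytes.
utf8_bytes = brand.encode("utf-8")

# Decoding those bytes with the wrong codec (Windows-1252 here, an assumed
# example of a viewer's default) produces mojibake.
misread = utf8_bytes.decode("cp1252")

# Decoding with the matching codec round-trips without loss.
roundtrip = utf8_bytes.decode("utf-8")

# "utf-8-sig" writes a byte order mark before the text.
with_bom = brand.encode("utf-8-sig")

print(repr(misread))
print(roundtrip == brand)
print(with_bom[:3] == b"\xef\xbb\xbf")
```

If the mismatch turns out to be the cause, passing `encoding='utf-8-sig'` to `df.to_csv` is one option to try, since some spreadsheet programs use the byte order mark to detect UTF-8.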

How do I handle problematic encoding when web scraping?

0 answers: