Question

I'm working on a scraper for a number of Chinese documents. As part of the project, I'm trying to scrape the body of each document into a list and then write an HTML version of the document from that list (the final version will include metadata as well as the text, along with a folder full of individual HTML files for the documents).

I've managed to scrape the body of the document into a list and then use the contents of that list to create a new HTML document. I can even view the contents when I output the list to a CSV (so far so good...). Unfortunately, the HTML document that is output consists entirely of escape sequences like "\u6d88\u9664\u8d2b\u56f0\u3001\".

Is there a way to encode the output so that this won't happen? Do I just need to grow up and scrape the page for real (parsing and organizing it <p> by <p> instead of just copying all of the existing HTML as is) and then build the new HTML page element by element?

Any thoughts would be most appreciated.

from bs4 import BeautifulSoup
import urllib
#csv is for the csv writer
import csv

#initiates the list to hold the output

holder = []

#this is the target URL
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm"

data = []

filename = "fullbody.html"
target = open(filename, 'w')

def bodyscraper(url):
    #opens the url for read access
    this_url = urllib.urlopen(url).read()
    #creates a new BS holder based on the URL
    soup = BeautifulSoup(this_url, 'lxml')

    #finds the body text
    body = soup.find('td', {'class':'b12c'})


    data.append(body)

    holder.append(data)

    print holder[0]
    for item in holder:
        target.write("%s\n" % item)

bodyscraper(target_url)


with open('bodyscraper.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(holder)

Answer 1

The source page is UTF-8 encoded, so decoding what urllib returns before handing it to BeautifulSoup works. I have tested this, and both the HTML and the CSV output show the Chinese characters. Here is the amended code:

from bs4 import BeautifulSoup
import urllib
#csv is for the csv writer
import csv

#initiates the list to hold the output

holder = []

#this is the target URL
target_url = "http://www.gov.cn/zhengce/content/2016-12/02/content_5142197.htm"

data = []

filename = "fullbody.html"
target = open(filename, 'w')

def bodyscraper(url):
    #opens the url for read access
    this_url = urllib.urlopen(url).read()
    #creates a new BS holder based on the URL
    soup = BeautifulSoup(this_url.decode("utf-8"), 'lxml') #decode the bytes urllib returns

    #finds the body text
    body = soup.find('td', {'class':'b12c'})
    target.write("%s\n" % body) #write the whole decoded body to html directly


    data.append(body)

    holder.append(data)


bodyscraper(target_url)
target.close() #flush and close the HTML file


with open('bodyscraper.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(holder)
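
As far as I can tell, the escapes in the question come less from the download than from what gets written: the original loop formats the list `data`, and in Python 2 a list's string form uses the repr of each element, which escapes non-ASCII characters. Writing the element itself, as the amended code does, avoids that. Below is a minimal Python 3 sketch of the difference, using only the standard library. The sample markup, the temporary output path, and the use of ascii() to imitate Python 2's escaping are my own assumptions for illustration, not part of the original scraper.

```python
# Python 3 sketch (the code above is Python 2). No network or bs4 needed.
import os
import tempfile

# Hypothetical stand-in for the bytes that urlopen(url).read() would return.
raw = '<td class="b12c">消除贫困、</td>'.encode("utf-8")

# Decode the bytes to text, as the amended code does before parsing.
body_text = raw.decode("utf-8")

# Same nesting as the question: a list inside a list.
holder = [[body_text]]

# Formatting the inner list: ascii() imitates the escaped repr that
# Python 2 produced for non-ASCII text inside a container.
escaped = "%s\n" % ascii(holder[0])

# Formatting the element itself keeps the characters intact.
direct = "%s\n" % body_text

# Write the readable version with an explicit encoding.
out_path = os.path.join(tempfile.mkdtemp(), "fullbody.html")
with open(out_path, "w", encoding="utf-8") as target:
    target.write(direct)

# Read it back to confirm the characters survived the round trip.
with open(out_path, encoding="utf-8") as check:
    round_trip = check.read()

print(escaped)
print(round_trip)
```

The first print shows the escaped form from the question, and the second shows the readable Chinese text. Opening the output file with an explicit encoding is a Python 3 habit; in the Python 2 code above, writing the tag directly already yields UTF-8 bytes.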

HTML scraper output stuck in utf-8
