Question

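Whenever I scrape this page on oed.com, I get small apostrophe-like marks that appear to be unicode characters. How can I filter my results and replace all of these characters with normal apostrophes? Below is the code I use to print the list of words (if you are not logged in to the site, it prints duplicate words multiple times).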

import csv
import os
import re
import urllib2

year_start = 1550
year_end = 1560
subject_search = ['Law']

path = '/Applications/Python 3.5/Economic'
resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')

# Send a browser-like User-Agent and keep cookies between requests.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent': user_agent}

# 'a' appends the words to the CSV file and the raw pages to the HTML file.
with open(resultPath, 'a') as outputw, open(htmlPath, 'a') as outputh:
    csv_writer = csv.writer(outputw)
    for year in range(year_start, year_end + 1):
        # subject_search[0] rather than str(subject_search), which would put "['Law']" in the URL.
        url = ('http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter=' + str(year)
               + '&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='
               + subject_search[0] + '&type=dictionarysearch')
        request = urllib2.Request(url, None, header)
        page = opener.open(request)

        urlpage = page.read()  # raw UTF-8 bytes, not unicode
        outputh.write(urlpage)

        # Each headword sits inside <span class="hwSect"><span class="hw">...</span>.
        new_words = re.findall(r'<span class="hwSect"><span class="hw">(.*?)</span>', urlpage)
        print new_words
        csv_writer.writerow([year] + new_words)

After printing my words, I constantly get the unicode letters \xcb\x88. For example, the word un'sentenced is printed as 'un\xcb\x88sentenced'.

How do I find all instances of these unicode letters and replace them with a proper apostrophe (')? I thought it would be something like this:

for word in new_words:
    word = re.sub('[\x00-\x7f]','', word)

But I'm stuck.

Answer 1

Regarding this: "After printing my words, I constantly get the unicode letters \xcb\x88. For example, the word un'sentenced is printed as 'un\xcb\x88sentenced'."

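Problem 1: \xcb\x88 is not unicode letters (plural). It is a single character, U+02C8 MODIFIER LETTER VERTICAL LINE, encoded in UTF-8. The Unicode standard implies that it modifies the character that follows it.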
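You can check this yourself. Here is a small sketch using only the standard library; the two bytes are the ones from the question, and the variable names are just for illustration:

```python
import unicodedata

raw = b'\xcb\x88'            # the two bytes seen in the scraped output
char = raw.decode('utf-8')   # two UTF-8 bytes decode to one code point

codepoint = 'U+%04X' % ord(char)
name = unicodedata.name(char)

print('%d bytes -> %d character' % (len(raw), len(char)))
print('%s %s' % (codepoint, name))  # U+02C8 MODIFIER LETTER VERTICAL LINE
```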

Problem 2: un'sentenced is not a word.

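You need to determine what this gadget signifies in the raw data. My guess is that it is not any kind of apostrophe, so you probably need to delete it.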

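Strong advice: do not delete every non-ASCII character you encounter. Also, read your file, decode the whole thing from UTF-8 into unicode, process the unicode, and only at the end encode your output data... do not try to process UTF-8 bytes.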
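A minimal sketch of that decode, process, encode sequence, assuming the page really is UTF-8 and that U+02C8 should simply be dropped (pass u"'" as the replacement if an apostrophe turns out to be what you want). The function name clean_headwords and the sample bytes are made up for illustration; in the question's script, urlpage would take the place of raw_page.

```python
# -*- coding: utf-8 -*-
import re

STRESS_MARK = u'\u02c8'  # MODIFIER LETTER VERTICAL LINE

def clean_headwords(raw_page, replacement=u''):
    """Decode UTF-8 bytes, extract the headwords, and replace U+02C8 in each."""
    text = raw_page.decode('utf-8')  # bytes -> unicode, once, up front
    words = re.findall(u'<span class="hwSect"><span class="hw">(.*?)</span>', text)
    # Target only the one character we understand; leave other non-ASCII alone.
    return [w.replace(STRESS_MARK, replacement) for w in words]

# Hypothetical sample standing in for the bytes returned by page.read()
raw_page = b'<span class="hwSect"><span class="hw">un\xcb\x88sentenced</span></span>'

words = clean_headwords(raw_page)
encoded = [w.encode('utf-8') for w in words]  # unicode -> bytes only when writing out
print(words)
```

Replacing one known code point, rather than stripping a whole character range, keeps accented headwords and other legitimate non-ASCII text intact.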

How to filter out unicode characters while webscraping in python?
