Question

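I built a spider to scrape a news website, and I have to pass the scraped items to a `clear_html` function that does some editing on the scraped strings. However, I seem to have a Unicode problem. Here is the function: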

import re

def clear_html(html):
    # Strip <style> blocks, all other tags, and newline/tab/CR characters.
    # DOTALL is passed via flags= because newer Python versions reject an
    # inline (?s) that is not at the start of the pattern.
    text = re.sub(r'<(style).*?</\1>|<[^>]*?>|\n|\t|\r', '', html, flags=re.S)
    text = text.encode("utf-8")
    textlist = text.split()

    # Collect money amounts such as "P25" together with the word that follows.
    money = []
    for word in textlist:
        if re.search('^P', word) and re.search('[0-9]', word[1:]):
            x = textlist.index(word) + 1
            y = str(word) + " " + textlist[x]
            money.append(y)

    # Collect tokens to delete: words starting with "(" plus two capitals,
    # and everything from "Image courtesy" to the end of the text.
    delete = []
    for word in textlist:
        if word.startswith("("):
            if word[1].isupper() and word[2].isupper():
                delete.append(str(word))
    for word in textlist:
        if word == "Image" and textlist[textlist.index(word) + 1] == "courtesy":
            imageindex = textlist.index(word)
            dl = textlist[imageindex:]
            for d in dl:
                delete.append(str(d))

    # Map every matched substring to its replacement.
    replacements = {}
    for item in delete:
        replacements[str(item)] = ''
    for item in money:
        eitem = item.replace("P", "")
        eitem = eitem.split()
        if re.search('^billion', eitem[1]):
            replacements[str(item)] = eitem[0] + " billion" + " Phillippine peso"
        elif re.search('^million', eitem[1]):
            replacements[str(item)] = eitem[0] + " million" + " Phillippine peso"
        else:
            replacements[str(item)] = eitem[0] + " Phillippine peso"

    def multireplace(s, replacements):
        # Try longer substrings first so they win over shorter overlapping ones.
        substrs = sorted(replacements, key=len, reverse=True)
        regexp = re.compile('|'.join(map(re.escape, substrs)))
        return regexp.sub(lambda match: replacements[match.group(0)], s)

    text = multireplace(text, replacements)
    return text

This is the output:

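Port operator International Container Terminal Services Inc., through its wholly owned subsidiary ICTSI Oregon, Inc., and the Port of Portland have agreed to terminate in March a 25-year lease agreement to operate the container facility at the Port\u2019s Terminal 6.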

I don't want "\u2019" showing up like this in the last line; I need it to read "Port's Terminal 6". Can anyone help?
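For illustration, here is a minimal sketch of one way to turn the typographic apostrophe into a plain one. It assumes the `\u2019` is the right single quotation mark (U+2019) that survived in the scraped text and is merely being displayed in escaped form. The helper name `normalize_quotes` and the `PUNCTUATION_MAP` table are hypothetical and not part of the original spider.

```python
# -*- coding: utf-8 -*-
# Hypothetical mapping (an assumption for this sketch): common typographic
# punctuation and the ASCII character to use instead.
PUNCTUATION_MAP = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark (the one in the question)
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
}

def normalize_quotes(text):
    """Return text with typographic quotes replaced by ASCII quotes."""
    for fancy, plain in PUNCTUATION_MAP.items():
        text = text.replace(fancy, plain)
    return text

sample = u"the Port\u2019s Terminal 6"
print(normalize_quotes(sample))
```

Such a helper would have to run on the Unicode string, that is, before any `.encode("utf-8")` call, because after encoding the character is no longer a single code point.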

How do I set the Unicode output for this spider?
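The question does not say how the output is produced, so the following is only a guess at one common cause: a JSON export that escapes every non-ASCII character. The sketch uses the standard `json` module to show the effect. The `item` dictionary is made up for the example, and the `FEED_EXPORT_ENCODING` line is the Scrapy setting that would normally go in the project's `settings.py`, assuming a Scrapy version that supports it.

```python
import json

# Made-up item standing in for whatever the spider yields.
item = {"text": u"the Port\u2019s Terminal 6"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the character itself is kept in the output.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# In a Scrapy project, the analogous switch for feed exports is this
# setting (placed in settings.py; it has no effect in this script).
FEED_EXPORT_ENCODING = "utf-8"
```

Either form decodes back to the same string, so the escape is a display matter rather than data corruption.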

0 answers: