How do I remove special characters in Python?

Asked: 2014-01-31 06:24:18

Tags: python unicode special-characters decode

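I am running into a problem in the code below because of special characters ( ) in the description. These special characters cause an error.

Error: 'ascii' codec can't decode byte 0x94 in position 821: ordinal not in range(128). Please help me get rid of this error.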
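
The same message can be reproduced in isolation. Below is a minimal sketch in Python 3 syntax; the sample bytes are made up for illustration and are not taken from the page. It assumes the stray byte is Windows-1252 text, where 0x94 is a right double quotation mark.

```python
# Hypothetical sample: 0x94 is the Windows-1252 byte for a right double quote.
raw = b'Size: 3.59\x94 (H)'

try:
    # ASCII only covers bytes 0x00-0x7F, so 0x94 cannot be decoded.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

In Python 2 the same decode can happen implicitly, for example when a byte string is concatenated with a unicode string, or when `.encode()` is called on a byte string.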

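The data is given below in both text form and HTML form.

Link: http://m.cellularoutfitter.com/p-85870-anycom-solar-bluetooth-car-kit_c.html (the description is at the end of the page)

I have tried various approaches and encodings, but none of them worked.

First I fetch the full source of the link, then I use XPath to get the description into a variable. For certain reasons I cannot post the full code. Sorry.

Python code: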

parser = etree.HTMLParser(remove_blank_text=True, encoding="utf-8")
tree = etree.HTML(popup_html, parser)
description = tree.xpath("//span[@itemprop='description' and not(src)] ")

log.debug(str(description[0]))
for desc in description:
    log.debug(etree.tostring(desc,encoding='UTF-8'))
    if etree.tostring(desc,encoding='UTF-8').find("IFRAME") < 0:
        reply_dict['product_desc'] = reply_dict['product_desc'] + etree.tostring(desc,encoding='UTF-8')
        reply_dict['product_desc'] = reply_dict['product_desc'].replace("&#13;\n", "").replace("\n", "<br/>").replace("img","").replace('< src="/productPics/altImgs/decal-skin-pdp-2.jpg"/>',"")
        reply_dict['product_desc'] = reply_dict['product_desc'].replace("\xef\xbf\xbd","'")
        reply_dict['product_desc'] = reply_dict['product_desc'].replace("\x92","'")
        reply_dict['product_desc'] = "<br />".join(reply_dict['product_desc'].split("\n")).replace("     ", "&nbsp;").encode('ascii', 'xmlcharrefreplace')

HTML code:

<div class="centerContain">
            Convenient Bluetooth car kit easily mounts to vehicle windshield and features high-performance solar panel capable of converting UV rays into Bluetooth battery power. What's included: ANYCOM Solar Bluetooth Car Kit, window mount, suction cups, 12/24V vehicle power adapter w/USB cable, 3M adhesive tape, user guide.
            <ul>
                <li>Solar panel recharges battery, providing 30 minutes of talk time for every 3 hours of sun light</li><li>Features Digital Signal Processing (DSP) technology, including compression and echo cancellation</li><li>Easily pairs with compatible devices</li><li>Bluetooth: v2.0</li><li>Talk Time: 15 hours</li><li>Standby Time: 25 days</li><li>Operating Range: 33 ft. (10 meters)</li><li>Size: 3.59� (H) x 1.98� (W) x 0.52� (D)</li><li>Weight: 2.11 oz.</li><li>Warranty: ANYCOM limited worldwide 2-year warranty</li>
            </ul>
        </div>

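In text form:

Convenient Bluetooth car kit easily mounts to vehicle windshield and features high-performance solar panel capable of converting UV rays into Bluetooth battery power. What's included: ANYCOM Solar Bluetooth Car Kit, window mount, suction cups, 12/24V vehicle power adapter w/USB cable, 3M adhesive tape, user guide. Solar panel recharges battery, providing 30 minutes of talk time for every 3 hours of sun light Features Digital Signal Processing (DSP) technology, including compression and echo cancellation Easily pairs with compatible devices Bluetooth: v2.0 Talk Time: 15 hours Standby Time: 25 days Operating Range: 33 ft. (10 meters) Size: 3.59 (H) x 1.98 (W) x 0.52 (D) Weight: 2.11 oz. Warranty: ANYCOM limited worldwide 2-year warranty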

1 answer:

Answer 0: (score: 0)

The code is not clear, but I think the problem is with decoding. It should work after decoding with utf-8.

For example,

string = '(10 meters) Size: 3.59� (H) x 1.98� (W) x 0.52� (D) Weight:'.decode('utf-8')
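
Note that a lone 0x94 byte is not valid UTF-8, so `.decode('utf-8')` may fail on it as well. In Windows-1252, 0x94 is the right double quotation mark and 0x92 is the right single quotation mark, which would fit the inch marks in the size line. Below is a minimal sketch in Python 3 syntax, assuming the description bytes are Windows-1252 (the question does not confirm the page encoding). `to_ascii_html` is a hypothetical helper name and the sample bytes are made up for illustration.

```python
def to_ascii_html(raw_bytes, source_encoding='cp1252'):
    """Decode raw bytes, then escape non-ASCII characters as numeric HTML entities."""
    # Assumption: source_encoding matches how the page was really encoded.
    # errors='replace' maps undecodable bytes to U+FFFD instead of raising.
    text = raw_bytes.decode(source_encoding, errors='replace')
    # xmlcharrefreplace turns e.g. U+201D into '&#8221;'.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


# Hypothetical sample bytes resembling the size line in the description.
sample = b'Size: 3.59\x94 (H) x 1.98\x94 (W) x 0.52\x94 (D)'
result = to_ascii_html(sample)
print(result)  # Size: 3.59&#8221; (H) x 1.98&#8221; (W) x 0.52&#8221; (D)
```

If the characters should simply be dropped rather than escaped, passing `'ignore'` instead of `'xmlcharrefreplace'` to `encode` removes them. If the text already contains U+FFFD replacement characters, the original characters were lost earlier and cannot be recovered at this stage.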