Question

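I have an ugly loop that crawls a website and scrapes elements from each page, stores them in variables, and then writes them to a specific location in a CSV file.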

import requests
from lxml import html

for i in range(30):
    page = requests.get('https://www.example.com'+ df.loc[i,'ga:pagePath'])
    tree = html.fromstring(page.content.decode('utf-8'))
    postalcode2 = tree.xpath('//span[@itemprop="postalCode"]/text()')
    postalcode = tree.xpath('//span[@itemprop="addressRegion"]/text()')
    if not postalcode2 and not postalcode:
        df.loc[i,'postcode'] = 'Expired Development'
    elif not postalcode2:
        postalcode4 = postalcode[0]
        postalcode4 = postalcode4.replace(' ','')
        postalcode4 = postalcode4.replace('&nbsp;','')
        df.loc[i,'postcode'] = postalcode4
    elif not postalcode:
        postalcode3 = postalcode2[0]
        if 'Â' not in postalcode3:
            postalcode3 = postalcode3.replace('\\xa0','')
            postalcode3 = postalcode3.replace(' ','')
        else:
            postalcode3 = postalcode3.replace('\\xa0Â','')
            postalcode3 = postalcode3.replace(' ','')
        df.loc[i,'postcode'] = postalcode3
    elif postalcode2 and postalcode:
        postalcode3 = postalcode2[0]
        if 'Â' not in postalcode3:
            postalcode3 = postalcode3.replace('\\xa0','')
            postalcode3 = postalcode3.replace(' ','')
        else:
            postalcode3 = postalcode3.replace('\\xa0Â','')
            postalcode3 = postalcode3.replace(' ','')
        df.loc[i,'postcode'] = postalcode3
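
For reference, the four branches above appear to reduce to "use postalCode if it exists, otherwise addressRegion, otherwise a default". A minimal sketch of that idea follows; the helper names `clean_postcode` and `pick_postcode` are hypothetical and not part of my script, and dropping a stray 'Â' is an assumption that no real postcode contains that character.

```python
def clean_postcode(raw):
    """Remove all whitespace (including U+00A0) and any stray 'Â'."""
    # str.split() with no argument splits on every Unicode whitespace
    # character, so the no-break space goes away with ordinary spaces.
    return "".join(raw.replace("Â", "").split())


def pick_postcode(postal_code, address_region, default="Expired Development"):
    """Prefer the postalCode result, fall back to addressRegion, else default.

    Both arguments are the lists returned by tree.xpath(...).
    """
    for candidates in (postal_code, address_region):
        if candidates:
            return clean_postcode(candidates[0])
    return default
```

Inside the loop this would be used as `df.loc[i, 'postcode'] = pick_postcode(postalcode2, postalcode)`.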

This is how I read the data into the DataFrame (using pandas):

import pandas

files = 'example.csv'
df = pandas.read_csv(files, index_col=0)
df.insert(5,'postcode','')

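Depending on what is present, one of two span items can be found on the page. When 'addressRegion' is present, the script handles it fine and outputs it correctly, with no space left in the CSV.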

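However, when the page has no 'addressRegion' and only the 'postalCode' span, the output contains the character Â. The only difference between the two span attributes is that 'addressRegion' is just a plain postcode with an ordinary space, e.g. "BH12 8HJ", whereas 'postalCode' uses a & nbsp; entity as the space when it comes through, e.g. "BH12& nbsp;8HJ".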

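I wonder whether converting it to a string and trying to strip the space is what causes the problem, but I don't understand why that would create the Â character in the CSV.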
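
One possible explanation, offered as an assumption rather than a confirmed diagnosis: after HTML parsing, the nbsp entity becomes the single character U+00A0, which UTF-8 stores as the two bytes 0xC2 0xA0. If the CSV is then opened as Windows-1252, the 0xC2 byte is displayed as Â. Separately, if the replace calls really contain '\\xa0' as written above, they search for a literal backslash followed by "xa0" and never match the no-break space. The sketch below uses only the standard library and a made-up postcode value to show both effects.

```python
NBSP = "\xa0"  # U+00A0, the character the nbsp entity turns into after parsing

postcode = "BH12" + NBSP + "8HJ"

# UTF-8 encodes U+00A0 as two bytes: 0xC2 0xA0.
utf8_bytes = postcode.encode("utf-8")

# Reading those bytes as Windows-1252 turns 0xC2 into 'Â'.
misread = utf8_bytes.decode("cp1252")

# '\\xa0' is four characters (backslash, x, a, 0), so nothing is replaced.
unchanged = postcode.replace("\\xa0", "")

# '\xa0' is the single no-break space character, so this one matches.
stripped = postcode.replace("\xa0", "")

print(repr(misread), repr(unchanged), repr(stripped))
```

If the viewer's encoding is the cause, writing the file with `df.to_csv(..., encoding='utf-8-sig')` may also help, since the byte order mark lets programs such as Excel detect UTF-8; that part is untested against my data.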

Thanks for your help.

P.S. There is a space inside the nbsp entity above to make sure it is not actually rendered as a space in my question :)

Python outputting odd character to CSV through pandas

0 answers: