How to change the unicode format when using get_text() in Beautiful Soup

Asked: 2015-02-05 05:52:27

Tags: python unicode pandas beautifulsoup

I get unicode-formatted output when using get_text(). How can I change the Unicode to plain strings in the DataFrame?

Tidy data needs properly formatted text. Here is my code:

import requests
from pattern import web
from bs4 import BeautifulSoup
from pandas import *

url = 'http://www.mouthshut.com/product-reviews/amazonin-reviews-925670774-srch'
r = requests.get(url)
bs = BeautifulSoup(r.text)
mouthrev = []
Title = []
for revlist in bs.find_all("li", "reviewdetails openshare"):
    title = revlist.find_all('div', 'reviewtitle fl')
    title = [g.get_text(strip=True) for g in title]

    for parent in revlist.find_all("div", itemprop='description'):
        review = parent.find_all('p')
        review = [g.get_text(strip=True) for g in review]
        mouthrev.append(review)
        Title.append(title)

mouth1 = DataFrame({'Title': Series(Title), 'Review': Series(mouthrev)})
mouth1.to_csv('D:\\Review.csv')

This is the result I get:

Title   Review
[u'Wrong product need immediate refund']    [u'I have been shopping with amazon for almost 6 months now and for the 1st time I ordered a Tuxedo. Looking at the item online it seemed perfect. My actual size for the suit is 40 which fits me perfectly. I ordered for the same size. Firstly the delivery didnt happen though I received a text statin ...']
[u'Cheating customers by sending a dummy tracking no.'] [u'Order #171-0709329-6021113( amazon.in)', u'I have placed this order on 15th Jan 2015 and I received a mail from amazon on 15th Jan 2015 itself as my order has shipped. Also I have received a tracking number of Speed Post.', u'Today it is 03rd Feb 2015, till now there is no status/details a...']
[u'BAD in Delivery. Unpredictable delivery date/time.'] [u'If Ordering from Amazon.In, be prepared for Delivery nightmares.', u'The Delivery team does NOT call you up before coming.', u'Amazon does send you Courier persons name and mobile. My experience has been is that this information is not reliable(Happened to me twice that the Delivery person I  ...']

2 Answers:

Answer 0 (score: 1)

If I understand you correctly, why not use str()?

review = [str(g.get_text(strip=True)) for g in review]

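This should work.

Answer 1 (score: 0)

This has nothing to do with Unicode. The [...] is the representation (repr) of a list. You have a list in each cell because you are collecting the text of multiple p elements: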

title = [g.get_text(strip=True) for g in title]
review = [g.get_text(strip=True) for g in review]
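
To illustrate the point, here is a minimal sketch that needs no network access. The strings in `review` are hypothetical stand-ins for what get_text(strip=True) returns, and the snippet assumes Python 3, where the repr has no u'' prefix. It suggests that calling str() on each element still leaves a list in the cell, so the brackets and quotes remain. (On Python 2, str() would drop the u prefix but could raise UnicodeEncodeError on non-ASCII text.)

```python
# Hypothetical stand-ins for the text of each <p> element.
review = [
    "Order #171-0709329-6021113( amazon.in)",
    "I have placed this order on 15th Jan 2015",
]

# Calling str() on each element leaves the container a list ...
review_str = [str(text) for text in review]

# ... so stringifying the whole cell still yields the bracketed list repr.
cell_text = str(review_str)
print(cell_text)
```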

If you want to form a single string from them, you can join the texts of the multiple p elements together as lines, for example:

review = '\n'.join(g.get_text(strip=True) for g in review)

The CSV formatter will then have a single string instead of a list, so it no longer has to coerce the data to a string via repr.
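
The difference can be seen in a small sketch. It uses the standard library's csv module as a stand-in for DataFrame.to_csv (an assumption made so the example runs without pandas), and the `paragraphs` list and the `to_csv_text` helper are hypothetical names introduced only for illustration.

```python
import csv
import io

# Hypothetical stand-ins for the per-paragraph texts from get_text(strip=True).
paragraphs = [
    "Order #171-0709329-6021113",
    "I placed this order on 15th Jan 2015.",
]


def to_csv_text(cell):
    """Write one single-cell row and return the resulting CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerow([cell])
    return buf.getvalue()


# A list cell is not a string, so the writer stringifies it, brackets and all.
as_list = to_csv_text(paragraphs)

# A joined cell is already one string; it is written as-is (quoted, because
# it contains a newline).
as_text = to_csv_text("\n".join(paragraphs))

print(as_list)
print(as_text)
```

Joining with '\n' keeps the paragraph boundaries inside one quoted CSV field; a space or another separator would work the same way if multi-line cells are awkward for whatever reads the file afterwards.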