Cleaning a dataset in Python

Time: 2017-03-11 16:52:27

Tags: python string csv nlp

I'm new to Python. I have a CSV file whose tweet entries are formatted like this:

  

15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,FALSE,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump

and another one:

  

16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,FALSE,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump

In Python, I load the contents with Pandas like this:

data = pd.read_csv(arg, sep=',')

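Now I want to clean the CSV file and keep only the user ID (the 3rd entry on each line) and the tweet itself (the 6th entry, I think). As you can see, I split on sep=','. The problem is that some tweets contain commas, and I don't want those characters removed by the split. It would be much easier if the separator between the tweet number, date, user_id, and so on were something other than a comma. Any suggestions on how to do this? I just want a new CSV file without the information I don't need.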

1 answer:

Answer 0: (score: 0)

  

The problem is that some tweets contain commas, and I don't want those characters removed by the split.

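The csv module in the Python standard library handles this case just fine, because commas inside a double-quoted field are treated as part of the field and not as separators: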

>>> import csv
>>> s = '''15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump
16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump
'''.splitlines()
>>> for fields in csv.reader(s):
        print(fields[2], fields[5])


785816454042124288 Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!
785563318652178432 Wow, @CNN got caught fixing their "focus group" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!
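To get the new CSV file asked for in the question, the same reader can be paired with csv.writer. The following is only a minimal sketch: the helper name extract_columns and the default column positions (2 and 5, taken from the example above) are assumptions for illustration, and in-memory streams stand in for the real files.

```python
import csv
import io


def extract_columns(src, dst, keep=(2, 5)):
    """Copy only the columns listed in `keep` from CSV stream `src` to `dst`.

    `extract_columns` and its default indices are illustrative assumptions:
    index 2 is the ID field and index 5 is the tweet text in the sample rows.
    """
    reader = csv.reader(src)
    # csv.writer re-quotes any field that contains commas or quotes.
    writer = csv.writer(dst, lineterminator="\n")
    for fields in reader:
        writer.writerow([fields[i] for i in keep])


# The two sample rows from the question, used here in place of a file.
sample = (
    '15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,'
    '"Despite winning the second debate in a landslide (every poll), it is hard '
    'to do well when Paul Ryan and others give zero support!",DonaldTrump\n'
    '16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,'
    '"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked '
    'Hillary look better. Really pathetic and totally dishonest!",DonaldTrump\n'
)

out = io.StringIO()
extract_columns(io.StringIO(sample), out)
cleaned = out.getvalue()
print(cleaned)
```

With real files, the two streams would be opened with open(path, newline='') for reading and open(path, 'w', newline='') for writing, as the csv module documentation recommends. The output stays comma-separated; the tweets are simply quoted again wherever they contain commas or quotes, so no alternative separator is needed.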