Cleaning data in a CSV file

Time: 2018-03-22 13:45:05

Tags: python pandas nlp jupyter-notebook data-cleaning

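I am doing sentiment analysis on cryptocurrency. My task is to clean the data in a CSV file. The data was generated from Twitter and saved in a CSV file. Before doing the sentiment analysis, I have to clean the data: for example, remove punctuation and URLs and convert the text to lowercase. The data are tweets.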

## I have already imported the useful libraries, such as NLTK (natural language processing), pandas, numpy, etc.

This is the output of the 'Tweets' column.

   ctweet['Tweets'][0:6]



 Out[5]:


    0    RT @TheLTCnews: The @LTCFoundation has publish...
    1    RT @WildchildSings: "https:/ " + /t.co/"FZrGw6xsZU ac..."
    2    RT @HODL_Whale: 5 days until #LitePay launches...
    3    LTC to USD price $211.92 "https:/" + /t.co/"CFjg1mIg..."
    4    LTC to BTC price B0.020218 "https:/" +/t.co/"XPL8NI..."
    5    LTC to GBP price £151.89 "https:/" +/t.co/"iOIbhgyd..."
    6    Litecoin dropped into the bear zone as sugges...
    Name: Tweets, dtype: object

# The output contains URLs. Because Stack Overflow won't allow me to post them, I had to alter the URLs by adding quotes and "//".

My next task is to clean the data. This is the preprocessing code.

#Preprocessing del RT @blablabla:
ctweet['tweetos'] = '' 

#add tweetos first part
for i in range(len(ctweet['Tweets'])):
    try:
        ctweet['tweetos'][i] = ctweet['Tweets'].str.split(' ')[i][0]
    except AttributeError:    
        ctweet['tweetos'][i] = 'other'

#Preprocessing tweetos. select tweetos contains 'RT @'
for i in range(len(ctweet['Tweets'])):
    if ctweet['tweetos'].str.contains('@')[i]  == False:
        ctweet['tweetos'][i] = 'other'

# remove URLs, RTs, and twitter handles
for i in range(len(ctweet['Tweets'])):
    ctweet['Tweets'][i] = " ".join([word for word in ctweet['Tweets'][i].split()
                                if 'http' not in word and '@' not in word and '<' not in word])

ctweet['Tweets'][0]

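The code above is meant to remove punctuation and URLs, convert the text to lowercase, and extract the usernames, for example. When I run the code, it raises an error.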

TypeError                                 Traceback (most recent call last)
<ipython-input-3-8254e078073a> in <module>()
      5 for i in range(len(ctweet['Tweets'])):
      6     try:
----> 7         ctweet['tweetos'][i] = ctweet['Tweets'].str.split(' ')[i][0]
      8     except AttributeError:
      9         ctweet['tweetos'][i] = 'other'

TypeError: 'float' object has no attribute '__getitem__'

What does this error mean, and how can I fix it? I am using Jupyter Notebook 5.4.1.

Update

AttributeError                            Traceback (most recent call last)
<ipython-input-7-bb6b24f62739> in <module>()
     16 # remove URLs, RTs, and twitter handles
     17 for i in range(len(ctweet['Tweets'])):
---> 18     ctweet['Tweets'][i] = " ".join([word for word in ctweet['Tweets'][i].split()
     19                                 if 'http' not in word and '@' not in word and '<' not in word])
     20 

AttributeError: 'float' object has no attribute 'split'
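
Both tracebacks complain about a float where a string was expected. With pandas this usually means some cells in the Tweets column are missing and were read as NaN, which is a float. The following is only a minimal sketch for checking that, assuming ctweet is a pandas DataFrame; the sample rows are made up for illustration and are not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV data: the second row is missing.
ctweet = pd.DataFrame({
    "Tweets": pd.Series(
        [
            "RT @TheLTCnews: The @LTCFoundation has published an update",
            np.nan,  # an empty CSV cell is typically read as a float NaN
            "LTC to USD price $211.92",
        ],
        dtype=object,
    )
})

# Rows whose value is not a string are the ones that break .split().
not_text = ctweet["Tweets"].map(lambda value: not isinstance(value, str))
bad_rows = ctweet.index[not_text].tolist()
print(bad_rows)

# .str.split() passes NaN through unchanged, so the result for that row
# is still a float, and indexing it with [0] raises the TypeError.
first_tokens = ctweet["Tweets"].str.split(" ")
print(type(first_tokens[1]))
```

If bad_rows is not empty on the real data, those rows have to be dropped or replaced before any per-row string handling.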

1 answer:

Answer 0 (score: 0)

It looks like ctweet is a dictionary, so you need to index it like this:

ctweet['tweetos'][i] = ctweet['Tweets'][i].str.split(' ')[0]

instead of: ctweet['tweetos'][i] = ctweet['Tweets'].str.split(' ')[i][0]
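
Note that a single element such as ctweet['Tweets'][i] is a plain Python string (or a float NaN), and neither has a .str accessor, so the missing values still need to be handled. An alternative sketch, assuming ctweet is a pandas DataFrame, is to skip the index loops and apply small helper functions to the whole column. The helper names first_handle and clean_tweet, the sample rows, and the choice to turn missing tweets into an empty string are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the URL and the rows are invented.
ctweet = pd.DataFrame({
    "Tweets": pd.Series(
        [
            "RT @TheLTCnews: The @LTCFoundation has published an update",
            np.nan,
            "@HODL_Whale 5 days until #LitePay launches http://example.com/abc",
        ],
        dtype=object,
    )
})


def first_handle(text):
    """Return the first token if it contains '@', otherwise 'other'."""
    if not isinstance(text, str):
        return "other"  # missing value (NaN) or any other non-string
    token = text.split(" ")[0]
    return token if "@" in token else "other"


def clean_tweet(text):
    """Drop tokens containing 'http', '@' or '<'; non-strings become ''."""
    if not isinstance(text, str):
        return ""
    kept = [word for word in text.split()
            if "http" not in word and "@" not in word and "<" not in word]
    return " ".join(kept)


# Compute tweetos before overwriting Tweets, as in the question's code.
ctweet["tweetos"] = ctweet["Tweets"].map(first_handle)
ctweet["Tweets"] = ctweet["Tweets"].map(clean_tweet)
print(ctweet)
```

This keeps the same filtering rules as the loops in the question but never calls a string method on a float. If empty rows are not wanted, dropping them first with ctweet.dropna(subset=['Tweets']) would be another option.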