From output

Time: 2016-06-27 00:22:09

Tags: python-2.7 csv pandas dataset data-cleansing

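I am trying to pull all the #keywords out of tweetText into a separate column, alongside the other columns. I have not listed the other columns because they would only add clutter.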

Rows whose tweetText contains no #keywords should be dropped; for rows that do contain them, the #keywords should go into a separate column.

I am a bit lost in the part where I need to filter the #keywords out of tweetText.

Input: tweetID, tweetText (there are more columns)

714602054988275712,I'm at MK Appartaments in Dobele
714600471512670212,"Baana bicycle counter.Today: 9 Same time last week: 7 Trend: ↑28% This year: 60 811 Last year: 802 079 #Helsinki #pyöräily #cycling"
714598616703320065,"Just posted a photo @ Moscow, Russia"
714593900053180416,We're #hiring! Read about our latest #job opening here: CRM Specialist #lifeinspiringcareers #Moscow #Sales
714591942949138434,Just posted a photo @ Kfc 
714591380660731904,Homeless guide on my festival of tours from locals for locals #открытаякарта. Shot by Alexandr
714591338977579009,"Who we are? #edmonton #edm #edmlife #edms #edmlifestyle #edmfamily #edmgirls #edmlov"

Expected output: tweetId, hashKey (will also contain the other columns)

714600471512670212,#Helsinki #pyöräily #cycling
714593900053180416,#hiring! #lifeinspiringcareers #Moscow #Sales
714591380660731904,#открытаякарта
714591338977579009,#edmonton #edm #edmlife #edms #edmlifestyle #edmfamily #edmgirls #edmlov"

Code:

import pandas as pd

df1 = pd.read_csv('Turkey_28.csv')

key_word = df1[['tweetID', 'tweetText']].set_index('tweetID')['tweetText']

key_word = key_word.dropna().apply(lambda x: eval(x))
key_word = key_word[key_word.apply(type) == dict]

 #I am lost in this section on how to select the hash keywords?   
def get_key_words(x):                                                       
    return pd.Series(x['tweetText'], 

key_word = key_word.apply(get_key_word)

df2 = pd.concat([coords, df1.set_index('tweetID').reindex(coords.index)], axis=1)

df2.to_csv('Turkey_key_word.csv', index=True)

Suggestions appreciated.

Edit one:

When parsing the input with the code from the selected answer, I get a syntax error.

Code:

import re
import pandas as pd

df = pd.readcsv('Turkey_Text.csv')
tweet_column = ['tweetText']
for idx in range(len(tweet_column)):
    tweet = tweet_column[idx]
    hashtag_list = re.findall(r('#\w+)', tweet)
    tweet_column[idx] = " ".join(hashtag_list)

print tweet_column[idx]

Error:

File "keyword_split.py", line 9
    tweet_column[idx] = " ".join(hashtag_list)
               ^
SyntaxError: invalid syntax

Expected output:

714600471512670212,#Helsinki 
714600471512670212,#pyöräily 
714600471512670212,#cycling
714593900053180416,#hiring! 
714593900053180416,#lifeinspiringcareers 
714593900053180416,#Moscow 
714593900053180416,#Sales
714591380660731904,#открытаякарта
714591338977579009,#edmonton 
714591338977579009,#edm 
714591338977579009,#edmlife 
714591338977579009,#edms 
714591338977579009,#edmlifestyle 
714591338977579009,#edmfamily 
714591338977579009,#edmgirls 
714591338977579009,#edmlov"

1 answer:

Answer 0: (score: 1)

Use Python and regular expressions. It will make your life easier. The regular expression r'#(\w+)' works well in this case.

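I don't fully understand the flow of your code, since I don't have much experience using pandas to search CSVs. But if you want to isolate each tweet and write a string of its keywords/hashtags back to that column, then by my understanding of general Python logic it might look something like this...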

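import re

# tweet_column is assumed to be a list of tweet strings
for idx in range(len(tweet_column)):
    tweet = tweet_column[idx]
    # the r prefix goes directly before the quote: r'(#\w+)', not r('#\w+)'
    hashtag_list = re.findall(r'(#\w+)', tweet)
    tweet_column[idx] = " ".join(hashtag_list)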
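
The syntax error reported in the question's edit comes from the misplaced raw-string prefix: in `r('#\w+)'` the parenthesis opened after `r` is never closed within the call, so the parser only gives up on the following line. A minimal sketch of the corrected pattern is shown below, using only the standard library. It assumes Python 3, where `\w` on `str` is Unicode-aware; under Python 2.7, as tagged in the question, `\w` matches non-ASCII letters only on `unicode` strings with the `re.UNICODE` flag. The helper name `extract_hashtags` and the `rows` list are illustrative, with the tweets copied from the question's sample input.

```python
import re

# Corrected pattern: the r prefix sits directly before the opening quote.
HASHTAG = re.compile(r'#\w+')


def extract_hashtags(text):
    """Return the hashtags found in text, in order of appearance."""
    return HASHTAG.findall(text)


# Sample rows taken from the question's input.
rows = [
    ("714602054988275712", "I'm at MK Appartaments in Dobele"),
    ("714593900053180416",
     "We're #hiring! Read about our latest #job opening here: "
     "CRM Specialist #lifeinspiringcareers #Moscow #Sales"),
    ("714591380660731904",
     "Homeless guide on my festival of tours from locals for locals "
     "#открытаякарта. Shot by Alexandr"),
]

# One entry per tweet; tweets without any hashtag are skipped.
per_tweet = [(tid, " ".join(extract_hashtags(text)))
             for tid, text in rows if extract_hashtags(text)]

# One entry per hashtag, repeating the tweet id for each tag.
per_tag = [(tid, tag) for tid, text in rows for tag in extract_hashtags(text)]

for tid, tags in per_tweet:
    print("%s,%s" % (tid, tags))
```

Note that this pattern yields `#hiring` without the trailing `!` and also picks up `#job`, so the result differs slightly from the expected output listed in the question.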

Here's another example
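
The same regex can also be applied to a whole column in pandas. The following is only a sketch and not part of the answer above: it assumes Python 3, pandas 0.25 or later (for `DataFrame.explode`), and the column names `tweetID` and `tweetText` from the question. The small in-memory DataFrame is a hypothetical stand-in for the CSV file, with tweet texts abridged from the question's sample.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('Turkey_28.csv'); texts are abridged.
df = pd.DataFrame({
    "tweetID": [714602054988275712, 714600471512670212, 714591380660731904],
    "tweetText": [
        "I'm at MK Appartaments in Dobele",
        "Baana bicycle counter. #Helsinki #pyöräily #cycling",
        "Homeless guide on my festival of tours #открытаякарта. Shot by Alexandr",
    ],
})

# Series.str.findall runs re.findall on every row and returns a list per row.
tags = df["tweetText"].str.findall(r"#\w+")

# Keep only tweets with at least one hashtag.
has_tags = tags.str.len() > 0

# One row per tweet: hashtags joined into a single hashKey string.
result = df.loc[has_tags, ["tweetID"]].assign(
    hashKey=tags[has_tags].str.join(" "))

# One row per hashtag: the tweet id is repeated for every tag.
long_form = df.loc[has_tags, ["tweetID"]].assign(
    hashKey=tags[has_tags]).explode("hashKey")

print(result.to_csv(index=False))
print(long_form.to_csv(index=False))
```

Either frame could then be written out with `to_csv`, for example to the `Turkey_key_word.csv` file named in the question. Any other columns can be carried along by selecting them together with `tweetID`.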