Search query only works on the first row of a dataframe

Asked: 2018-03-29 09:54:33

Tags: python python-3.x pandas

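I have an annoying problem. I have a dataframe with two rows; each row contains tuples made up of a tweet and its date, both stored as strings ('text', 'date'). I want to search every row for a specific term and return a new dataframe that contains only the tweets with that term. I know that both rows have several entries containing the relevant term. Here is my code: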

data = pd.read_pickle('filepath.pkl') 

dict_twit = {k:[] for k in data.index} ## creates empty dict for relevant tweets to go into

for i in data.index: ### data has a text-based index
    try:
        relevant_tweet = []
        for j in range(len(data.loc[i])):
            if 'query' in data.loc[i][j][0].lower():
                relevant_tweet.append(data.loc[i][j])
        dict_twit[i] = relevant_tweet
    except TypeError: ### There are empty cells in some rows
        dict_twit[i] = []

tweets_df = pd.DataFrame.from_dict(dict_twit, orient = 'index')

However, when I run the code, only the first row of tweets_df has any text; the second row is empty. Can anyone see what I am doing wrong here?

Edit: here is some sample data:

Index                Entries
digi_marketing_20th: ('RT @bigbomglobal: ? ? ?  Bigbom Interview with Dr. Long Vuong, Founder and CEO of Tomochain at MOU SIGNING CEREMONY ', '20/03/2018') , ('The latest ? eDGTL? News ?!  #digitalmarketing', '20/03/2018')
digi_marketing_21st: ('#DigitalMarketing See Top 3 Content creation tools Updated for 2017 ', '21/03/2018'), ('RT @sheerazhasan: Sheeraz, Inc digital marketing strategy for your business or brand! #digitalmarketing #socialmedia', '21/03/2018')

1 answer:

Answer 0 (score: 1)

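Using collections.defaultdict is a more efficient way to do this: the list for each key is created on first use, so the dictionary does not have to be pre-filled.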
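As a minimal sketch of that behaviour, the snippet below groups some made-up (row label, tweet text) pairs. The labels and texts are hypothetical and only there to show that a missing key starts out as an empty list.

```python
from collections import defaultdict

# Hypothetical (row label, tweet text) pairs, used only for illustration.
pairs = [
    ("row_a", "first #digitalmarketing tweet"),
    ("row_b", "another digital tweet"),
    ("row_a", "second digital tweet"),
]

matches = defaultdict(list)
for label, text in pairs:
    # No need to create matches[label] beforehand; a missing key starts as [].
    matches[label].append(text)

print(dict(matches))
```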

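For performance reasons, df.itertuples is preferred over df.iterrows, because the latter builds a Series for every row and so carries a lot of overhead.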
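To illustrate the difference, here is a small sketch on a hypothetical two-row frame (the labels, column names and tweets are made up). It only shows what each iterator yields; it is not a benchmark.

```python
import pandas as pd

# Hypothetical frame with a text-based index and one (tweet, date) tuple per cell.
demo = pd.DataFrame(
    [[("a tweet", "20/03/2018"), ("c tweet", "20/03/2018")],
     [("b tweet", "21/03/2018"), ("d tweet", "21/03/2018")]],
    index=["day_20", "day_21"],
    columns=["Col1", "Col2"],
)

# iterrows yields (label, Series) pairs; a new Series is built for each row.
for label, series in demo.iterrows():
    print(label, type(series).__name__)

# itertuples yields namedtuples: position 0 is the index label and the
# remaining positions are the column values in order.
rows = list(demo.itertuples())
first = rows[0]
print(first[0], first[1:])
```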

from collections import defaultdict
import pandas as pd

df = pd.DataFrame([['digi_marketing_20th:', ('RT @bigbomglobal: ? ? ?  Bigbom Interview with Dr. Long Vuong, Founder and CEO of Tomochain at MOU SIGNING CEREMONY ', '20/03/2018') , ('The latest ? eDGTL? News ?!  #digitalmarketing', '20/03/2018')],
                   ['digi_marketing_21st:', ('#DigitalMarketing See Top 3 Content creation tools Updated for 2017 ', '21/03/2018'), ('RT @sheerazhasan: Sheeraz, Inc digital marketing strategy for your business or brand! #digitalmarketing #socialmedia', '21/03/2018')]],
                  columns=['Index', 'Col1', 'Col2'])

#                   Index                                               Col1  \
# 0  digi_marketing_20th:  (RT @bigbomglobal: ? ? ?  Bigbom Interview wit...   
# 1  digi_marketing_21st:  (#DigitalMarketing See Top 3 Content creation ...   

d = defaultdict(list)

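for idx, row in enumerate(df.itertuples()):
    # row[0] is the DataFrame index and row[1] is the 'Index' column,
    # so the (tweet, date) tuples start at position 2.
    for tweet, date in row[2:]:
        if 'digital' in tweet.lower():
            d[idx].append(tweet)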

# defaultdict(list,
#             {0: ['The latest ? eDGTL? News ?!  #digitalmarketing'],
#              1: ['#DigitalMarketing See Top 3 Content creation tools Updated for 2017 ',
#               'RT @sheerazhasan: Sheeraz, Inc digital marketing strategy for your business or brand! #digitalmarketing #socialmedia']})
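The question also mentions a text-based index, empty cells, and a dataframe as the desired result. The following is a rough sketch of how the same loop could cover those points. It assumes that an empty cell is stored as None or NaN (anything that is not a tuple), and the frame, its labels and the search term are hypothetical.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical data: text-based index, one tweet per cell, None for an empty cell.
data = pd.DataFrame(
    [[("Top tools #DigitalMarketing", "21/03/2018"), ("unrelated news", "21/03/2018")],
     [("a digital marketing strategy", "22/03/2018"), None],
     [("nothing relevant", "23/03/2018"), None]],
    index=["day_21", "day_22", "day_23"],
    columns=["Col1", "Col2"],
)

found = defaultdict(list)
for row in data.itertuples():
    label, cells = row[0], row[1:]
    for cell in cells:
        # Assumption: empty cells are not tuples, so skip them individually
        # instead of catching TypeError for the whole row.
        if not isinstance(cell, tuple):
            continue
        tweet, date = cell
        if "digital" in tweet.lower():
            found[label].append(tweet)

# Rows without any match are missing from `found`; reindex adds them back as empty rows.
tweets_df = pd.DataFrame.from_dict(dict(found), orient="index").reindex(data.index)
print(tweets_df)
```

Skipping a single empty cell keeps the other tweets in that row, whereas the try/except around the whole row in the question discards every match for the row as soon as one cell is empty.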