I have a DataFrame with the following columns: id, text, created_at, retweet_count, favorite_count, source, user_id
I want to build a new DataFrame by dropping the rows where df.text starts with "RT".
non_retweeted_list = []
for i in range(len(df)):
    if (df.text[i][0] and df.text[i][1]) == ('R' and 'T'):
        pass
    else:
        non_retweeted_list.append(df[i])
But I get a KeyError:
KeyError                                  Traceback (most recent call last)
/home/bd/anaconda3/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1944             try:
-> 1945                 return self._engine.get_loc(key)
   1946             except KeyError:
.
.
.
During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
<ipython-input-3-5dfc6d77a22c> in <module>()
      5         pass
      6     else:
----> 7         non_retweeted_list.append(df[i])
.
.
.
KeyError: 0
How can I fix this?
Answer 0 (score: 2)
You need boolean indexing, with str.startswith as the mask:
import pandas as pd
df = pd.DataFrame({'text':['RT apple','dog','RT baladiska']})
print(df)
           text
0      RT apple
1           dog
2  RT baladiska
mask = df['text'].str.startswith('RT')
print(mask)
0     True
1    False
2     True
Name: text, dtype: bool
# drop the rows whose text starts with RT
df1 = df[~mask]
print(df1)
  text
1  dog
# keep only the rows whose text starts with RT
df2 = df[mask]
print(df2)
           text
0      RT apple
2  RT baladiska
Alternatively:
mask = df['text'].str.contains('^RT')
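One thing to watch for with real tweet data: if the text column contains missing values, the mask will contain missing values too, and filtering with it can fail. The sketch below assumes that situation and uses the na parameter of str.startswith to treat missing text as "not a retweet". The sample data and the name non_retweets are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the None stands in for a missing tweet text
df = pd.DataFrame({'text': ['RT apple', 'dog', None, 'RT baladiska']})

# na=False treats missing values as "does not start with RT"
mask = df['text'].str.startswith('RT', na=False)

# Keep the non-retweets and renumber the remaining rows from 0
non_retweets = df[~mask].reset_index(drop=True)
print(non_retweets)
```

reset_index(drop=True) is optional. It only matters if later code expects the row labels to run 0, 1, 2, ... without gaps.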
Answer 1 (score: 1)
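It is probably the way you are referencing the index. Also, that is an odd way to check the first two characters. Why do it that way? What do you think of the approach I show below?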
non_retweeted_list = []
for i in range(len(df)):
    # .iloc selects by position, so this works whatever the index labels are
    if 'RT' == df['text'].iloc[i][0:2]:
        pass
    else:
        non_retweeted_list.append(df.iloc[i])
Finally, an if-pass statement is probably not a good idea. Use the negation instead:
non_retweeted_list = []
for i in range(len(df)):
    if 'RT' != df['text'].iloc[i][0:2]:
        non_retweeted_list.append(df.iloc[i])
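As background on the two problems in the question's loop, here is a small sketch on made-up data. First, df[i] with an integer is a column lookup, so it raises KeyError: 0 when no column is named 0, while df.iloc[i] selects a row by position. Second, ('R' and 'T') evaluates to 'T', so the original condition only compares the second character. The variable names below are illustrative only.

```python
import pandas as pd

# Hypothetical sample data with only the text column
df = pd.DataFrame({'text': ['RT apple', 'dog', 'RT baladiska']})

# df[0] is a column lookup, and there is no column named 0
try:
    df[0]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# df.iloc[1] selects the second row by position
second_row = df.iloc[1]

# Both sides reduce to 'T', so this is True even though 'AT' != 'RT'
original_check = ('A' and 'T') == ('R' and 'T')

# Comparing the first two characters directly gives the intended result
non_retweeted = [df.iloc[i] for i in range(len(df))
                 if df['text'].iloc[i][:2] != 'RT']
```

The list holds one Series per kept row. If a DataFrame is wanted at the end, boolean indexing is the more direct route.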