I have the following pandas DataFrame named df:
value1 value2 value3...........
0 ATG CX A setB ...
1 CTG CX B setB ...
2 AAG setA ...
3 AAG setB
4 CTG setA
5 CTG CX C setB
6 GGG setA
7 ATG setA
8 AAG CX A setB
9 GGG setB
10 A7T setB
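I want to drop the rows in which value1 contains a string ending in CX followed by a random letter.
Then, if the same value1 appears in both setA and setB, I want to keep the setA row and drop the setB row.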
I want my final DataFrame to look like this:
value1 value2 value3...........
2 AAG setA ...
4 CTG setA
6 GGG setA
7 ATG setA
10 A7T setB
So I tried the following command:
df = df.drop(df['value1'].str.contains(r'\sPR\s.+$'))
But I got an error listing many nan values:
KeyError: '[nan nan nan na...........................]' not found in axis
Then I tried:
df = df[:, df['value1'].str.contains(r'\sPR\s.+$')]
df = df.drop_duplicates(subset='value1', keep='first')
But I got:
ValueError: Location based indexing can only have [labels (MUST BE IN THE INDEX), slices of labels (BOTH endpoints included! Can be slices of integers if the index is integers), listlike of labels, boolean] types
Why does this error occur, and how can I achieve what I want?
Answer 0 (score: 0)
I constructed a toy DataFrame like this:
import pandas as pd

d = {"value1": ["ATG CX A", "CTG CX B", "AAG", "AAG", "CTG", "CTG CX C", "GGG", "ATG", "AAG CX A", "GGG", "A7T"],
"value2": ["B", "B", "A", "B", "A", "B", "A", "A", "B", "B", "B"]}
df = pd.DataFrame(d)
print(df)
value1 value2
0 ATG CX A B
1 CTG CX B B
2 AAG A
3 AAG B
4 CTG A
5 CTG CX C B
6 GGG A
7 ATG A
8 AAG CX A B
9 GGG B
10 A7T B
Then I used the .loc indexer with the str.contains() mask negated, like so:
df = df.loc[~df['value1'].str.contains(r'\sCX\s.+$'),:]
print(df)
value1 value2
2 AAG A
3 AAG B
4 CTG A
6 GGG A
7 ATG A
9 GGG B
10 A7T B
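The filter above covers only the first half of the question. One possible way to handle the second half (keeping the set A row when the same value1 appears in both sets) is sketched below. It assumes the toy frame from this answer, where value2 holds "A" or "B", and relies on "A" sorting before "B". The names mask, filtered and result are only illustrative, and na=False is an assumption that rows with a missing value1 should count as non-matching.

```python
import pandas as pd

d = {"value1": ["ATG CX A", "CTG CX B", "AAG", "AAG", "CTG", "CTG CX C",
                "GGG", "ATG", "AAG CX A", "GGG", "A7T"],
     "value2": ["B", "B", "A", "B", "A", "B", "A", "A", "B", "B", "B"]}
df = pd.DataFrame(d)

# Step 1: boolean mask of rows whose value1 contains " CX " followed by more text.
# na=False (an assumption) treats missing values as non-matching instead of NaN.
mask = df["value1"].str.contains(r"\sCX\s.+$", na=False)
filtered = df.loc[~mask, :]

# Step 2: sort so that "A" rows come before "B" rows, keep the first row for
# each value1, then restore the original row order.
result = (
    filtered.sort_values("value2", kind="stable")
    .drop_duplicates(subset="value1", keep="first")
    .sort_index()
)
print(result)
```

On the toy data this should leave rows 2, 4, 6, 7 and 10, which matches the layout the question asks for. If the real column holds other labels, the sort key would need to be adjusted so that the preferred set comes first.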