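Drop rows in a pandas DataFrame based on multiple conditions (one of them a regular expression)

Time: 2020-06-09 19:09:52

Tags: python pandas dataframe

I have the following pandas DataFrame named df: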

   value1    value2   value3...........
0  ATG CX A  setB     ...
1  CTG CX B  setB     ...
2  AAG       setA     ...
3  AAG       setB
4  CTG       setA
5  CTG CX C  setB
6  GGG       setA
7  ATG       setA
8  AAG CX A  setB
9  GGG       setB
10 A7T       setB

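I want to drop the rows where value1 contains a string ending in CX followed by a random letter.

Then, if the same value1 appears with both setA and setB, I want to keep the setA row and drop the setB row.

I want my final DataFrame to look like this: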

   value1    value2   value3...........
2  AAG       setA     ...
4  CTG       setA
6  GGG       setA
7  ATG       setA
10 A7T       setB

So I tried the following command:

df = df.drop(df['value1'].str.contains(r'\sPR\s.+$'))

But I get an error full of NaNs:

KeyError: '[nan nan nan na...........................]' not found in axis

Then I tried:

df = df[:, df['value1'].str.contains(r'\sPR\s.+$')]

df = df.drop_duplicates(subset='value1', keep='first')

But I get:

ValueError: Location based indexing can only have [labels (MUST BE IN THE INDEX), slices of labels (BOTH endpoints included! Can be slices of integers if the index is integers), listlike of labels, boolean] types

Why does this error occur? How can I achieve what I want?

1 Answer:

Answer 0: (score: 0)

I built a toy DataFrame like this:

import pandas as pd

d = {"value1": ["ATG CX A", "CTG CX B", "AAG", "AAG", "CTG", "CTG CX C", "GGG", "ATG", "AAG CX A", "GGG", "A7T"],
     "value2": ["B", "B", "A", "B", "A", "B", "A", "A", "B", "B", "B"]}

df = pd.DataFrame(d)

print(df)
      value1 value2
0   ATG CX A      B
1   CTG CX B      B
2        AAG      A
3        AAG      B
4        CTG      A
5   CTG CX C      B
6        GGG      A
7        ATG      A
8   AAG CX A      B
9        GGG      B
10       A7T      B
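
As background on the KeyError in the question: DataFrame.drop expects index labels, not a boolean mask, so the True/False/NaN values returned by str.contains() are looked up as labels and not found. The sketch below shows two ways to use the mask instead. The small frame with a missing value is made up for illustration (the NaNs in the error message suggest the real column has missing values, which is an assumption), and na=False is used so that missing values count as "no match".

```python
import pandas as pd

# Hypothetical data, including one missing value in value1
df = pd.DataFrame({
    "value1": ["ATG CX A", "AAG", None, "CTG CX C", "GGG"],
    "value2": ["B", "A", "B", "B", "A"],
})

# na=False turns missing values into False instead of NaN,
# so the result is a clean boolean mask
mask = df["value1"].str.contains(r"\sCX\s.+$", na=False)

# drop() needs index labels, so convert the mask into labels first
dropped = df.drop(df.index[mask])

# Equivalent and more common: boolean indexing with the negated mask
filtered = df[~mask]

print(filtered)
```

Both variants keep the same rows; the row with the missing value1 is kept because it does not match the pattern.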

Then I used the .loc indexer with the negated str.contains() mask, like so:

df = df.loc[~df['value1'].str.contains(r'\sCX\s.+$'),:]

print(df)
   value1 value2
2     AAG      A
3     AAG      B
4     CTG      A
6     GGG      A
7     ATG      A
9     GGG      B
10    A7T      B
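
The output above still contains both the A and the B row for AAG and GGG, so the second requirement (keep setA, drop setB) is not covered yet. One possible way to finish, sketched here on the same toy data: sort by value2 so that "A" comes before "B", keep the first row per value1, then restore the original row order. This relies on the assumption that the label to keep sorts first alphabetically ("A" before "B", or "setA" before "setB"); na=False is added only as a precaution against missing values.

```python
import pandas as pd

d = {"value1": ["ATG CX A", "CTG CX B", "AAG", "AAG", "CTG", "CTG CX C",
                "GGG", "ATG", "AAG CX A", "GGG", "A7T"],
     "value2": ["B", "B", "A", "B", "A", "B", "A", "A", "B", "B", "B"]}
df = pd.DataFrame(d)

# Step 1: remove rows whose value1 contains " CX " followed by more text
df = df.loc[~df["value1"].str.contains(r"\sCX\s.+$", na=False), :]

# Step 2: within each value1, prefer "A" over "B"
result = (
    df.sort_values("value2")                          # "A" rows sort before "B" rows
      .drop_duplicates(subset="value1", keep="first")  # keep one row per value1
      .sort_index()                                    # back to the original order
)

print(result)
```

On the toy data this leaves rows 2, 4, 6, 7 and 10, which matches the desired output in the question. A7T keeps its B row because it has no A counterpart.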