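Filtering a pandas DataFrame based on chained splits

Time: 2020-08-13 18:27:00

Tags: python pandas

I have a pandas DataFrame with a column of file names (column name: filenames). The file names look like this: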

long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...

To filter, I do the following (let's say select_string = "0"):

df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]

But this raises:

Traceback (most recent call last):
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python_file.py", line 118, in <module>
    main()
  File "inference.py", line 57, in main
    _=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
  File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
    logger=logger, select_string=select_string)
  File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
    df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
    return self._get_value(key)
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
    loc = self.index.get_loc(label)
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    raise KeyError(key) from err
KeyError: 0

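I think it does not like the way I chain the splits, but I vaguely remember doing this a while ago and it worked. So I am confused about why it raises this error.

PS: I do know how to solve this with .contains, but I would like to use this string-comparison approach.

Any pointers would be great!

3 answers:

Answer 0: (score: 1)

Here is another approach, using .str.extract()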

import pandas as pd

df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg',]
                  })

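Now create a boolean mask. str.extract() returns a DataFrame by default, so squeeze() is used to turn the single column into a Series that works as a mask: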

mask = (df['filename'].str.extract( r'\w+_(\d+)\.jpg' )
          .astype(int)
          .eq(0)
          .squeeze())

print(df.loc[mask])

                filename
0  long_file1_name_0.jpg
2  long_file3_name_0.jpg
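
A variant of the same idea, sketched under the assumption that select_string is a string as in the question. Passing expand=False makes str.extract return a Series directly when there is a single capture group, so squeeze() is not needed, and comparing strings avoids the integer cast. The sample frame is the one from this answer; the names suffix and select_string are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"filename": ["long_file1_name_0.jpg",
                                "long_file2_name_1.jpg",
                                "long_file3_name_0.jpg",
                                "long_file3_name_33.jpg"]})
select_string = "0"  # assumed to be a str, as in the question

# expand=False with one capture group yields a Series, not a DataFrame.
suffix = df["filename"].str.extract(r"_(\d+)\.jpg$", expand=False)

# Rows that do not match the pattern become NaN and compare as False.
mask = suffix == select_string
print(df.loc[mask])
```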

Answer 1: (score: 0)

This assumes every row contains .jpg; if not, split on just . instead.

select_string=str(0) #select string should be of type str
df_fp=df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str)==select_string]

Answer 2: (score: 0)

This part:

df_fp["filenames"].str.split(".jpg")[0]

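looks up index label 0 in the resulting Series (that is, one row's list), not the first element of each row's list. If the index has no label 0, this raises the KeyError: 0 shown in the question.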
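
To make the difference concrete, here is a minimal sketch. The sample frame and its index are made up for illustration; the index deliberately has no label 0, as can happen after earlier filtering. Plain [0] on the Series returned by .str.split is a label lookup, whereas .str[0] indexes into each list:

```python
import pandas as pd

# Hypothetical sample data; note the index has no label 0.
df = pd.DataFrame(
    {"filenames": ["long_file1_name_0.jpg", "long_file2_name_1.jpg"]},
    index=[5, 6],
)

parts = df["filenames"].str.split(".jpg")

# Label-based lookup on the Series index: label 0 does not exist.
try:
    parts[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Element-wise access: first item of every list, then the last "_" piece.
suffix = parts.str[0].str.split("_").str[-1]
print(lookup_failed)
print(suffix.tolist())
```

Comparing suffix with select_string then gives a boolean mask that keeps the chained-split style of the question.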

What you are looking for is expand (it creates a new column for each element of the list produced by the split):

df[df['filenames'].str.split('.jpg', expand=True)[0].str.rsplit('_', n=1, expand=True)[1] == '0']

Alternatively, you can do this with apply:

df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']

But contains is definitely the better fit here.
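
For reference, a sketch of the contains route, assuming the suffix is always the last underscore-separated piece before .jpg. The pattern is anchored at the end of the name so that a suffix such as 10 does not match when filtering for 0, and re.escape guards against regex metacharacters in select_string. The sample data and the names pattern, kept and dropped are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"filenames": ["long_file1_name_0.jpg",
                                 "long_file2_name_1.jpg",
                                 "long_file3_name_10.jpg"]})
select_string = "0"

# Match "_<select_string>.jpg" at the very end of the file name.
pattern = "_" + re.escape(select_string) + r"\.jpg$"
mask = df["filenames"].str.contains(pattern, regex=True)

kept = df[mask]      # rows whose suffix equals select_string
dropped = df[~mask]  # the complement
print(kept)
```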