Question

I have a pandas DataFrame with a column of file names (the column is called filenames). The file names look like this:

long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...

To filter, I do this (let's say select_string = "0"):

df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]

But I get this error:

Traceback (most recent call last):
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "python_file.py", line 118, in <module>
    main()
  File "inference.py", line 57, in main
    _=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
  File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
    logger=logger, select_string=select_string)
  File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
    df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
    return self._get_value(key)
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
    loc = self.index.get_loc(label)
  File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    raise KeyError(key) from err
KeyError: 0

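I suspect it does not like me chaining the splits, but I vaguely remember doing this a while ago and it worked. So I am puzzled as to why it raises this error.

PS: I do know how to solve this with .contains, but I would like to use this approach of comparing strings.

Any pointers would be great!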

Answer 1

Here is another approach, using .str.extract():

import pandas as pd

df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg',]
                  })

Now create a boolean mask. The squeeze() method turns the one-column result into a Series, so it can be used as a mask:

mask = (df['filename'].str.extract( r'\w+_(\d+)\.jpg' )
          .astype(int)
          .eq(0)
          .squeeze())

print(df.loc[mask])

                filename
0  long_file1_name_0.jpg
2  long_file3_name_0.jpg
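
As a variation on the above, str.extract accepts expand=False, which returns a Series directly when the pattern has a single capture group, so squeeze() is not needed. A minimal sketch reusing the sample frame from this answer; the shortened pattern with the `$` anchor is my own assumption about the filename layout:

```python
import pandas as pd

df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg']})

# expand=False with one capture group yields a Series of strings
suffix = df['filename'].str.extract(r'_(\d+)\.jpg$', expand=False)

# Compare as strings, so no astype(int) is required
mask = suffix == '0'
print(df.loc[mask])
```

This also sidesteps one edge case: on a frame with a single row, squeeze() collapses the one-by-one result to a scalar rather than a Series.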

Answer 2

This assumes every row contains .jpg; if not, change the split to use just . instead.

select_string=str(0) #select string should be of type str

df_fp=df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str)==select_string]
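
To illustrate why the comment above insists on a str, here is a small sketch with made-up data (the frame below is an assumption that mirrors the question's examples). The pieces produced by split are strings, so comparing them with the integer 0 matches nothing:

```python
import pandas as pd

# Hypothetical sample data in the shape described in the question
df_fp = pd.DataFrame({"filenames": ["long_file1_name_0.jpg",
                                    "long_file2_name_1.jpg",
                                    "long_file3_name_0.jpg"]})

suffix = df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1])

matches_int = (suffix == 0).sum()       # str compared with int: no row matches
matches_str = (suffix == str(0)).sum()  # str compared with str
print(matches_int, matches_str)
```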

Answer 3

This part:

df_fp["filenames"].str.split(".jpg")[0]

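returns the first row of the Series (the entry whose index label is 0), not the first element of each list. If no row carries the label 0, you get the KeyError: 0 from the traceback.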
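
A small sketch of that difference, using a made-up frame whose index deliberately does not contain 0 (an assumption, to mimic a frame that was filtered earlier). Plain [0] on the split result is a label lookup on the Series, while .str[0] and .str[-1] index into each element:

```python
import pandas as pd

# Hypothetical data; the index 5, 6, 7 is chosen so that label 0 does not exist
df_fp = pd.DataFrame(
    {"filenames": ["long_file1_name_0.jpg",
                   "long_file2_name_1.jpg",
                   "long_file3_name_0.jpg"]},
    index=[5, 6, 7],
)
select_string = "0"

parts = df_fp["filenames"].str.split(".jpg")  # Series of lists

try:
    parts[0]  # looks up index label 0, not element 0 of each list
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# .str[...] indexes element-wise into every list or string
suffix = parts.str[0].str.split("_").str[-1]
mask = suffix == select_string

kept = df_fp[mask]
dropped = df_fp[~mask]  # negate the finished mask, not the string Series
print(kept)
```

Note that `~` binds more tightly than `==` in Python, so an inline negation needs parentheses around the whole comparison, as in ~(suffix == select_string).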

What you are looking for is expand (it creates a new column for each element of the list produced by the split):

# column 3 holds the last piece for names shaped like long_file1_name_0
df[df['filenames'].str.split('.jpg', expand=True)[0].str.split('_', expand=True)[3] == '0']

Alternatively, you can do this with apply:

df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']

But contains is definitely more appropriate here.
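
For comparison, a sketch of the pattern-matching route mentioned above. The regex and the endswith variant are illustrative assumptions about the filename layout (names ending in _<suffix>.jpg), not something taken from the question, and the sample frame is made up:

```python
import re
import pandas as pd

# Hypothetical sample data; the last row checks that "10" is not mistaken for "0"
df = pd.DataFrame({'filenames': ['long_file1_name_0.jpg',
                                 'long_file2_name_1.jpg',
                                 'long_file3_name_0.jpg',
                                 'long_file4_name_10.jpg']})
select_string = '0'

# Regex version: underscore, the exact suffix, then ".jpg" at the end of the name
pattern = rf'_{re.escape(select_string)}\.jpg$'
by_contains = df[df['filenames'].str.contains(pattern)]

# Literal version, no regex involved
by_endswith = df[df['filenames'].str.endswith(f'_{select_string}.jpg')]
print(by_contains)
```

Anchoring on the leading underscore and the trailing .jpg keeps the match tied to the last piece of the name, which is what the chained split was meant to compare.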

Filtering a pandas DataFrame based on chained splits

3 answers: