I have a pandas DataFrame with a column of file names (the column is called filenames). The file names look like:
long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...
To filter, I do the following (say select_string = "0"):
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
But this throws:
Traceback (most recent call last):
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "python_file.py", line 118, in <module>
main()
File "inference.py", line 57, in main
_=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
logger=logger, select_string=select_string)
File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
return self._get_value(key)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 0
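I suspect it doesn't like the way I chain the splits, but I vaguely remember doing this a while ago and it worked, so I'm confused about why it raises this error.
PS: I do know how to solve this with .contains, but I would like to use this string-comparison approach.
Any pointers would be great!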
Answer 0 (score: 1)
Here is another approach, using .str.extract():
import pandas as pd
df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
'long_file2_name_1.jpg',
'long_file3_name_0.jpg',
'long_file3_name_33.jpg',]
})
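Now create a boolean mask. By default .str.extract() returns a one-column DataFrame; the squeeze() method turns it into a Series so that the mask works: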
mask = (df['filename'].str.extract( r'\w+_(\d+)\.jpg' )
.astype(int)
.eq(0)
.squeeze())
print(df.loc[mask])
filename
0 long_file1_name_0.jpg
2 long_file3_name_0.jpg
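A variation on the mask above, as a minimal sketch: with a single capture group, passing expand=False makes .str.extract() return a Series directly, so squeeze() is not needed. The names suffix and result are only illustrative, and the comparison is done on strings instead of casting to int.

```python
import pandas as pd

df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg']})

# One capture group + expand=False gives a Series, not a DataFrame.
# The dot is escaped and the pattern is anchored at the end of the name.
suffix = df['filename'].str.extract(r'_(\d+)\.jpg$', expand=False)

# Compare as strings; casting to int would fail on rows that do not
# match the pattern, because those rows hold NaN.
mask = suffix.eq('0')
result = df.loc[mask]
print(result)
```

Keeping the suffix as a string also matches the question, where select_string is a str.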
Answer 1 (score: 0)
This assumes every row contains .jpg; if not, split on just . instead.
select_string=str(0) #select string should be of type str
df_fp=df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str)==select_string]
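If the extension is not always .jpg, one possible vectorized variant is to split once from the right, first on the last dot and then on the last underscore. This is only a sketch: the mixed-extension file names, and the names stem, last_token and filtered, are made up for illustration.

```python
import pandas as pd

# Hypothetical data with mixed extensions.
df_fp = pd.DataFrame({"filenames": ["long_file1_name_0.jpg",
                                    "long_file2_name_1.png",
                                    "long_file3_name_0.jpeg"]})
select_string = "0"  # must be a str, as in the answer above

# Drop whatever follows the last dot, whatever the extension is.
stem = df_fp["filenames"].str.rsplit(".", n=1).str[0]

# Take the token after the last underscore of each stem.
last_token = stem.str.rsplit("_", n=1).str[-1]

filtered = df_fp[last_token == select_string]
print(filtered)
```

Using .str[0] and .str[-1] indexes into each row's list, which avoids the Python-level lambda while doing the same comparison.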
Answer 2 (score: 0)
This part:
df_fp["filenames"].str.split(".jpg")[0]
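looks up the row of the Series whose index label is 0 (usually the first row), not the first element of each list. If the index has no label 0, for example after earlier filtering, the lookup raises the KeyError: 0 shown in the question.
What you are looking for is expand (which creates a new column for each element of the list produced by split):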
df[df['filenames'].str.split('.jpg', expand=True)[0].str.split('_', expand=True)[3] == '0']
Alternatively, you can do this with apply:
df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']
But contains is definitely more appropriate here.
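The label-lookup behaviour described in this answer can be reproduced with a small sketch. It assumes a frame whose index no longer contains the label 0 (the index values 5, 6, 7 are made up), and it keeps the chained-split style from the question by using .str[...] to index into each row's list. It also shows that a negated comparison needs parentheses, since ~ binds more tightly than ==.

```python
import pandas as pd

# Hypothetical frame whose index has no label 0, e.g. after earlier filtering.
df_fp = pd.DataFrame(
    {"filenames": ["long_file1_name_0.jpg",
                   "long_file2_name_1.jpg",
                   "long_file3_name_0.jpg"]},
    index=[5, 6, 7],
)
select_string = "0"

# Each row of `parts` is a list such as ['long_file1_name_0', ''].
parts = df_fp["filenames"].str.split(".jpg")

# parts[0] is a label lookup on the Series index, so it fails here.
try:
    parts[0]
    raised = False
except KeyError:
    raised = True

# .str[0] and .str[-1] index into each row's list instead.
last_token = parts.str[0].str.split("_").str[-1]

keep = df_fp[last_token == select_string]
# Parentheses are required: ~ has higher precedence than ==.
drop = df_fp[~(last_token == select_string)]
print(raised, keep.index.tolist(), drop.index.tolist())
```

With a default RangeIndex the original expression would not raise a KeyError, but it would still pick the list from the first row only, so the later .split("_") would be called on a Python list and fail in a different way.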