I have two dataframes that look like this:
df_regex_test = pd.DataFrame(columns=['file_names', 'searched_for_found', 'everything'])
df_regex_test_temp = pd.DataFrame(columns=['file_names', 'searched_for_found', 'everything'])
Both are empty dataframes, e.g.:
Empty DataFrame
Columns: [file_names, searched_for_found, everything]
Index: []
I have a third dataframe that contains the actual data:
df_all_xml_mfiles = pd.merge(df_all_xml_data, files_only, left_on="file_names", right_on="file_names", how="inner")
df_all_xml_mfiles_tgther = df_all_xml_mfiles.groupby(['file_names', 'searched_for_found'])['everything'].apply(' '.join).reset_index()
I am doing the following:
for cc in range(0, len(file_names_only), 1):
    for bb in range(0, len(search_content_array), 1):
        regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].everything.str.findall('(<[^<]*?' + search_content_array[bb] + '[^>]*?>)', re.IGNORECASE)
        if not regex_stuff.empty:
            print('\n')
            df_regex_test_temp = df_regex_test_temp.append(regex_stuff, ignore_index=True, sort=True)
            print(df_regex_test_temp.head(5))
            df_regex_test_temp['searched_for_found'] = search_content_array[bb]
            df_regex_test_temp['file_names'] = file_names_only[cc]
            df_regex_test = df_regex_test.append(df_regex_test_temp, ignore_index=True, sort=False)
            df_regex_test_temp = df_regex_test_temp.iloc[0:0]
        if regex_stuff.empty:
            df_regex_test_temp = df_regex_test_temp.iloc[0:0]
I am writing the output file like this:
text_regex_test= df_regex_test.to_csv('C:\\somewhere\\regex_test.txt', sep='\t')
When I look at the output file, I see the following:
file_names searched_for_found everything 0 1
0 example_file.dtsx chair I like chairs. Chairs are nice.
1 example_file.dtsx desk I like desks. Desks are awesome.
2 example_file_2.dtsx chair Chairs are lame.
3 example_file_2.dtsx desk Desks are more fun than chairs.
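pandas created the columns '0' and '1', but I want all of that content to be in the 'everything' column.
What am I doing wrong? I thought it might have to do with the columns not lining up correctly, but as far as I can tell that is not the case.
This is the output I expect: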
file_names searched_for_found everything
0 example_file.dtsx chair I like chairs. Chairs are nice.
1 example_file.dtsx desk I like desks. Desks are awesome.
2 example_file_2.dtsx chair Chairs are lame.
3 example_file_2.dtsx desk Desks are more fun than chairs.
Edit #1:
If I comment out this line:
df_all_xml_mfiles_tgther[cc:cc+1].everything.str.findall('(<[^<]*?' + search_content_array[bb] + '[^>]*?>)', re.IGNORECASE)
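I don't have the problem. So it has something to do with this line. The columns are the same, so I'm not sure why it causes the problem.
Edit #2:
If I leave the line above in place but comment out the line below, I also don't have the problem. So it is something about this line: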
df_regex_test_temp = df_regex_test_temp.append(regex_stuff, ignore_index=True, sort=True)
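A possible explanation, sketched on made-up data: `str.findall` on a one-row slice returns a Series whose index label is the row position (`cc`), and when a Series is added to a DataFrame as a row, its index labels become column names. That would account for the '0' and '1' columns. The dataframe `df` and its contents below are hypothetical stand-ins for `df_all_xml_mfiles_tgther`, and `pd.DataFrame([series])` is used here only to mimic the "Series as a row" behavior, since `DataFrame.append` was removed in pandas 2.0.

```python
import re
import pandas as pd

# Hypothetical stand-in for df_all_xml_mfiles_tgther (made-up values).
df = pd.DataFrame({
    "file_names": ["a.dtsx", "b.dtsx"],
    "searched_for_found": ["chair", "chair"],
    "everything": ["<x chair='1'>", "<y Chair='2'>"],
})

pattern = "(<[^<]*?" + "chair" + "[^>]*?>)"

# One-row slice, as in the loop when cc == 1.
# The result is a Series named 'everything' with the single index label 1.
regex_stuff = df[1:2].everything.str.findall(pattern, re.IGNORECASE)

# Treating the Series as a ROW turns its index label (1) into a column name.
as_row = pd.DataFrame([regex_stuff])

# Converting it to a one-column frame keeps 'everything' as the column name.
as_column = regex_stuff.to_frame()

print(list(as_row.columns))
print(list(as_column.columns))
```

If this is indeed the cause, building the temporary frame from `regex_stuff.to_frame()` (or otherwise placing the matches into an 'everything' column explicitly) before combining should keep the results in one column.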