I am trying to find similarities between multiple .txt files. I have put all of these files in a dictionary, with the file name as the key.
Current code:
import pandas as pd
from os import listdir, chdir

path = r'C:\...path'
chdir(path)
files = [f for f in listdir(path)]
files_dict = {}
for filename in files:
    if filename.lower().endswith('.txt'):
        files_dict[filename] = pd.read_csv(filename).to_dict('split')

for key, value in files_dict.items():
    print(key + str(value) + '\n')
In this case the key is the file name, and the value is the header plus the data. I want to find out whether values are duplicated across multiple files, so that I can join them in SQL. I'm not sure how to go about this.
Edit: example file:
timestamp,Name,Description,Default Column Layout,Analysis View Name
00000000B42852FA,ADM_EIG,Administratief eigenaar,ADM_EIG,ADM_EIG
000000005880959E,OPZ,Opzeggingen,STANDAARD,
From the code:
Acc_ Schedule Name.txt{'index': [0, 1], 'columns': ['timestamp', 'Name', 'Description', 'Default Column Layout', 'Analysis View Name'], 'data': [['00000000B42852FA', 'ADM_EIG', 'Administratief eigenaar', 'ADM_EIG', 'ADM_EIG'], ['000000005880959E', 'OPZ', 'Opzeggingen', 'STANDAARD', nan]]}
Edit 2: the suggested code
from collections import Counter

for key, value in files_dict.items():
    data = value['data']
    counter = Counter([item for sublist in data for item in sublist])
    print([value for value, count in counter.items()])
Output: ['00000000B99BD831', 5050, 'CK102', '0,00000000000000000000', 'Thuiswonend', 0, '00000000B99BD832', ........
Answer 0 (score: 0)
Counter counts how often each item occurs, so it will tell you which items appear more than once. Pull the data out of the dictionary:
from collections import Counter
from math import nan  # nan needs to be defined for this sample data

data = [
    ['00000000B42852FA', 'ADM_EIG', 'Administratief eigenaar', 'ADM_EIG', 'ADM_EIG'],
    ['000000005880959E', 'OPZ', 'Opzeggingen', 'STANDAARD', nan]
]
You need to flatten the list of lists:
[item for sublist in data for item in sublist]
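As an aside, the same flattening can also be done with itertools.chain.from_iterable, which avoids the nested comprehension:
from itertools import chain
list(chain.from_iterable(data))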
Counter will then give you the frequency of each item:
>>> Counter([item for sublist in data for item in sublist])
Counter({'ADM_EIG': 3, '00000000B42852FA': 1, 'Administratief eigenaar': 1, '000000005880959E': 1, 'OPZ': 1, 'Opzeggingen': 1, 'STANDAARD': 1, nan: 1})
You can filter that as needed:
counter = Counter([item for sublist in data for item in sublist])
[value for value, count in counter.items() if count > 1]
which gives ['ADM_EIG'].
Edit to match the question's edit:
To look across all the files, collect all the data and then look for duplicates:
data = []
for key, value in files_dict.items():
    data.extend(value['data'])

counter = Counter([item for sublist in data for item in sublist])
print([value for value, count in counter.items() if count > 1])
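Going one step further toward the SQL-join goal, here is a minimal sketch (assuming files_dict has the 'split' structure shown in the question) that records which files each repeated value comes from, so you can see which files share values:
from collections import defaultdict

value_files = defaultdict(set)  # maps each value to the set of files it appears in
for key, value in files_dict.items():
    for row in value['data']:
        for item in row:
            value_files[item].add(key)

# values appearing in more than one file are candidate join keys
for item, sources in value_files.items():
    if len(sources) > 1:
        print(item, '->', sorted(sources))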
Answer 1 (score: 0)
If all the columns are the same in all files, I think you can use DataFrame.duplicated() along these lines:
import pathlib
import pandas as pd

def read_txt_files(dir_path):
    df_list = []
    for filename in pathlib.Path(dir_path).glob('*.txt'):
        df = pd.read_csv(filename, index_col=0)
        df['filename'] = filename.name  # just to save the filename as an optional key
        df_list.append(df)
    return pd.concat(df_list)

df = read_txt_files(r'C:\...path')  # probably you should change the path in this line
df.set_index('filename', append=True, inplace=True)
print(df)
Name Description ...
timestamp filename
00000000B42852FA first.txt ADM_EIG Administratief eigenaar ...
000000005880959E first.txt OPZ Opzeggingen ...
00000000B42852FA second.txt ADM_EIG Administratief eigenaar ...
000000005880959K second.txt XYZ Opzeggingen ...
So you can get a boolean mask of the duplicated rows:
df.duplicated(keep='first')
Out:
timestamp filename
00000000B42852FA first.txt False
000000005880959E first.txt False
00000000B42852FA second.txt True
000000005880959K second.txt False
dtype: bool
and use it to filter the data:
df[~df.duplicated(keep='first')]
Out:
Name Description ...
timestamp filename
00000000B42852FA first.txt ADM_EIG Administratief eigenaar ...
000000005880959E first.txt OPZ Opzeggingen ...
000000005880959K second.txt XYZ Opzeggingen ...
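If you would rather see every copy of a duplicated row instead of dropping the later ones, keep=False marks all occurrences:
print(df[df.duplicated(keep=False)])  # shows duplicated rows including the first occurrence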
Edit: an example with different columns in the different files, using the same approach. first.txt:
timestamp,Name,Descr,Column Layout,Analysis View Name
00000000B42852FA,ADM_EIG,Administratief eigenaar,ADM_EIG,ADM_EIG
000000005880959E,OPZ,Opzeggingen,STANDAARD,
second.txt:
timestamp,Descr,Default Column Layout,Analysis View Name
00000000B42852FA,Administratief,ADM_EIG,ADM_EIG
000000005880959K,Opzeggingen,STANDAARD,
third.txt:
timestamp,Descr,Default Column Layout,Analysis View Name
00000000B42852FA,Administratief eigenaar,ADM_EIG,ADM_EIG
000000005880959K,Opzeggingen,STANDAARD,
The last rows of second.txt and third.txt are duplicates.
Applying the same code:
...
print(df)
Out: # partial output, because the full frame is too wide
Analysis View Name Column Layout ...
timestamp filename
00000000B42852FA first.txt ADM_EIG ADM_EIG ...
000000005880959E first.txt NaN STANDAARD ...
00000000B42852FA second.txt ADM_EIG NaN ...
000000005880959K second.txt NaN NaN ...
00000000B42852FA third.txt ADM_EIG NaN ...
000000005880959K third.txt NaN NaN ...
Missing values (where a .txt file does not have a given column) are filled with NaN. Now find the duplicated rows:
df.duplicated(keep='first')
Out:
timestamp filename
00000000B42852FA first.txt False
000000005880959E first.txt False
00000000B42852FA second.txt False
000000005880959K second.txt False
00000000B42852FA third.txt False
000000005880959K third.txt True
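From that same mask you can also list which files contain duplicated rows, for example via the 'filename' level of the MultiIndex built above (a sketch under those assumptions):
mask = df.duplicated(keep=False)
print(df[mask].index.get_level_values('filename').unique())  # files holding duplicated rows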