I have a data file with the following contents:
Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
Part#2
A 30 30 30 10 10 20 20 20 10 10 10
B 10 10 20 10 10 30 10 30 10 30 10
Part#3
A 10 20 10 30 10 20 10 20 10 20 10
B 10 10 20 20 20 30 10 10 20 20 30
From this, I would like to build a dictionary of dictionaries holding summary counts for each letter, so it would look like this:
dictionary = {'Part#1': {'A': {10: 6, 20: 3, 30: 2},
                         'B': {10: 6, 20: 2, 30: 3}},
              'Part#2': {'A': {10: 5, 20: 3, 30: 3},
                         'B': {10: 7, 20: 1, 30: 3}},
              'Part#3': {'A': {10: 6, 20: 4, 30: 1},
                         'B': {10: 4, 20: 5, 30: 2}}}
If I want to display a part, it should give me output like this:
dictionary[Part#1]
A
10: 6
20: 3
30: 2
B
10: 6
20: 2
30: 3
...and so on for the remaining partitions in the file.
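The display described above could be produced by a small helper along these lines. This is only a sketch: `format_part` is a hypothetical name, and the `dictionary` literal is typed in by hand from the Part#1 counts listed above rather than computed from the file.

```python
def format_part(summary, part):
    """Return the display lines for one part: each letter, then its 'value: count' pairs."""
    lines = []
    for letter, counts in summary[part].items():
        lines.append(letter)
        for value in sorted(counts):
            lines.append(f"{value}: {counts[value]}")
    return lines


# Hand-written sample matching the Part#1 summary above (not parsed from the file)
dictionary = {
    "Part#1": {
        "A": {10: 6, 20: 3, 30: 2},
        "B": {10: 6, 20: 2, 30: 3},
    }
}

print("\n".join(format_part(dictionary, "Part#1")))
```

Returning a list of lines instead of printing directly keeps the helper easy to check and to reuse for the other parts.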
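So far I have been able to parse the file from txt to csv and convert it into a dictionary, let's call it the outer dictionary. I have tried several ways of inspecting the output, and so far the following code comes closest (though not all the way) to the structure I described above.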
partitions_dict = df.head(5).to_dict(orient='list')
print(partitions_dict)
Output:
{0: ['A', 'B', 'A', 'B', 'A'], 1: ['10', '10', '10', '10', '10'], 2: [10, 10, 10, 10, 10], 3: [10, 10, 10, 10, 10], 4: [10, 10, 10, 10, 10], 5: [10, 10, 10, 10, 10], 6: [10, 10, 10, 10, 10], 7: [10, 10, 10, 10, 10]
The function I use to parse the file:
import csv
import os

import pandas as pd

def fileFormatConverter(txt_file):
    """ Receives a generated text file of partitions as a parameter
    and converts it into csv format.
    input: text file
    return: csv file """
    filename, ext = os.path.splitext(txt_file)
    csv_file = filename + ".csv"
    # Context managers close both files, so the CSV is fully written before it is read back
    with open(txt_file, "r", newline="") as src, open(csv_file, "w", newline="") as dst:
        in_txt = csv.reader(src, delimiter=' ')
        out_csv = csv.writer(dst)
        out_csv.writerows(in_txt)
    return csv_file
# removes "Part#0" as a header from the dataframe
df_traces = pd.read_csv(fileFormatConverter("sample.txt"), skiprows=1, header=None) #, error_bad_lines=False)
df_traces.head()
Output:
0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 A, 10, 20, 10, 10, 30, 10, 20, 10, 30, ... 20, 10, 10, 30, 10, 30, 10, 20, 30.0 NaN
1 Part#2 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 A, 30, 30, 30, 10, 10, 20, 20, 20, 10, ... 20, 10, 10, 30, 10, 30, 10, 30, 10.0 NaN
3 Part#3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 A, 10, 20, 10, 30, 10, 20, 10, 20, 10, ... 20, 20, 20, 30, 10, 10, 20, 20, 30.0 NaN
I used a function to change the headers, so that the letters inside each partition are easier to work with:
def changeDFHeaders(df):
    df_transpose = df.T
    new_header = df_transpose.iloc[0]  # stores the first row for the header
    df_transpose = df_transpose[1:]  # take the data less the header row
    df_transpose.columns = new_header  # set the header row as the df header
    return df_transpose
# The counter column serves as an index for the entire dataframe
#df_transpose['counter'] = range(len(df_transpose)) # adds the counter for rows column
#df_transpose.set_index('counter', inplace=True)
df_transpose_headers = changeDFHeaders(df_traces)
df_transpose_headers.infer_objects()
Output:
A, Part#2 A, Part#3 A,
1 10, NaN 30, NaN 10,
2 20, NaN 30, NaN 20,
3 10, NaN 30, NaN 10,
4 10, NaN 10, NaN 30,
5 30, NaN 10, NaN 10,
6 10, NaN 20, NaN 20,
7 20, NaN 20, NaN 10,
8 10, NaN 20, NaN 20,
9 30, NaN 10, NaN 10,
10 10, NaN 10, NaN 20,
11 20, NaN 10, NaN 10,
12 B, NaN B, NaN B,
13 10, NaN 10, NaN 10,
14 10, NaN 10, NaN 10,
15 20, NaN 20, NaN 20,
16 10, NaN 10, NaN 20,
17 10, NaN 10, NaN 20,
18 30, NaN 30, NaN 30,
19 10, NaN 10, NaN 10,
20 30, NaN 30, NaN 10,
21 10, NaN 10, NaN 20,
22 20, NaN 30, NaN 20,
23 30 NaN 10 NaN 30
24 NaN NaN NaN NaN NaN
This is still not quite right, as you can see if you check the following statements:
df = df_transpose_headers
partitions_dict = df.head(5).to_dict(orient='list')
print(partitions_dict)
Output:
{'A,': ['10,', '20,', '10,', '30,', '10,'], 'Part#2': [nan, nan, nan, nan, nan], 'Part#3': [nan, nan, nan, nan, nan]}
Answer 0: (score: 2)
I would avoid pandas, only because I don't know it very well:
from collections import Counter

result = {}
part = ""
group = ""
for line in f:  # f being an open file
    sline = line.strip()
    if sline.startswith("Part"):
        part = sline
        result[part] = {}
        continue
    group = sline.split()[0]
    result[part][group] = Counter(sline.split()[1:])
The result has the following form:
{'Part#1': {'A': Counter({'10': 6, '20': 3, '30': 2}), 'B': Counter({'10': 6, '30': 3, '20': 2})},
'Part#2': {'A': Counter({'10': 5, '30': 3, '20': 3}), 'B': Counter({'10': 7, '30': 3, '20': 1})},
'Part#3': {'A': Counter({'10': 6, '20': 4, '30': 1}), 'B': Counter({'20': 5, '10': 4, '30': 2})}}
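For a quick check without a file on disk, the loop above can be wrapped in a function and fed an `io.StringIO`. The sketch below is a variant under a few assumptions, not the code above verbatim: `summarize` and `SAMPLE` are made-up names, blank lines are skipped, and the values are converted to `int` and each `Counter` to a plain `dict` so the result matches the structure asked for in the question.

```python
import io
from collections import Counter

# First two parts of the sample data from the question
SAMPLE = """\
Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
Part#2
A 30 30 30 10 10 20 20 20 10 10 10
B 10 10 20 10 10 30 10 30 10 30 10
"""


def summarize(lines):
    """Build {part: {group: {value: count}}} from an iterable of text lines."""
    result = {}
    part = None
    for line in lines:
        sline = line.strip()
        if not sline:
            continue  # ignore blank lines
        if sline.startswith("Part"):
            part = sline
            result[part] = {}
            continue
        group, *values = sline.split()
        # int keys and a plain dict instead of a Counter of strings
        result[part][group] = dict(Counter(int(v) for v in values))
    return result


summary = summarize(io.StringIO(SAMPLE))
print(summary["Part#2"]["B"])
```

Because `summarize` accepts any iterable of lines, an open file object can be passed in the same way as the `StringIO` here.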
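If you start directly from a file with no line breaks, you can split on "Part" to find each section, and then use the index of "B" to separate the two groups: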
result = {}
# f is the whole file content as a single string here, not a file object
sf = f.split("Part")[1:]  # drop the empty first part
for line in sf:
    line = line.strip()  # remove surrounding whitespace
    sline = line.split()  # split on whitespace
    result["Part%s" % sline[0]] = {}
    # Use the index of B to split the value lists
    result["Part%s" % sline[0]][sline[1]] = Counter(sline[2:sline.index("B")])
    result["Part%s" % sline[0]]["B"] = Counter(sline[sline.index("B") + 1:])
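The snippet above relies on every section having exactly two groups, the second one named "B". If that is not guaranteed, a regular-expression variant might look like the following. It is only a sketch under my own assumptions: `summarize_flat` is a hypothetical name, section labels are assumed to look like `Part#<digits>`, and group names are assumed to be single capital letters. The sample `text` is shortened and made up for illustration.

```python
import re
from collections import Counter


def summarize_flat(text):
    """Count values per section and group in a single-line 'Part#N A ... B ...' string."""
    result = {}
    # Each match is one section: its label, then everything up to the next label or the end
    for label, body in re.findall(r"(Part#\d+)\s+(.*?)(?=Part#|$)", text, flags=re.S):
        groups = {}
        # Inside a section, a capital letter starts a group, followed by its numbers
        for letter, numbers in re.findall(r"([A-Z])((?:\s+\d+)*)", body):
            groups[letter] = Counter(numbers.split())
        result[label] = groups
    return result


# Shortened, made-up sample in the single-line layout
text = "Part#1 A 10 20 10 B 10 30 Part#2 A 30 30 B 10 10 20"
flat = summarize_flat(text)
print(flat["Part#1"]["A"])
```

As in the loop above, the counted values stay strings; wrap them in `int()` if numeric keys are needed.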
Answer 1: (score: 0)
The input file is:
Part#1
A 10 20 10 10 30 10 20 10 30 10 20
B 10 10 20 10 10 30 10 30 10 20 30
Part#2
A 30 30 30 10 10 20 20 20 10 10 10
B 10 10 20 10 10 30 10 30 10 30 10
Part#3
A 10 20 10 30 10 20 10 20 10 20 10
B 10 10 20 20 20 30 10 10 20 20 30
This should work:
def parse_file(file_name):
    return_dict = dict()
    section = str()
    with open(file_name, "r") as source:
        for line in source.readlines():
            if "#" in line:
                section = line.strip()
                return_dict[section] = dict()
                continue
            tmp = line.strip().split()
            group = tmp.pop(0)
            return_dict[section][group] = dict()
            for item in tmp:
                if item in return_dict[section][group].keys():
                    return_dict[section][group][item] += 1
                else:
                    return_dict[section][group][item] = 1
    return return_dict
Output:
{'Part#1': {'A': {'10': 6, '20': 3, '30': 2},
'B': {'10': 6, '20': 2, '30': 3}},
'Part#2': {'A': {'10': 5, '20': 3, '30': 3},
'B': {'10': 7, '20': 1, '30': 3}},
'Part#3': {'A': {'10': 6, '20': 4, '30': 1},
'B': {'10': 4, '20': 5, '30': 2}}}
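To be honest, I don't see why the intermediate stage is needed: if you have to parse the file once to create the CSV anyway, you can put the logic that builds the dict() right there. So I apologize if I have missed a subtlety of the question.
Edit: answer reworked based on the comments, i.e. the input file is actually a single line.
So the input file is: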
Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30
The following modified code will work:
import string
from pprint import pprint

def parse_file2(file_name):
    return_dict = dict()
    section = None
    group = None
    with open(file_name, "r") as source:
        for line in source.readlines():
            tmp_line = line.strip().split()
            for token in tmp_line:
                if "#" in token:
                    section = token
                    return_dict[section] = dict()
                    continue
                elif token in string.ascii_uppercase:
                    group = token
                    return_dict[section][group] = dict()
                    continue
                if section and group:
                    if token in return_dict[section][group].keys():
                        return_dict[section][group][token] += 1
                    else:
                        return_dict[section][group][token] = 1
    return return_dict

if __name__ == "__main__":
    pprint(parse_file(file_name))
    pprint(parse_file2(file_name2))
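Note that this function is written specifically for the file format mentioned in the comments. If the file does not actually follow that format, it may blow up.
Based on the question, though, this should work.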
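If a clear error is preferable to a silent miscount or an obscure `KeyError`, the token loop could be made stricter. The following is only a sketch under my own assumptions: `parse_tokens` is a hypothetical name, section tokens are assumed to start with `Part#`, groups are assumed to be single capital letters, and values are assumed to consist of digits only. One reason to tighten the group test is that `token in string.ascii_uppercase` is a substring check, so a token such as "AB" would also be accepted as a group.

```python
def parse_tokens(tokens):
    """Count values per section and group, raising ValueError on unexpected input."""
    result = {}
    section = None
    group = None
    for token in tokens:
        if token.startswith("Part#"):
            # New section: forget the group of the previous section
            section, group = token, None
            result[section] = {}
        elif len(token) == 1 and token.isalpha() and token.isupper():
            if section is None:
                raise ValueError(f"group {token!r} appears before any section")
            group = token
            result[section][group] = {}
        elif token.isdigit():
            if group is None:
                raise ValueError(f"value {token!r} appears outside of a group")
            counts = result[section][group]
            counts[token] = counts.get(token, 0) + 1
        else:
            raise ValueError(f"unrecognized token {token!r}")
    return result


# Shortened, made-up sample in the single-line layout
line = "Part#1 A 10 20 10 B 30 30 Part#2 A 10"
print(parse_tokens(line.split()))
```

Working on a list of tokens rather than a file name keeps the function easy to test; for a real file, something like `parse_tokens(open(path).read().split())` would feed it the same way.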
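Also, if you could simplify the question above so that it states the actual file contents and the desired result, or just says "I have structure A and want to convert it to structure B", I will clean up all the history in this post and leave a simpler answer.
Hope this helps! :)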