我有一个外部* .xls文件,我已将其转换为一个包含数据块的* .csv文件,例如:
"Legend number one";;;;Number of items;6
X;-358.6806792;-358.6716338;;;
Y;0.8767189;0.8966855;Avg;;50.1206378
Z;-0.7694626;-0.7520983;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
;;;;;
有很多街区。
每个块可能包含一些额外的行数据;
"Legend number six";;;;Number of items;19
X;-358.6806792;-358.6716338;;;
Y;0.8767189;0.8966855;Avg;;50.1206378
Z;-0.7654644;-0.75283;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
A;0;1;Value;;0
B;1;0;;;
;;;;;
结构是这样的:新的空行分隔每个块,即';;;;;;'在我的样本中。
此后的第一行以该块的唯一标识符开头。
似乎每行包含6个元素,例如key1;elem1;elem2;key2;elem3;elem4
,可以很好地表示为两条不同的行上的两个3元素向量key1;elem1;elem2
和key2;elem3;elem4
。第二个示例的示例:
"Legend number six";;
;;Number of items;19
X;-358.6806792;-358.6716338;
;;
Y;0.8767189;0.8966855;
Avg;;50.1206378
Z;-0.7654644;-0.75283;
Std;;-0.0010354
D;8.0153902;8;
Err;;1.010385
A;0;1;
Value;;0
B;1;0;
;;
;;;;;
有些是空的,但我暂时不想将其丢弃。 但我想结束一个DataFrame,其中包含每个数据块的列元素。
使用此Python代码,我最终得到了一个更有条理的“词典列表”:
import os, sys, re, glob
import pandas as pd
csvFile = os.path.join(workingDir,'file.csv')
h = 0 # Number of lines to skip in head
s = 2 # number of values per key
s += 1
str1 = 'Number of items'
# Reading file in a global list and storing each line in a sublist:
A = [line.split(';') for line in open(csvFile).read().split('\n')]
# This code splits each 6-elements sublist in one new sublist
# containing two-elements; each element with 3 values:
B = [(';'.join(el[:s])+'\n'+';'.join(el[s:])).split('\n') for el in A]
# Init empty structures:
names = [] # to store block unique identifier (the name in the legend)
L = [] # future list of dictionnaries
for el in (B):
for idx,elj in enumerate(el):
vi = elj.split(';')[1:]
# Here we grep the name only when the 2nd element of
# the first line contains the string "Number of items",
# which is constant all over the file:
if len(vi)>1 and vi[0]==str1:
name = el[idx-1].split(';')[0]
names.append(name)
#print(name)
# We loop again over B to append in a new list one dictionary
# per vector of 3 elements because each vector of 3 elements
# is structured like ; key;elem1;elem2
for el in (B):
for elj in (el):
k = elj.split(';')[0]
v = elj.split(';')[1:]
# Little tweak because the key2;elem3;elem4 of the
# first line (the one containing the name) have the
# key in the second place like "elem3;key2;elem4" :
if len(v)>1 and v[0]==str1:
kp = v[0]
v = [v[1],k]
k = kp
if k!='':
dct = {k:v}
L.append(dct)
到目前为止,我无法提取名称作为全局标识符,而将集团的所有值提取为变量。我无法使用基于模的技术,因为即使每个数据块中至少包含一些公用键,每个单独的数据块中的信息数量也会变化。
我还在每个字典的while
循环中尝试了for
条件,但是现在很混乱。
zip
可能是一个不错的选择,但我真的不知道如何正确使用它。
理想情况下,我希望最终看起来应该类似于包含以下内容的DataFrame;
index 'Number of items' 'X' '' 'Y' 'Avg' 'Z' 'Std' ...
"Legend number one" 6 ...
"Legend number six" 19 ...
"Legend number 11" 6 ...
"Legend number 15" 18 ...
列名是键,表包含一行中每个数据块的值。
是否有编号索引和带有“图例名称”的新列;没关系。
"Legend number one";;;;Number of items;6
X;8.6806792;8.6716338;;;
Y;0.1557;0.1556;Avg;;50.1206378
Z;-0.7859;-0.7860;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
;;;;;
"Legend number six";;;;Number of items;19
X;56.6806792;56.6716338;;;
Y;0.1324;0.1322;Avg;;50.1206378
Z;-0.7654644;-0.75283;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
A;0;1;Value;;0
B;1;0;;;
;;;;;
"Legend number 11";;;;Number of items;6
X;358.6806792;358.6716338;;;
Y;0.1324;0.1322;Avg;;50.1206378
Z;-0.7777;-0.7778;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
;;;;;
"Legend number 15";;;;Number of items;18
X;58.6806792;58.6716338;;;
Y;0.1324;0.1322;Avg;;50.1206378
Z;0.5555;0.5554;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
A;0;1;Value;;0
B;1;0;;;
C;0;0;k;1;0
;;;;;
我正在使用Ubuntu和Python 3.6,但该脚本也必须在Windows计算机上工作。
答案 0 :(得分:0)
将此添加到先前的代码中应该可以很好地工作:
for elem in L:
for key,val in elem.items():
if key in names:
name = key
Dict2 = {}
else:
Dict2[key] = val
Dict1[name] = Dict2
df1 = pd.DataFrame.from_dict(Dict1, orient='index')
df2 = pd.DataFrame(index=df1.index)
for col in df1.columns:
colS = df1[col].apply(pd.Series)
colS = colS.rename(columns = lambda x : col+'_'+ str(x))
df2 = pd.concat([df2[:], colS[:]], axis=1)
df2.to_csv('output.csv', sep=',', index=True, header=True)
可能还有许多其他方法...
此链接很有帮助:
https://chrisalbon.com/python/data_wrangling/pandas_expand_cells_containing_lists/