A .txt file contains the following data:
Ord
I would like to read it into a pandas DataFrame with the following structure:
LC xx1
Name y1 y2 y3
A 10 12 13
B 9 11 15
C 7 15 16
LC xy2
Name y1 y2 y3
A 11 12 19
B 20 37 20
C 40 15 1
Does anyone have an idea how to do this programmatically? I have to do it for a large file of about 10 MB.
Thanks.
J.A.
Answer 0: (score: 0)
You can use:
from io import StringIO
import pandas as pd

temp = """LC xx1
Name y1 y2 y3
A 10 12 13
B 9 11 15
C 7 15 16
LC xy2
Name y1 y2 y3
A 11 12 19
B 20 37 20
C 40 15 1"""
# after testing, replace StringIO(temp) with 'filename.txt'
# set the names parameter according to the number of columns
df = pd.read_csv(StringIO(temp), sep=r"\s+", names=range(4))
print(df)
      0    1    2    3
0    LC  xx1  NaN  NaN
1  Name   y1   y2   y3
2     A   10   12   13
3     B    9   11   15
4     C    7   15   16
5    LC  xy2  NaN  NaN
6  Name   y1   y2   y3
7     A   11   12   19
8     B   20   37   20
9     C   40   15    1
# set the column names from the second row
df.columns = df.iloc[1]
# remove the columns' name (1, left over from the row label)
df.columns.name = None
# build a mask by comparing the first column with 'LC'
mask = df['Name'] == 'LC'
# create a new column from the masked values, forward filling the gaps
df.insert(0, 'LC', df['y1'].where(mask).ffill())
# remove the LC rows and the repeated header rows
df = df[~mask & (df['Name'] != 'Name')].reset_index(drop=True)
print(df)
    LC Name  y1  y2  y3
0  xx1    A  10  12  13
1  xx1    B   9  11  15
2  xx1    C   7  15  16
3  xy2    A  11  12  19
4  xy2    B  20  37  20
5  xy2    C  40  15   1
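One thing to be aware of with the approach above: because every column originally mixed labels and numbers, y1, y2 and y3 are still strings after the filtering. A minimal sketch that wraps the same mask and forward-fill steps in a function and converts the value columns with pd.to_numeric might look like this. The function name read_lc_blocks and the n_cols parameter are made up for illustration, and the sketch assumes the second line of the input is always the header row.

```python
from io import StringIO

import pandas as pd


def read_lc_blocks(source, n_cols=4):
    """Hypothetical helper: read 'LC' blocks into one tidy DataFrame."""
    # Read everything as strings so mixed columns are handled uniformly.
    raw = pd.read_csv(source, sep=r"\s+", names=range(n_cols), dtype=str)
    # Assumption: the second row holds the column names.
    header = raw.iloc[1].tolist()
    raw.columns = header
    first, second = header[0], header[1]
    is_lc = raw[first] == "LC"          # rows such as "LC xx1"
    is_header = raw[first] == first     # repeated "Name y1 y2 y3" rows
    # Keep the block label only on LC rows, then fill it downwards.
    raw.insert(0, "LC", raw[second].where(is_lc).ffill())
    out = raw[~is_lc & ~is_header].reset_index(drop=True)
    # Convert the value columns from strings to numbers.
    value_cols = header[1:]
    out[value_cols] = out[value_cols].apply(pd.to_numeric)
    return out


sample = StringIO(
    "LC xx1\nName y1 y2 y3\nA 10 12 13\nB 9 11 15\nC 7 15 16\n"
    "LC xy2\nName y1 y2 y3\nA 11 12 19\nB 20 37 20\nC 40 15 1\n"
)
df = read_lc_blocks(sample)
print(df.dtypes)
```

For a real file, a path such as 'filename.txt' can be passed in place of the StringIO object.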
Another solution, in pure Python:
import pandas as pd

items = []
cols = []
with open('file.txt') as f:
    LC = ''
    # loop over each line
    for i, line in enumerate(f):
        # remove the trailing newline character and split on whitespace
        parts = line.rstrip('\n').split()
        # store the column names (second line of the file)
        if i == 1:
            cols = parts
        # store the value that follows LC
        if len(parts) == 2 and parts[0] == 'LC':
            LC = parts[1]
        # store each data line, skipping empty lines and header rows
        elif len(parts) > 2 and parts[0] != 'Name':
            items.append([LC] + parts)
# create the DataFrame
df = pd.DataFrame(items, columns=['LC'] + cols)
# if necessary, convert the value columns to integers
df[cols[1:]] = df[cols[1:]].astype(int)
print(df)
    LC Name  y1  y2  y3
0  xx1    A  10  12  13
1  xx1    B   9  11  15
2  xx1    C   7  15  16
3  xy2    A  11  12  19
4  xy2    B  20  37  20
5  xy2    C  40  15   1
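Since the question mentions a 10 MB file, the line-by-line idea can also be written as a function that accepts any iterable of lines, so an open file handle is streamed rather than loaded at once. The sketch below is only one possible variant: the name parse_lc_lines is hypothetical, and it assumes that header rows always start with 'Name', that block labels have the form 'LC <label>', and that at least one header row is present. Unlike the loop above, it takes the column names from the first 'Name' row it meets instead of relying on the header being the second line.

```python
import pandas as pd


def parse_lc_lines(lines):
    """Hypothetical helper: build a tidy DataFrame from an iterable of lines."""
    rows = []
    cols = None
    current_lc = None
    for line in lines:
        parts = line.split()
        if not parts:
            continue                      # skip blank lines
        if parts[0] == "LC" and len(parts) == 2:
            current_lc = parts[1]         # start of a new block
        elif parts[0] == "Name":
            if cols is None:
                cols = parts              # keep the first header row only
        else:
            rows.append([current_lc] + parts)
    df = pd.DataFrame(rows, columns=["LC"] + cols)
    # Convert the value columns from strings to numbers.
    value_cols = cols[1:]
    df[value_cols] = df[value_cols].apply(pd.to_numeric)
    return df


sample_lines = [
    "LC xx1", "Name y1 y2 y3", "A 10 12 13", "B 9 11 15", "C 7 15 16",
    "LC xy2", "Name y1 y2 y3", "A 11 12 19", "B 20 37 20", "C 40 15 1",
]
result = parse_lc_lines(sample_lines)
print(result)
```

With a real file, the call would be along the lines of opening 'file.txt' in a with block and passing the file object to parse_lc_lines.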