I am trying to change the structure of data read from a text file (.txt). The data looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
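I want to convert it into this format (like a pivot table in Excel, where the column names are the characters between the ":" and every group always starts with :1:):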
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H         I    J
Does anyone know how to do this? Thanks in advance.
Answer 0 (score: 1)
First create a DataFrame with read_csv and header=None, because the file has no header row:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
from io import StringIO  # pd.compat.StringIO is not available in recent pandas versions
df = pd.read_csv(StringIO(temp), header=None)
print (df)
0
0 :1:A
1 :2:B
2 :3:C
3 :1:D
4 :2:E
5 :3:F
6 :4:G
7 :1:H
8 :3:I
9 :4:J
Extract the original column with DataFrame.pop, strip the leading ":" with Series.str.strip, and split the values into 2 new columns with Series.str.split. Then create the groups by comparing column a with the string '1' using Series.eq (the method form of ==) and taking Series.cumsum, build a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print (df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H     I  J
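To make the grouping step easier to follow, here is a minimal sketch that prints the intermediate group ids and then relabels the result in the `:n:` form the question asks for. The names `lines`, `group` and `out` are my own and not part of the answer, and the sketch assumes the lines have already been read from the file.

```python
import pandas as pd

# Sample lines from the question (assumed to be already read from the file).
lines = [":1:A", ":2:B", ":3:C", ":1:D", ":2:E",
         ":3:F", ":4:G", ":1:H", ":3:I", ":4:J"]

# Split ':1:A' into key '1' and value 'A'; n=1 keeps any ':' inside the value.
df = pd.Series(lines).str.strip(":").str.split(":", n=1, expand=True)
df.columns = ["a", "b"]

# Every ':1:' line starts a new group, so a running count of them is the group id.
group = df["a"].eq("1").cumsum().rename("Group")
print(group.tolist())

# One row per group, one column per key, '' where a group has no such key.
out = df.set_index([group, "a"])["b"].unstack(fill_value="")
out.columns = [f":{c}:" for c in out.columns]
print(out)
```

Because the group id is derived only from where `:1:` appears, a group that skips a key (such as the third group, which has no `:2:`) simply gets an empty cell in that column.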
Answer 1 (score: 0)
Use:
# Reading text file (assuming stored in CSV format, you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe. Later dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index = range(1, res.shape[0] + 1)
res.index.name = 'Group'  # set the name after replacing the index, otherwise it is lost
res
Group :1: :2: :3: :4:
1 A B C
2 D E F G
3 H I J
Answer 2 (score: 0)
import pandas as pd

# read the file
with open("t.txt") as f:
    content = f.readlines()
# Build a dictionary with the column names (e.g. ':1:') as keys
# and the list of row values (e.g. 'A') as values.
my_dict = {}
for v in content:
    key = v[0:3]   # take the key, e.g. ':1:'
    value = v[3]   # take the single-character value, e.g. 'A'
    my_dict.setdefault(key, []).append(value)
# convert dictionary to dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
df
  :1:   :2: :3:   :4:
0   A     B   C     G
1   D     E   F     J
2   H  None   I  None
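Note that in the output above the values are stacked per column, so G and J end up in rows 0 and 1 instead of in the groups they came from. As a hedged alternative, the sketch below starts a new row whenever a `:1:` line is seen, using only the standard library. The function name `pivot_lines` and the variable names are hypothetical, and sorting the columns assumes the keys between the colons are integers.

```python
def pivot_lines(lines):
    """Group ':key:value' lines into rows; each ':1:' line starts a new row."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # ':1:A' -> ['', '1', 'A']; maxsplit=2 keeps any ':' inside the value.
        _, key, value = line.split(":", 2)
        if key == "1" or not rows:
            rows.append({})
        rows[-1][f":{key}:"] = value
    return rows


sample = [":1:A", ":2:B", ":3:C", ":1:D", ":2:E",
          ":3:F", ":4:G", ":1:H", ":3:I", ":4:J"]
rows = pivot_lines(sample)

# Collect every column seen, ordered numerically (assumes integer keys).
columns = sorted({k for row in rows for k in row},
                 key=lambda c: int(c.strip(":")))

print("Group", *columns, sep="\t")
for i, row in enumerate(rows, start=1):
    print(i, *(row.get(c, "") for c in columns), sep="\t")
```

If a DataFrame is needed afterwards, the list of dictionaries in `rows` can be passed to `pd.DataFrame(rows, columns=columns)`, which leaves missing keys as NaN.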