Pandas: read a text file and split the names into columns based on the first character

Date: 2019-07-05 15:39:18

Tags: python-3.x pandas

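Hi, I am looking for a way to read a text file with pandas and place its lines into separate columns based on the first character of each line.

Here is the text file: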

$ cat file.txt
AAAAAA
AAAAAA
AAAAAA
AAAAAA
AAAAAA
BBBBBB
BBBBBB
BBBBBB
BBBBBB
BBBBBB
CCCCCC
CCCCCC
CCCCCC
CCCCCC
CCCCCC
DDDDDD
DDDDDD
DDDDDD
DDDDDD
DDDDDD
EEEEEE
EEEEEE
EEEEEE
EEEEEE
EEEEEE
FFFFFF
FFFFFF
FFFFFF
FFFFFF
FFFFFF

Desired output:

COL_1   COL_2   COL_3   COL_4   COL_5   COL_6
AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF

3 answers:

Answer 0 (score: 3)

This may not be the best approach:

import pandas as pd

# notice the header=None option: the file has no header row,
# so the single column is labelled 0
df = pd.read_csv('file.txt', header=None)

# extract the first character of the string
df['start'] = df[0].str[0]

# group by the first character of the string
# cumcount gives you the order/rank of the row within its group
df['idx'] = df.groupby('start').cumcount()

# pivot - search StackOverflow for 47152691
df.pivot(index='idx', columns='start', values=0)

Output:

start       A       B       C       D       E       F
idx                                                  
0      AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
1      AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
2      AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
3      AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
4      AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
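
The pivot above labels the columns A to F and names the index idx, whereas the question asks for COL_1 to COL_6. A minimal sketch of one way to rename them follows. It assumes an in-memory string stands in for file.txt, and the names `text`, `value`, and `wide` are illustrative, not from the original answer. Note that `read_csv` splits on commas by default, so this only suits lines that contain none.

```python
import io

import pandas as pd

# Hypothetical stand-in for file.txt: two rows per letter instead of five.
text = "AAAAAA\nAAAAAA\nBBBBBB\nBBBBBB\nCCCCCC\nCCCCCC\n"

# names=["value"] is an illustrative label for the single column.
df = pd.read_csv(io.StringIO(text), header=None, names=["value"])

df["start"] = df["value"].str[0]            # first character of each line
df["idx"] = df.groupby("start").cumcount()  # position of the row within its group

wide = df.pivot(index="idx", columns="start", values="value")

# Replace the letter labels with COL_1, COL_2, ... in sorted-letter order
# and drop the leftover "idx" index name.
wide.columns = [f"COL_{i}" for i in range(1, len(wide.columns) + 1)]
wide = wide.rename_axis(None)

print(wide)
```

The column numbers follow the sorted order of the first characters, which matches file order here only because the sample file is already sorted.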

Answer 1 (score: 3)

Using from_dict:

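import pandas as pd

# collect the lines into lists keyed by their first character
d = {}
with open('file.txt') as f:
    for line in f.read().splitlines():
        d.setdefault(line[0], []).append(line)

# one row per key, then transpose so that each key becomes a column
pd.DataFrame.from_dict(d, orient='index').T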

        A       B       C       D       E       F
0  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
1  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
2  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
3  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
4  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF

Similar, but without reading the whole file at once:

d = {}
with open('file.txt') as f:
    # iterating over the file object yields one line at a time
    for line in f:
        d.setdefault(line[0], []).append(line.strip('\n'))

pd.DataFrame.from_dict(d, orient='index').T
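
The sample file has exactly five lines per letter, so it does not show what happens when the groups differ in size or the file contains blank lines. The sketch below is a hedged variation on the same from_dict idea: it assumes an in-memory stream in place of file.txt, skips empty lines (with the `splitlines()` version above, an empty line would make `line[0]` raise an IndexError), and relies on from_dict padding the shorter groups with missing values. The names `lines`, `groups`, and `wide` are illustrative.

```python
import io

import pandas as pd

# Hypothetical stand-in for file.txt: uneven groups plus one blank line.
lines = io.StringIO("AAAAAA\nAAAAAA\n\nBBBBBB\n")

groups = {}
for line in lines:
    line = line.rstrip("\n")
    if not line:          # skip blank lines instead of indexing into them
        continue
    groups.setdefault(line[0], []).append(line)

# orient="index" builds one row per key and pads the shorter lists;
# transposing turns each key into a column.
wide = pd.DataFrame.from_dict(groups, orient="index").T

print(wide)
```

Column B ends up with one real value and one missing value, so downstream code may need `dropna` or `fillna` depending on how the gaps should be treated.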

Answer 2 (score: 3)

Another way (assuming Col is the column name):

m = df.assign(k=(pd.factorize(df.Col)[0] + 1).astype(str)).groupby('k')['Col'].apply(list)
pd.DataFrame(m.values.tolist(), index='Col_' + m.index).T

    Col_1   Col_2   Col_3   Col_4   Col_5   Col_6
0  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
1  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
2  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
3  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
4  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF

Another one, without apply:

m = (df.assign(k=(pd.factorize(df.Col)[0] + 1).astype(str),
               s=df.groupby('Col').cumcount())
       .set_index(['s', 'k'])).unstack().rename_axis(None)
m.columns = m.columns.map('_'.join)

    Col_1   Col_2   Col_3   Col_4   Col_5   Col_6
0  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
1  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
2  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
3  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
4  AAAAAA  BBBBBB  CCCCCC  DDDDDD  EEEEEE  FFFFFF
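
Both snippets in this answer assume that df already exists with a column named Col, and they factorize the whole string, which matches the question only because every line within a group is identical. A self-contained sketch of a variation follows. It assumes a small hand-built frame, factorizes the first character instead of the full value, and keeps the group number as an integer so that ten or more groups would still sort numerically. The names `tmp` and `wide` are illustrative.

```python
import pandas as pd

# Hypothetical input: same shape as the question's data, with fewer rows
# and lines that differ within a group.
df = pd.DataFrame({"Col": ["AAAAAA", "AAAAXX", "BBBBBB", "BBBBYY"]})

# Group number (1, 2, ...) in order of first appearance of the first character.
tmp = df.assign(k=pd.factorize(df["Col"].str[0])[0] + 1)

# Position of each row within its group.
tmp["s"] = tmp.groupby("k").cumcount()

# Rows indexed by position, columns by group number.
wide = tmp.set_index(["s", "k"])["Col"].unstack().rename_axis(None)
wide.columns = [f"Col_{k}" for k in wide.columns]

print(wide)
```

Because `factorize` numbers groups by first appearance, the columns follow file order here rather than alphabetical order.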