How can I reshape a multi-column DataFrame into the expected DataFrame?

Time: 2019-02-24 12:49:19

Tags: python pandas

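I have a list of text files that need to go into a single DataFrame, so I read the files and concatenated them into one. However, the resulting DataFrame has many columns (452), and I want to reshape it into a custom layout: I only want two columns, for example columns 0 and 1. This is what my data looks like:

[image in the original post]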

This is what I tried:

import glob
import pandas as pd

allfiles=glob.glob('C:\\fake\\*.txt')
dfs=pd.concat([pd.read_csv(file, header = None, sep = '\n', quoting=3, skip_blank_lines = True).T for file in allfiles], axis=1)

Now I want to reshape the resulting DataFrame so that it has just two columns, for example 0 and 1. How can I do that? Any ideas?

Update: desired output

This is my expected output (just an example):

d = {'headline': ["Alex Jones Vindicated  something", "California Surprisingly ", "Mexicans Are Chomping something"], 
     'context': ["Alex Jones, purveyor of somethig long text", "Setting Up Face-Off With Trump ", "Mexico has been unfairly "]}

pd.DataFrame(data=d)

Update 2: raw data

This is what the raw text files look like (I am reading multiple text files into a single DataFrame with only two columns):

texttexttexttexttexttexttexttexttexttext

longtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtext

1 answer:

Answer 0 (score: 2)

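Simply drop the outermost axis specification (axis=1), so that the per-file frames are stacked as rows rather than placed side by side; that is, instead of In [44] below, use In [45]: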

In [44]: pd.concat([pd.read_csv(file, header = None, sep = '\n', quoting=3, skip_blank_lines = True).T for file in allfiles], axis=1)
Out[44]:
        0       1       0       1       0       1
0  test1a  test1b  test2a  test2b  test3a  test3b

In [45]: pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles)
Out[45]:
        0       1
0  test1a  test1b
0  test2a  test2b
0  test3a  test3b
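
The difference between the two calls can be seen without any files. The following is a minimal sketch, assuming three hand-built one-row frames as stand-ins for the transposed per-file frames; the names frames, wide, tall and tall_renumbered are illustrative and not from the original answer.

```python
import pandas as pd

# Three one-row frames standing in for the transposed per-file frames.
frames = [
    pd.DataFrame([["test1a", "test1b"]]),
    pd.DataFrame([["test2a", "test2b"]]),
    pd.DataFrame([["test3a", "test3b"]]),
]

# axis=1 places the frames side by side: 1 row, 6 columns.
wide = pd.concat(frames, axis=1)

# The default axis=0 stacks them as rows: 3 rows, 2 columns.
tall = pd.concat(frames)

# ignore_index=True additionally renumbers the rows 0, 1, 2
# instead of repeating each frame's original index 0.
tall_renumbered = pd.concat(frames, ignore_index=True)

print(wide.shape, tall.shape)
print(tall_renumbered)
```

Passing ignore_index=True is optional; it only avoids the repeated 0 index visible in Out[45].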

Edit, since the post has been edited:

For example, with the following input:

In [79]: !cat blah.test
test1a

test1b
In [80]: !cat blah2.test
test2a

test2b
In [81]: !cat blah3.test
test3a

test3b
In [82]: allfiles
Out[82]: ['blah.test', 'blah2.test', 'blah3.test']

we get the desired output:

In [83]: pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles)
Out[83]:
        0       1
0  test1a  test1b
0  test2a  test2b
0  test3a  test3b
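
The session above can also be reproduced as a standalone script. This is a rough sketch rather than the answerer's code: it writes the three sample files to a temporary directory and reads their non-blank lines with plain Python instead of read_csv(sep='\n'), on the assumption (worth checking against your installed version) that some newer pandas releases reject a newline separator. The helper name read_nonblank_lines is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_nonblank_lines(path):
    """Return the non-blank lines of a text file, stripped of whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Same contents as blah.test, blah2.test and blah3.test above.
    samples = {
        "blah.test": "test1a\n\ntest1b\n",
        "blah2.test": "test2a\n\ntest2b\n",
        "blah3.test": "test3a\n\ntest3b\n",
    }
    for name, content in samples.items():
        (Path(tmp) / name).write_text(content, encoding="utf-8")

    allfiles = sorted(Path(tmp).glob("*.test"))
    # One list of lines per file becomes one row per file.
    rows = [read_nonblank_lines(path) for path in allfiles]

df = pd.DataFrame(rows)
print(df)
```

Because each file contributes one list, the resulting frame has one row per file and columns 0 and 1, with the rows already numbered 0, 1, 2.
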
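
Edit #2

Based on the comments below, at least one of your files contains more than two non-empty lines and needs further processing. In your case, I would probably do something like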

In [169]: df = pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines = True).T for file in allfiles).reset_index(drop=True).fillna('')

In [170]: df_clean = pd.DataFrame({'headline': df[0], 'context': df.loc[:, 1:].apply(' '.join, axis=1)})

In [171]: df_clean.head()
Out[171]:
                                            headline                                            context
0   Alex Jones Vindicated in "Pizzagate" Controversy  "Alex Jones, purveyor of the independent inves...
1                            THE BIG DATA CONSPIRACY  Government and Silicon Valley are looking to e...
2  California Surprisingly Lenient on Auto Emissi...  Setting Up Face-Off With Trump "California's c...
3  Mexicans Are Chomping at the Bit to Stop NAFTA...  Mexico has been unfairly gaining from NAFTA as...
4  Breaking News: Snapchat to purchase Twitter fo...  Yahoo and AOL could be extremely popular over ...
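
To see why reset_index, fillna and the join are each needed, here is a small sketch on made-up data, assuming two hand-built frames in place of real files; the sample strings are hypothetical, and the final .str.strip() is an addition that is not part of the answer above.

```python
import pandas as pd

# Hypothetical stand-ins for the transposed per-file frames: the second
# "file" has three non-blank lines, so its frame has an extra column.
frames = [
    pd.DataFrame([["Headline A", "Body A"]]),
    pd.DataFrame([["Headline B", "Body B part 1", "Body B part 2"]]),
]

# Stacking ragged rows leaves NaN where a file had fewer lines;
# fillna('') turns those cells into empty strings so str.join works,
# and reset_index(drop=True) renumbers the rows 0, 1, ...
df = pd.concat(frames).reset_index(drop=True).fillna("")

df_clean = pd.DataFrame({
    "headline": df[0],
    # Join every column after the first into one string, then trim the
    # trailing space left behind by empty cells.
    "context": df.loc[:, 1:].apply(" ".join, axis=1).str.strip(),
})
print(df_clean)
```

The first line of each file ends up in headline, and all remaining lines are merged into context, regardless of how many lines a file has.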