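I have a list of text files that need to end up in a single data frame, so I read the files and concatenated them into one. However, the resulting data frame has many columns (452), and I want to reshape it into a custom layout: just two columns, e.g. columns 0 and 1. This is what my data looks like:
This is what I have tried so far: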
import glob
import pandas as pd
allfiles = glob.glob('C:\\fake\\*.txt')
dfs = pd.concat([pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles], axis=1)
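Now I simply want to reshape the resulting data frame into two columns, e.g. 0 and 1. How can I do this? Any ideas?
Update: desired output
This is my expected output (just an example):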
d = {'headline': ["Alex Jones Vindicated something", "California Surprisingly ", "Mexicans Are Chomping something"],
'context': ["Alex Jones, purveyor of somethig long text", "Setting Up Face-Off With Trump ", "Mexico has been unfairly "]}
pd.DataFrame(data=d)
Update 2: raw data
This is what a raw text file looks like (I am reading multiple text files into one data frame with only two columns):
texttexttexttexttexttexttexttexttexttext
longtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtextlongtext
Answer 0 (score: 2)
Simply drop the outermost axis specification (the axis=1 passed to pd.concat); i.e. instead of
In [44]: pd.concat([pd.read_csv(file, header = None, sep = '\n', quoting=3, skip_blank_lines = True).T for file in allfiles], axis=1)
Out[44]:
        0       1       0       1       0       1
0  test1a  test1b  test2a  test2b  test3a  test3b
do
In [45]: pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles)
Out[45]:
        0       1
0  test1a  test1b
0  test2a  test2b
0  test3a  test3b
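To see the effect of the axis argument in isolation, here is a minimal sketch that skips the file reading entirely. The three one-row frames are hypothetical stand-ins for what each transposed file would look like, assuming every file has exactly two non-empty lines.

```python
import pandas as pd

# Hypothetical stand-ins for three files, each already read and transposed
# into a single row with columns 0 and 1.
frames = [
    pd.DataFrame([["test1a", "test1b"]]),
    pd.DataFrame([["test2a", "test2b"]]),
    pd.DataFrame([["test3a", "test3b"]]),
]

# axis=1 places the frames side by side: one row, six columns.
wide = pd.concat(frames, axis=1)

# The default (axis=0) stacks them: three rows, two columns.
tall = pd.concat(frames)

# Optional: renumber the rows 0..n-1 instead of repeating index 0.
tall = tall.reset_index(drop=True)

print(wide.shape)
print(tall.shape)
```

With many files, the side-by-side variant is what produces hundreds of columns, while the stacked variant keeps one row per file.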
Edit, since the post has been edited:
For example, with the following input:
In [79]: !cat blah.test
test1a
test1b
In [80]: !cat blah2.test
test2a
test2b
In [81]: !cat blah3.test
test3a
test3b
In [82]: allfiles
Out[82]: ['blah.test', 'blah2.test', 'blah3.test']
we get the desired output:
In [83]: pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines=True).T for file in allfiles)
Out[83]:
        0       1
0  test1a  test1b
0  test2a  test2b
0  test3a  test3b
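Edit #2, based on the comments below:
At least one of your files contains more than two non-empty lines, so it needs further processing. In your case, I would probably do something like the following: treat the first line as the headline and join all remaining lines into the context.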
In [169]: df = pd.concat(pd.read_csv(file, header=None, sep='\n', quoting=3, skip_blank_lines = True).T for file in allfiles).reset_index(drop=True).fillna('')
In [170]: df_clean = pd.DataFrame({'headline': df[0], 'context': df.loc[:, 1:].apply(' '.join, axis=1)})
In [171]: df_clean.head()
Out[171]:
headline context
0 Alex Jones Vindicated in "Pizzagate" Controversy "Alex Jones, purveyor of the independent inves...
1 THE BIG DATA CONSPIRACY Government and Silicon Valley are looking to e...
2 California Surprisingly Lenient on Auto Emissi... Setting Up Face-Off With Trump "California's c...
3 Mexicans Are Chomping at the Bit to Stop NAFTA... Mexico has been unfairly gaining from NAFTA as...
4 Breaking News: Snapchat to purchase Twitter fo... Yahoo and AOL could be extremely popular over ...
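The same idea can also be sketched without read_csv, which may help if your pandas version does not accept sep='\n' (as far as I know, newer releases raise a ValueError for it). This is only an illustration under assumptions: the helper names read_headline_context and build_frame are made up here, the first non-blank line of each file is taken to be the headline, and the files are assumed to be UTF-8. Unlike the fillna('') approach, it does not leave padding spaces in the context.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_headline_context(path):
    """Return (headline, context) for one text file.

    Assumes the first non-blank line is the headline and every
    remaining non-blank line belongs to the context.
    """
    text = Path(path).read_text(encoding="utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headline = lines[0] if lines else ""
    context = " ".join(lines[1:])
    return headline, context


def build_frame(paths):
    """Build a two-column frame with one row per file."""
    rows = [read_headline_context(p) for p in paths]
    return pd.DataFrame(rows, columns=["headline", "context"])


# Demo with two throwaway files; in practice pass glob.glob(...) results.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_text("Headline A\n\nfirst paragraph\nsecond paragraph\n", encoding="utf-8")
    b.write_text("Headline B\ncontext B\n", encoding="utf-8")
    df_clean = build_frame([a, b])

print(df_clean)
```

Building the rows in plain Python keeps the number of columns fixed at two, no matter how many lines any single file has.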