Splitting pandas DataFrame rows and creating a new DataFrame

Time: 2013-06-29 15:07:08

Tags: pandas

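I have a DataFrame with two fields of interest: docID and category. Note that the actual content is also part of this DataFrame, along with other fields.

JAN001  news,sports

JAN212  politics

FEB208  business,news

I am trying to use pandas to create a new DataFrame that looks like this:

JAN001  news

JAN001  sports

JAN212  politics ...

I know I could loop over the DataFrame, but I am new to pandas and suspect there is a more efficient way to do this. I have looked at several questions and tried various examples, but so far without success. I am also curious whether indexing is part of the solution, but I have not explored that route yet. Thanks for any help or suggestions.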


Update: here is the code and a sample of the data.

{

import pandas as pd

# load the tab-separated file and keep only the first 20 rows
foo = pd.read_csv("dtu_topic.txt", sep="\t")
foo = foo[:20]

print(foo)

#    id  dtu_docid                                          dtu_topic  \
#0   21523  2012-1553             Energy Taxation,State & Local Taxation
#1   21522  2012-1552            Legislation & Policy\Financial Services
#2   25470  2010-0227              Quantitative Economics and Statistics
#3   25477  2010-0215                        International Taxation\Asia
#4   21539  2012-1529  Ernst & Young Newsletters\This Week in Tax Reform
#5   25483  2010-0207                             State & Local Taxation
#6   21536  2012-1533             Payroll & Employment Tax\State & Local
#7   21537  2012-1532             Payroll & Employment Tax\State & Local
#8   24943  2010-0929  IRS Practice & Procedure,Tax Quality & Risk Ma...
#9   25500  2010-0185                      Financial Services Industries
#10  21542  2012-1524             Payroll & Employment Tax\State & Local
#11  21551  2012-1507                                   Personal Finance
#12  25523  2010-0159                      International Taxation\Europe
#13  21549  2012-1510             Payroll & Employment Tax\State & Local
#14  21557  2012-1501  Payroll & Employment Tax\Federal,Payroll & Emp...
#15  21558  2012-1498                   Accounting Methods & Inventories
#16  25567  2010-0104                                        Real Estate
#17  25529  2010-0152  Financial Services Industries,International Ta...
#18  21564  2012-1495                           IRS Practice & Procedure
#19  21563  2012-1494                   Payroll & Employment Tax\Federal

#parse dtu_topic into a list of categories
foo["dtu_topic_split"] = foo.dtu_topic.str.replace(',','\\')
foo["dtu_topic_split"] = foo.dtu_topic_split.str.split('\\').tolist()

# adapted from an R data.table example on Stack Overflow - raises a syntax error in Python
dcm = foo[,list(dtu_docid = dtu_docid,
           dtu_topic = unlist(dtu_topic.split),
           by = 1:nrow(foo)]


                 #dt.2 <- dt[, list(Probe.Id = Probe.Id,
                 #                      Gene.Id = unlist(Gene.Id_split),
                 #                      Score.d = Score.d), by = 1:nrow(dt)]

#dcm = unlist(foo.dtu_topic_split)

print(dcm)

}

1 answer:

Answer 0: (score: 0)

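It looks like you are trying to turn a set of lists into something useful (your example actually has a list in only one of the columns you are interested in).

Try something like this: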

In [100]: from pandas import DataFrame, Series, concat

In [101]: df = DataFrame(dict(A = [['foo','bar','bah']] * 4, B = [['foo','bah']] * 4, C = [['foo']] * 4), index=range(4))

In [102]: df
Out[102]: 
                 A           B      C
0  [foo, bar, bah]  [foo, bah]  [foo]
1  [foo, bar, bah]  [foo, bah]  [foo]
2  [foo, bar, bah]  [foo, bah]  [foo]
3  [foo, bar, bah]  [foo, bah]  [foo]

In [103]: concat(dict([ (row[0],row[1].apply(lambda y: Series(y))) for row in df.iterrows() ]))
Out[103]: 
       0    1    2
0 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
1 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
2 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
3 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
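
For the docID/category layout in the question, the same idea can be sketched more directly. This is not the answerer's method: it assumes a pandas version that provides DataFrame.explode (added in 0.25, well after this thread), and the three-row frame below is a hypothetical sample that only mirrors the question's dtu_docid and dtu_topic columns.

```python
import pandas as pd

# Hypothetical sample mirroring the question's dtu_docid / dtu_topic columns.
foo = pd.DataFrame({
    "dtu_docid": ["2012-1553", "2012-1552", "2010-0207"],
    "dtu_topic": [
        "Energy Taxation,State & Local Taxation",
        "Legislation & Policy\\Financial Services",
        "State & Local Taxation",
    ],
})

# Split each topic string on a comma or a backslash (the two separators
# the question handles), giving one list of categories per row.
topics = foo["dtu_topic"].str.split(r"[,\\]")

# explode() emits one row per list element and repeats dtu_docid alongside it.
dcm = (
    foo[["dtu_docid"]]
    .assign(dtu_topic=topics)
    .explode("dtu_topic")
    .reset_index(drop=True)
)

print(dcm)
```

Each document ID ends up repeated once per category, which is the long format the question asks for. The reset_index call is optional; without it the exploded rows keep the index of the row they came from, which can be handy for joining back to the other fields.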