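I have a dataframe that contains two fields of interest: docID and category. Note that the actual content is also part of this dataframe, along with other fields.
JAN001 News,Sports
JAN212 Politics
FEB208 Business,News
I am trying to use pandas to create a new dataframe that looks like this:
JAN001 News
JAN001 Sports
JAN212 Politics ...
I know I could loop over the dataframe, but I am new to pandas and assume there is a more efficient way to do this. I have looked at several questions and tried various examples, but so far without success. I am also curious whether an index is part of the solution, though I have not explored that route yet. Thanks for any help or suggestions.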
Update - here is the code and its output:
{
foo = pd.read_csv("dtu_topic.txt", sep = "\t")
foo = foo[:20]
print foo
# id dtu_docid dtu_topic \
#0 21523 2012-1553 Energy Taxation,State & Local Taxation
#1 21522 2012-1552 Legislation & Policy\Financial Services
#2 25470 2010-0227 Quantitative Economics and Statistics
#3 25477 2010-0215 International Taxation\Asia
#4 21539 2012-1529 Ernst & Young Newsletters\This Week in Tax Reform
#5 25483 2010-0207 State & Local Taxation
#6 21536 2012-1533 Payroll & Employment Tax\State & Local
#7 21537 2012-1532 Payroll & Employment Tax\State & Local
#8 24943 2010-0929 IRS Practice & Procedure,Tax Quality & Risk Ma...
#9 25500 2010-0185 Financial Services Industries
#10 21542 2012-1524 Payroll & Employment Tax\State & Local
#11 21551 2012-1507 Personal Finance
#12 25523 2010-0159 International Taxation\Europe
#13 21549 2012-1510 Payroll & Employment Tax\State & Local
#14 21557 2012-1501 Payroll & Employment Tax\Federal,Payroll & Emp...
#15 21558 2012-1498 Accounting Methods & Inventories
#16 25567 2010-0104 Real Estate
#17 25529 2010-0152 Financial Services Industries,International Ta...
#18 21564 2012-1495 IRS Practice & Procedure
#19 21563 2012-1494 Payroll & Employment Tax\Federal
#parse dtu_topic into a list of categories
foo["dtu_topic_split"] = foo.dtu_topic.str.replace(',','\\')
foo["dtu_topic_split"] = foo.dtu_topic_split.str.split('\\').tolist()
# from example on stack overflow - get syntax error
dcm = foo[,list(dtu_docid = dtu_docid,
dtu_topic = unlist(dtu_topic.split),
by = 1:nrow(foo)]
#dt.2 <- dt[, list(Probe.Id = Probe.Id,
# Gene.Id = unlist(Gene.Id_split),
# Score.d = Score.d), by = 1:nrow(dt)]
#dcm = unlist(foo.dtu_topic_split)
print dcm
}
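Answer 0 (score: 0)
It looks like you are trying to turn a set of lists into something useful (your example actually has a list in only the one column you are interested in).
Try something like this: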
In [101]: df = DataFrame(dict(A = [['foo','bar','bah']], B = [['foo','bah']], C = [['foo']]),index=range(4))
In [102]: df
Out[102]:
                 A           B      C
0  [foo, bar, bah]  [foo, bah]  [foo]
1  [foo, bar, bah]  [foo, bah]  [foo]
2  [foo, bar, bah]  [foo, bah]  [foo]
3  [foo, bar, bah]  [foo, bah]  [foo]
In [103]: concat(dict([ (row[0],row[1].apply(lambda y: Series(y))) for row in df.iterrows() ]))
Out[103]:
       0    1    2
0 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
1 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
2 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
3 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
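For the one-row-per-category layout asked for in the question, a shorter route may be to split the string column into lists and then expand each list into rows. The sketch below is not the approach used in the transcript above; it swaps in `DataFrame.explode`, which assumes pandas 0.25 or newer. The sample frames and the column names `docID` and `category` are hypothetical stand-ins modeled on the question's examples.

```python
import pandas as pd

# Hypothetical sample mirroring the question's docID / category layout.
df = pd.DataFrame(
    {
        "docID": ["JAN001", "JAN212", "FEB208"],
        "category": ["News,Sports", "Politics", "Business,News"],
    }
)

# Split each category string into a list, then emit one row per list element.
# DataFrame.explode is assumed to be available (pandas 0.25 or newer).
long_df = (
    df.assign(category=df["category"].str.split(","))
      .explode("category")
      .reset_index(drop=True)
)
print(long_df)

# The real data in the question mixes ',' and '\' as separators.
# A multi-character pattern is treated as a regular expression by str.split,
# so a character class can split on either one in a single pass.
topics = pd.DataFrame(
    {
        "dtu_docid": ["2012-1553", "2010-0215"],
        "dtu_topic": [
            "Energy Taxation,State & Local Taxation",
            "International Taxation\\Asia",
        ],
    }
)
long_topics = (
    topics.assign(dtu_topic=topics["dtu_topic"].str.split(r"[,\\]"))
          .explode("dtu_topic")
          .reset_index(drop=True)
)
print(long_topics)
```

Every other column is repeated alongside each expanded value, so the content and remaining fields mentioned in the question would be carried through unchanged. No special index is needed; `reset_index(drop=True)` only replaces the repeated row labels that `explode` leaves behind.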