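I have a pandas DataFrame with one observation per firm per quarter, and each firm observation contains speech from different people. So I have the "normal" variables such as year, title, firm name, etc., and then for each firm-quarter observation I have a variable allinfolistmain, which is stored in each observation as a list of lists, with the name and the speech as separate list entries.
For example, for one row the allinfolistmain entry looks like this: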
[[Mark Johnson, Hello], [Christina Brown, Have a good day], [Mark Johnson, You too], [Christina Brown, Thank you]]
The overall DataFrame looks like this:
Index  Year  Title      Firm  allinfolistmain
0      2009  CC A 2009  A     [[Mark Johnson, Hello], [Christina Brown, Have a good day], [Mark Johnson, You too], [Christina Brown, Thank you]]
1      2009  CC B 2009  B     [[Lucas Bass, Hello], [Harm Brown, Have a good day], [Lucas Bass, You too], [Harm Brown, Thank you]]
2      2008  CC A 2008  A     [[Mark Johnson, Nice to see you], [Christina Brown, You too], [Mark Johnson, Thanks], [Christina Brown, Bye]]
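Now, for each row/observation, I want to group the statements (the list elements at index 1) by name (the list elements at index 0), so that each person's statements are simply joined together into one string in the list: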
[[Mark Johnson, Hello You too], [Christina Brown, Have a good day Thank you]]
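Could someone help me and explain how I can go through each row to create such a new list? All suggestions are very welcome, as I am still at the very beginning of coding and can't solve this on my own.
Thank you very much! Julia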
Answer 0 (score: 0)
If I have understood your question and how your DataFrame is built correctly, is this what you want to do? The resulting list is printed at the end:
# a new dictionary of lists to collect all "speech" values for each "name" key
nd = {}
for row in df['allinfolistmain']:      # for each row in the DataFrame
    for n in row:                      # for each [name, speech] pair in the row
        try:
            nd[n[0]].append(n[1])      # if the key already exists, add the speech to its list
        except KeyError:               # otherwise the key doesn't exist yet
            nd[n[0]] = [n[1]]          # so add the key with the speech as its first entry
newlist = []                           # create a new list
for k, v in nd.items():                # for each key, value in the dictionary from the previous step
    newlist.append((k, ' '.join(v)))   # add a tuple of (name, all speeches joined into one string)
print(newlist)
Output (the order of the entries can vary with the Python version):
[('Christina Brown', 'Have a good day Thank you You too Bye'),
('Mark Johnson', 'Hello You too Nice to see you Thanks'),
('Lucas Bass', 'Hello You too'),
('Harm Brown', 'Have a good day Thank you')]
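Note that the loop above pools the speeches across all rows of the DataFrame. If the grouping should instead happen separately inside each row, as the question seems to ask, the same dictionary idea can be wrapped in a function that handles one row at a time. This is only a minimal sketch: the function name group_by_speaker is made up for illustration, and it assumes every entry is a two-element [name, speech] pair of strings and that Python 3.7+ is used, so that the dictionary keeps the order in which the names first appear.

```python
def group_by_speaker(pairs):
    """Join each speaker's utterances within one row, keeping first-appearance order."""
    grouped = {}
    for name, speech in pairs:
        # create the list on first sight of a name, then append the speech
        grouped.setdefault(name, []).append(speech)
    # return a list of [name, joined speech] lists, as in the question's example
    return [[name, " ".join(speeches)] for name, speeches in grouped.items()]


# one row taken from the question's example data
row = [["Mark Johnson", "Hello"], ["Christina Brown", "Have a good day"],
       ["Mark Johnson", "You too"], ["Christina Brown", "Thank you"]]

print(group_by_speaker(row))
# [['Mark Johnson', 'Hello You too'], ['Christina Brown', 'Have a good day Thank you']]
```

On the DataFrame itself this would be applied per row with df['allinfolistmain'].apply(group_by_speaker).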
Answer 1 (score: 0)
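from collections import defaultdict

def g(L):
    # collect every speech under its speaker's name
    res = defaultdict(list)
    for v, k in L:
        res[v].append(k)
    # join each speaker's speeches into one string and return (name, text) pairs
    new = list({key: ' '.join(value) for key, value in res.items()}.items())
    return new

# apply the function to every row of the allinfolistmain column
df.allinfolistmain.apply(g)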
Test on a single list:
L=[('Mark Johnson', 'Hello'), ('Christina Brown', 'Have a good day'), ('Mark Johnson', 'You too'), ('Christina Brown', 'Thank you')]
g(L)
Out[784]:
[('Mark Johnson', 'Hello You too'),
('Christina Brown', 'Have a good day Thank you')]
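To keep the grouped result next to the original data, the Series returned by apply can be assigned to a new column. The following is a rough sketch on a two-row frame rebuilt from the question's example; the column name grouped is hypothetical, and the entries are assumed to be plain [name, speech] string pairs.

```python
from collections import defaultdict

import pandas as pd


def g(pairs):
    # same idea as above: collect speeches per name, then join them
    res = defaultdict(list)
    for name, speech in pairs:
        res[name].append(speech)
    return [(name, " ".join(speeches)) for name, speeches in res.items()]


# small stand-in for the real DataFrame (first two rows of the question's example)
df = pd.DataFrame({
    "Firm": ["A", "B"],
    "allinfolistmain": [
        [["Mark Johnson", "Hello"], ["Christina Brown", "Have a good day"],
         ["Mark Johnson", "You too"], ["Christina Brown", "Thank you"]],
        [["Lucas Bass", "Hello"], ["Harm Brown", "Have a good day"],
         ["Lucas Bass", "You too"], ["Harm Brown", "Thank you"]],
    ],
})

# "grouped" is a hypothetical column name chosen for this example
df["grouped"] = df["allinfolistmain"].apply(g)
print(df["grouped"].tolist())
```

Each cell of the new column then holds the per-row list of (name, joined speech) tuples, while allinfolistmain stays unchanged.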