I have a function that I would like to apply row by row:
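import json
import pandas as pd

def item_split(row):
    items = json.loads(row['items'])
    # Repeat the row once per item, then attach the items as a new column.
    out = pd.DataFrame([row for _ in range(len(items))])
    out['item'] = items
    return out

tweets = tweets.apply(item_split, axis=1)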
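As you can see, the function takes a list of items and creates one row per item, duplicating the rest of the row's data. Unfortunately, my current approach is not a correct use of the apply method: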
ValueError Traceback (most recent call last)
/usr/lib/python3.4/site-packages/pandas/core/common.py in _asarray_tuplesafe(values, dtype)
2344 result = np.empty(len(values), dtype=object)
-> 2345 result[:] = values
2346 except ValueError:
ValueError: could not broadcast input array from shape (13) into shape (1)
Does anyone know how to do this properly? I'm a bit stumped.
Answer 0: (score: 1)
This question is similar to "pandas: apply function to DataFrame that can return multiple rows", which Wes McKinney has answered.
Say your data looks like this:
In [36]: tweets = pd.DataFrame({
   ....:     'items': [
   ....:         '[{"text": "user1-msg1"},{"text": "user1-msg2"},{"text": "user1-msg3"}]',
   ....:         '[{"text": "user2-msg1"},{"text": "user2-msg2"}]',
   ....:         '[{"text": "user3-msg1"}]',
   ....:     ],
   ....:     'user': ['user1', 'user2', 'user3'],
   ....: })
You can use .groupby() with group_keys=False to return multiple rows for each grouped item:
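In [37]: def item_split(group):
   ....:     # Each group holds the row(s) sharing one 'items' string; use the first.
   ....:     # (.iloc[0] replaces the long-deprecated .irow(0).)
   ....:     row = group.iloc[0]
   ....:     result = pd.DataFrame(json.loads(row['items']))
   ....:     result['user'] = row['user']
   ....:     return result
   ....: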
In [38]: tweets.groupby('items', group_keys=False).apply(item_split)
Out[38]:
         text   user
0  user1-msg1  user1
1  user1-msg2  user1
2  user1-msg3  user1
0  user2-msg1  user2
1  user2-msg2  user2
0  user3-msg1  user3
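
For comparison, here is a minimal sketch of a different technique that avoids groupby altogether. It assumes pandas 0.25 or later, where DataFrame.explode is available, and it assumes every JSON list is non-empty and every item has a "text" key, as in the sample data. The helper name split_items is made up for this illustration and is not part of pandas.

```python
import json

import pandas as pd

# Same sample data as in the answer above.
tweets = pd.DataFrame({
    'items': [
        '[{"text": "user1-msg1"},{"text": "user1-msg2"},{"text": "user1-msg3"}]',
        '[{"text": "user2-msg1"},{"text": "user2-msg2"}]',
        '[{"text": "user3-msg1"}]',
    ],
    'user': ['user1', 'user2', 'user3'],
})


def split_items(df):
    """Hypothetical helper: one output row per JSON item, other columns repeated."""
    out = df.copy()
    # Parse each JSON string into a Python list of dicts.
    out['items'] = out['items'].apply(json.loads)
    # explode() emits one row per list element and repeats the original index label.
    out = out.explode('items')
    # Assumption: each item is a dict with a "text" key.
    out['text'] = [item['text'] for item in out['items']]
    return out.drop(columns='items')[['text', 'user']]


result = split_items(tweets)
print(result)
```

Unlike the groupby version, the index here repeats the original row label (0, 0, 0, 1, 1, 2) rather than restarting at 0 for each user; call reset_index(drop=True) on the result if you want a fresh index.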