Question

Is there an operation in pandas that does the same thing as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
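For reference, the two-step approach mentioned above (apply followed by itertools.chain) could look roughly like this. It is only a minimal sketch on a plain pandas Series; the names `s`, `nested` and `flat` are illustrative and not part of the question.

```python
from itertools import chain

import pandas as pd

# Same input as the pyspark example above.
s = pd.Series([2, 3, 4])

# Step 1: map each element to an iterable (one range object per row).
nested = s.apply(lambda x: range(1, x))

# Step 2: chain the iterables together and wrap them in a new Series.
flat = pd.Series(list(chain.from_iterable(nested)))

print(sorted(flat.tolist()))  # [1, 1, 1, 2, 2, 3]
```

Note that the result gets a fresh 0..n-1 index, so the link back to the original rows is lost.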

Answer 1

There is a hack. I often do something like this:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0     1
1     3
2     2
3     4
4   NaN
5     5
dtype: float64

The NaN is introduced because the intermediate object creates a MultiIndex, but for a lot of things you can just drop it:

In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64

This trick uses only pandas code, so I would expect it to be reasonably efficient, though it may not cope well with lists of very different sizes.
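To make the intermediate step visible, here is a small sketch of the same hack with the wide frame kept in a variable. The names `wide` and `flat` are made up for illustration, and the `astype(int)` at the end is an addition to undo the float conversion caused by the NaN padding.

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# One column per list position; the shorter list is padded with NaN,
# which also turns the integers into floats.
wide = df['x'].apply(pd.Series)
print(wide.shape)  # (2, 3)

# unstack() walks the frame column by column, so the values come out as
# 1, 3, 2, 4, NaN, 5 rather than in the original row order.
flat = wide.unstack().dropna().astype(int).reset_index(drop=True)

print(flat.tolist())  # [1, 3, 2, 4, 5]
```

The column-by-column order is worth keeping in mind if the order of the flattened values matters.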

Answer 2

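I suspect that the answer is "no, not efficiently."

Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following: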

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df
Out[3]: 
           x
0     [1, 2]
1  [3, 4, 5]

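And you want a flattened result with one value per row (1, 2, 3, 4, 5).

It is much more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this, it would probably only be able to operate at slow Python speeds rather than fast C speeds.

Generally, one does a bit of munging of the data before using tabular computation.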
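As a rough illustration of normalizing in Python first, the sketch below flattens nested records into flat rows before any DataFrame is built. The `records` data and the column names `row` and `value` are invented for the example.

```python
import pandas as pd

# Hypothetical nested input: one list of values per source row.
records = [[1, 2], [3, 4, 5]]

# Flatten in plain Python, remembering which source row each value came from.
rows = [
    (row_id, value)
    for row_id, values in enumerate(records)
    for value in values
]

# Only now hand the already-tabular data to pandas.
flat_df = pd.DataFrame(rows, columns=['row', 'value'])

print(flat_df['value'].tolist())  # [1, 2, 3, 4, 5]
print(flat_df['row'].tolist())    # [0, 0, 1, 1, 1]
```

Keeping the source row as an ordinary column means nothing is lost by flattening, and the frame holds only scalar values.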

Answer 3

There are three steps to solving this problem.


Answer 4

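Since July 2019, Pandas has offered pd.Series.explode to unnest frames. Here is a possible implementation of pd.Series.flatmap based on explode and map. Why?

The flatmap operation should be a subset of map, not apply. See this thread for details on map/applymap/apply: Difference between map, applymap and apply methods in Pandas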

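import pandas as pd
from typing import Any, Callable, Iterable

def flatmap(
    self: pd.Series,
    func: Callable[[Any], Iterable],
    ignore_index: bool = False,
) -> pd.Series:
    # Map each element to an iterable, then unnest the iterables into rows.
    # The ignore_index argument of explode requires pandas >= 1.1.
    return self.map(func).explode(ignore_index=ignore_index)

# Attach the function to pd.Series so it can be called as a method.
pd.Series.flatmap = flatmap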

# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
#    A   B
# 0  1   6
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10
print(df.A.flatmap(range,False))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object
print(df.A.flatmap(range,True))
# 0     0
# 1     0
# 2     1
# 3     0
# 4     1
# 5     2
# 6     0
# 7     1
# 8     2
# 9     3
# 10    0
# 11    1
# 12    2
# 13    3
# 14    4
# Name: A, dtype: object

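As you can see, the main issue is the index. You could ignore it and just reset it, but then you would be better off using NumPy or standard lists, since the index is one of the key points of Pandas. If you do not care about the index at all, you could reuse the idea of the solution above, changing pd.Series.map to pd.DataFrame.applymap and pd.Series.explode to pd.DataFrame.explode, and forcing the index to be ignored.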
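A DataFrame version of the same idea might look like the sketch below. It is an assumption-laden illustration rather than a tested recipe: it uses a single column, needs pandas 1.1 or later for `ignore_index`, and falls back between `DataFrame.map` and `DataFrame.applymap` because `applymap` was renamed to `map` in pandas 2.1. The names `elementwise` and `out` are made up.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# DataFrame.applymap was renamed to DataFrame.map in pandas 2.1,
# so pick whichever element-wise method this pandas version provides.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# Map every cell to an iterable, then unnest column 'A' and discard
# the original index.
out = elementwise(range).explode('A', ignore_index=True)

print(out['A'].tolist())   # [0, 0, 1, 0, 1, 2]
print(out.index.tolist())  # [0, 1, 2, 3, 4, 5]
```

With more than one column, the iterables in each row would need matching lengths for a multi-column explode, so this sketch stays with one column.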

pyspark's flatMap in pandas

4 answers: