Question

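For context, imagine an internet forum with threads and posts. Every thread and every post has its own unique identifier, and multiple posts can belong to one thread (you can also think of people (= posts) and the families they belong to (= threads)).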

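I have a pandas DataFrame with two columns: thread_id and post_id. Each row of the DataFrame is a post in the forum. thread_id identifies the thread the post belongs to, and post_id is the post's unique identifier.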

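I now want to add a third column, thread_size, which shows what kind of thread the post belongs to. The column takes one of three values: small, medium, or big. As you might guess, the value depends on the size of the thread. Two thresholds (a lower and an upper one) are used to measure that size.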

I tried grouping the posts by thread and then setting thread_size in a for loop with if, elif, and else statements, but this does not seem to work:

forum["thread_size"] = np.nan

for thread_id, frame in forum.groupby(["thread_id"]):
    post_count = frame.size
    if post_count > 400:
        frame["thread_size"] = "big"
    elif post_count > 300:
        frame["thread_size"] = "medium"
    else:
        frame["thread_size"] = "small"

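Edit: Think of forum as a city (the DataFrame) containing people who belong to families. Each row of my DataFrame represents a person (post) who belongs to a family (thread). I want to extend the city DataFrame with a column named family-size, so that every person (row) carries both the family they belong to and whether that family is big, medium, or small.

Before: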

[name]    [family]   
 oscar     potter       
 frederic  minamisawa  
 blerim    meier       
 marina    minamisawa

After:

[name]    [family]     [family-size]
 oscar     potter       small
 frederic  minamisawa   big
 blerim    meier        medium
 marina    minamisawa   big

Answer 1

Use pd.cut to put them into bins. First, some mock data:

import numpy as np
import pandas as pd

n = 1000
np.random.seed(2)

# I want to bias the threads such that thread 1 has a decent chance
# of being "large", 2 is "medium" and 3 and 4 are "small"
thread_id = np.random.choice([1,2,3,4], size=n, p=[0.4, 0.3, 0.2, 0.1])

# post_id is unique, may as well be sequential
post_id = np.arange(n)

# The dataframe
forum = pd.DataFrame({
    'thread_id': thread_id,
    'post_id': post_id
})

Now, to solve your problem:

stat = forum.groupby('thread_id').size().to_frame('count')
stat['size'] = pd.cut(stat['count'], [0, 300, 400, np.inf], labels=['small', 'medium', 'large'])

The pd.cut function splits the stat['count'] series into 3 bins:

(0, 300]: small
(300, 400]: medium
(400, inf]: large

Result:

           count    size
thread_id               
1            416   large
2            313  medium
3            165   small
4            106   small
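
The table above has one row per thread, while the question asks for a label on every post. One way to carry the label back is to look up each row's thread_id in stat['size']. The following is a minimal sketch rather than part of the original answer: the toy forum frame and the scaled-down thresholds (2 and 4 instead of 300 and 400) are assumptions chosen so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: thread 1 has 5 posts, thread 2 has 3, thread 3 has 1.
forum = pd.DataFrame({
    "thread_id": [1, 1, 1, 1, 1, 2, 2, 2, 3],
    "post_id": range(9),
})

# Thresholds scaled down for the toy data (the answer uses 300 and 400).
bins = [0, 2, 4, np.inf]
labels = ["small", "medium", "large"]

# One row per thread: its post count and the matching size label.
stat = forum.groupby("thread_id").size().to_frame("count")
stat["size"] = pd.cut(stat["count"], bins, labels=labels)

# Look up each post's thread_id in the per-thread table.
forum["thread_size"] = forum["thread_id"].map(stat["size"])
```

Because pd.cut closes each bin on the right by default, a thread with exactly 4 posts would fall into medium here, just as a thread with exactly 400 posts does in the answer.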

Answer 2

Found what I was looking for:

thread_category = lambda x: 2 if x > 400 else (1 if x > 300 else 0)

forum["thread_size"] = forum.groupby('thread_id')["post_id"].transform(lambda x: thread_category(len(x)))
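
The snippet above stores the codes 0, 1, and 2. If the string labels from the question are preferred, one alternative is to broadcast the group size with transform("size") and pick the label with np.select. This is a sketch of a swapped-in technique, not the poster's method, and the toy data and thresholds (2 and 4) are assumptions. It also avoids two likely problems in the loop from the question: frame.size counts cells (rows times columns) rather than rows, and assigning to frame changes a per-group copy instead of forum.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: thread 1 has 5 posts, thread 2 has 3, thread 3 has 1.
forum = pd.DataFrame({
    "thread_id": [1, 1, 1, 1, 1, 2, 2, 2, 3],
    "post_id": range(9),
})

# Number of posts in each row's thread, aligned with forum's index.
post_count = forum.groupby("thread_id")["post_id"].transform("size")

# Conditions are checked in order; the first one that matches wins.
forum["thread_size"] = np.select(
    [post_count > 4, post_count > 2],
    ["big", "medium"],
    default="small",
)
```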

Classify rows based on group size
