Question

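I have a large file (20,000 rows) with 2 columns (id and value). Some ids have several different values. I want to write a for loop that gives me all the values for each id.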

By the way, I am using pandas and have imported the data as a DataFrame.

For example, the file is:

id  value 
a    2
a    3
b    2
c    4
b    5

I would like the result to look like this:

a 2,3
b 2,5
c 4

Thanks

Answer 1

Use groupby with apply and join. The numeric column value has to be cast to string first, since join only accepts strings:

print (df.groupby('id')['value'].apply(lambda x: ','.join(x.astype(str))).reset_index())
  id value
0  a   2,3
1  b   2,5
2  c     4
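
To try this without the original file, the sample rows from the question can be typed in by hand. The sketch below assumes exactly those five rows; the names joined, key and values are only illustrative. It also shows the explicit for loop the question asks about, which works because iterating over a groupby yields one (id, values) pair per distinct id.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"id": ["a", "a", "b", "c", "b"],
                   "value": [2, 3, 2, 4, 5]})

# Cast each group's values to str, then join them with commas.
joined = (df.groupby("id")["value"]
            .apply(lambda x: ",".join(x.astype(str)))
            .reset_index())
print(joined)

# Explicit loop: each iteration gives the id and a Series of its values.
for key, values in df.groupby("id")["value"]:
    print(key, ",".join(map(str, values)))
```

The loop is handy for inspection, but for 20,000 rows the single groupby/apply expression is usually the more convenient form because it returns a DataFrame directly.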

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list("ABCDEFGHIJKLMNO")
# One million rows, 15 distinct ids, values in 0..9.
df = pd.DataFrame({'id':np.random.choice(L, N), 
                   'value': np.random.randint(10, size=N)})
print (df)
#[1000000 rows x 2 columns]

In [84]: %timeit (df.groupby('id')['value'].apply(lambda x: ','.join(x.astype(str))).reset_index())
1 loop, best of 3: 1.46 s per loop

In [85]: %timeit (df.astype(str).groupby('id').value.apply(','.join).reset_index())
1 loop, best of 3: 1.83 s per loop

Answer 2

IIUC, you want a list of values:

df.groupby('id').value.apply(list)

id
a    [2, 3]
b    [2, 5]
c       [4]
Name: value, dtype: object
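
If the lists are meant to be looked up by id afterwards, the resulting Series can be converted to a plain dict. This is a small sketch on the sample rows from the question; the names lists and by_id are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"id": ["a", "a", "b", "c", "b"],
                   "value": [2, 3, 2, 4, 5]})

# One list of values per id, in the order the rows appear.
lists = df.groupby("id")["value"].apply(list)

# A Series of lists converts to a dict keyed by id.
by_id = lists.to_dict()
print(by_id["b"])
```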

If you want strings instead, this is @jezrael's answer, just modified to my taste:

df.astype(str).groupby('id').value.apply(','.join)

id
a    2,3
b    2,5
c      4
Name: value, dtype: object

Experimental numpy solution

u, i = np.unique(df.id.values, return_inverse=True)
g = np.arange(len(u))[:, None] == i

def slc(r):
    return df.value.values[r].tolist()

pd.Series(list(map(slc, g)), u)

a    [2, 3]
b    [2, 5]
c       [4]
dtype: object
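
To see what u, i and g hold, here is the same idea on bare arrays, as a sketch using the sample rows from the question. The names ids, values and groups are illustrative. np.unique with return_inverse=True returns the sorted unique ids plus, for every row, the position of its id among them; comparing a column of group numbers against that array broadcasts into one boolean mask per id.

```python
import numpy as np

# Sample data copied from the question, as plain arrays.
ids = np.array(["a", "a", "b", "c", "b"])
values = np.array([2, 3, 2, 4, 5])

# u: sorted unique ids; i: index into u for every original row.
u, i = np.unique(ids, return_inverse=True)
# u -> ['a' 'b' 'c'], i -> [0 0 1 2 1]

# Shape (len(u), len(ids)): row k is True where the row's id is u[k].
g = np.arange(len(u))[:, None] == i

# Apply each mask to the values to collect them per id.
groups = {key: values[mask].tolist() for key, mask in zip(u.tolist(), g)}
print(groups)
```

Note that g stores len(u) * len(ids) booleans, so memory grows with the number of distinct ids; that is cheap with a handful of ids but may become a concern when there are many.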

For strings

u, i = np.unique(df.id.values, return_inverse=True)
g = np.arange(len(u))[:, None] == i

def slc(r):
    return ','.join(map(str, df.value.values[r].tolist()))

pd.Series(list(map(slc, g)), u)

a    2,3
b    2,5
c      4
dtype: object

Timing

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L = list("ABCDEFGHIJKLMNO")
df = pd.DataFrame({'id':np.random.choice(L, N), 
                   'value': np.random.randint(10, size=N)})

Code

def pir1(df):
    return df.astype(str).groupby('id').value.apply(','.join)

def pir2(df):
    u, i = np.unique(df.id.values, return_inverse=True)
    g = np.arange(len(u))[:, None] == i

    def slc(r):
        return ','.join(map(str, df.value.values[r].tolist()))

    return pd.Series(list(map(slc, g)), u, name='value')

def pir3(df):
    return df.groupby('id').value.apply(list)

def pir4(df):
    u, i = np.unique(df.id.values, return_inverse=True)
    g = np.arange(len(u))[:, None] == i

    def slc(r):
        return df.value.values[r].tolist()

    return pd.Series(list(map(slc, g)), u, name='value')

def jez1(df):
    return df.groupby('id')['value'].apply(lambda x: ','.join(x.astype(str)))

Results
Note: pir1 and pir2 return strings (as does jez1); pir3 and pir4 return lists.
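
Outside IPython, where %timeit is not available, a comparison like this can be approximated with the standard-library timeit module. The following is only a sketch of one way to do that, not the benchmark that was originally run: N is reduced so it finishes quickly, only two of the string variants are included, and the variable names same and seconds are illustrative.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10_000  # assumption: much smaller than the original N, to keep this quick
L = list("ABCDEFGHIJKLMNO")
df = pd.DataFrame({"id": np.random.choice(L, N),
                   "value": np.random.randint(10, size=N)})

def pir1(df):
    return df.astype(str).groupby("id").value.apply(",".join)

def jez1(df):
    return df.groupby("id")["value"].apply(lambda x: ",".join(x.astype(str)))

# Check that both string variants agree before comparing their speed.
same = pir1(df).tolist() == jez1(df).tolist()
print("results match:", same)

# Average wall-clock time over a few calls of each function.
for func in (pir1, jez1):
    seconds = timeit.timeit(lambda: func(df), number=5) / 5
    print(f"{func.__name__}: {seconds:.4f} s per call")
```

Absolute numbers depend on the machine and the pandas version, so only the relative ordering on a given setup is meaningful.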

For loop in Python to join data in 2 columns
