pandas - Merge nearly duplicate rows based on column value

Time: 2016-03-28 21:22:53

Tags: python pandas

I have a pandas DataFrame in which several rows are near duplicates of each other, differing in only one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.

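Here is an example of what I'm working with:

Name   Sid   Use_Case  Revenue
A      xx01  Voice     $10.00
A      xx01  SMS       $10.00
B      xx02  Voice     $5.00
C      xx03  Voice     $15.00
C      xx03  SMS       $15.00
C      xx03  Video     $15.00

And this is what I would like:

Name   Sid   Use_Case           Revenue
A      xx01  Voice, SMS         $10.00
B      xx02  Voice              $5.00
C      xx03  Voice, SMS, Video  $15.00

The reason I don't want to sum the "Revenue" column is that my table is the result of a pivot over several time periods, where "Revenue" simply ends up being listed multiple times instead of having a different value per "Use_Case".

What is the best way to tackle this? I've looked into the groupby() function, but I still don't understand it very well.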

3 Answers:

Answer 0: (score: 22)

I think you can use groupby with aggregate, taking first for the columns that repeat and the custom function ', '.join for Use_Case:

df = df.groupby('Name').agg({'Sid':'first', 
                             'Use_Case': ', '.join, 
                             'Revenue':'first' }).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00

Nice idea from the comments, thanks Goyo:

df = df.groupby(['Name','Sid','Revenue'])['Use_Case'].apply(', '.join).reset_index()

# change column order
print(df[['Name','Sid','Use_Case','Revenue']])
  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
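For anyone who wants to run both variants end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the names by_name, by_keys, cols, and result are only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# Variant 1: group on Name only and choose an aggregation per column.
by_name = (
    df.groupby("Name")
      .agg({"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"})
      .reset_index()
)

# Variant 2: group on every column that is constant within a group,
# then join the one column that varies.
by_keys = (
    df.groupby(["Name", "Sid", "Revenue"])["Use_Case"]
      .apply(", ".join)
      .reset_index()
)

cols = ["Name", "Sid", "Use_Case", "Revenue"]
result = by_name[cols]
print(result)
```

Variant 2 only merges rows whose Name, Sid, and Revenue all match, so it will not silently hide a row where one of those columns differs. Variant 1 simply keeps the first Sid and Revenue it sees for each Name.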

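Answer 1: (score: 2)

I was using some code that I didn't think was optimal, and eventually found jezrael's answer. But after adopting it and running a timeit test, I actually went back to what I was doing before, which was: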

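cmnts = {}
for i, row in df.iterrows():
    # Loop so the append is retried after the list for a new name is created
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])
            else:
                cmnts[row['Name']].append('n/a')
            break
        except KeyError:
            # First time this name is seen: start an empty list, then retry
            cmnts[row['Name']] = []

# Keep the first row per name, then overwrite Use_Case with the joined values
# (relies on the dict keeping names in order of first appearance)
df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]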
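As a possible tidier variant of the same idea (not the code that was timed in this answer), the retry loop can be replaced with collections.defaultdict, and the joined strings can be looked up by name instead of relying on dict order. The sample data is rebuilt from the question, and use_cases and merged are illustrative names.

```python
from collections import defaultdict

import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# Collect every Use_Case per Name; defaultdict creates the list on first use.
use_cases = defaultdict(list)
for name, use_case in zip(df["Name"], df["Use_Case"]):
    use_cases[name].append(use_case if use_case else "n/a")

# Keep the first row per Name, then fill Use_Case by looking up each name.
merged = df.drop_duplicates("Name").copy()
merged["Use_Case"] = merged["Name"].map(lambda name: "; ".join(use_cases[name]))
print(merged)
```

Working on a copy leaves the original df untouched, which makes the snippet safe to rerun.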

According to my 100-run timeit test, the iterate-and-replace method is roughly an order of magnitude faster than the groupby method.

import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = 'txt = {}\n' \
               'for i, row in df.iterrows():\n' \
               '    try:\n' \
               "        txt[row['a']].append(row['b'])\n\n" \
               '    except KeyError:\n' \
               "        txt[row['a']] = []\n" \
               "df.drop_duplicates('a', inplace=True)\n" \
               "df['b'] = ['; '.join(v) for v in txt.values()]"

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))

which yields:

Interim Dict
  Total: 59.1164s
  Avg: 591163748.5887ns

Group By
  Total: 430.6203s
  Avg: 4306203366.1827ns

where time_something is a function that times a code snippet with timeit and returns the results in the format shown above.
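Since time_something lives in a private module and is not shown, here is a rough stand-in that calls the standard timeit module directly. Everything in it is an assumption made for illustration: the test data, the with_dict and with_groupby helpers, and the small run count. It makes no claim about which approach wins; timings depend on the data, the pandas version, and the machine.

```python
import timeit

import pandas as pd

# Hypothetical test data: 10,000 rows of string values spread over 2,500 keys.
df = pd.DataFrame({
    "a": [i % 2500 for i in range(10000)],
    "b": [str(i) for i in range(10000)],
})


def with_dict(frame):
    """Collect values per key in a dict while iterating over the rows."""
    groups = {}
    for _, row in frame.iterrows():
        groups.setdefault(row["a"], []).append(row["b"])
    out = frame.drop_duplicates("a").copy()
    out["b"] = out["a"].map(lambda key: "; ".join(groups[key]))
    return out.reset_index(drop=True)


def with_groupby(frame):
    """Join values per key with groupby, keeping first-appearance order."""
    return frame.groupby("a", sort=False)["b"].apply("; ".join).reset_index()


runs = 3  # kept small here; raise it for steadier numbers
for label, func in [("Interim Dict", with_dict), ("Group By", with_groupby)]:
    total = timeit.timeit(lambda: func(df), number=runs)
    print(f"{label}\n  Total: {total:.4f}s\n  Avg: {total / runs:.4f}s")
```

Both helpers return a new frame and leave df unchanged, so every run measures the same input, and the two results can be compared for equality before trusting the numbers.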

Answer 2: (score: 1)

You can groupby and apply the list function:

>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
  Name   Sid Revenue                    0
0    A  xx01  $10.00         [Voice, SMS]
1    B  xx02   $5.00              [Voice]
2    C  xx03  $15.00  [Voice, SMS, Video]

(If you are worried about duplicates, use set instead of list.)
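One thing to keep in mind with set is that it does not preserve order. If the order of first appearance matters, a possible alternative is to de-duplicate with dict.fromkeys before joining. The sketch below assumes the question's data plus one hypothetical repeated row, added only to show the effect.

```python
import pandas as pd

# The question's data, plus one hypothetical repeated row (A / Voice).
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$10.00", "$5.00",
                "$15.00", "$15.00", "$15.00"],
})

# dict.fromkeys drops repeats while keeping the order of first appearance.
result = (
    df.groupby(["Name", "Sid", "Revenue"])["Use_Case"]
      .agg(lambda values: ", ".join(dict.fromkeys(values)))
      .reset_index()
)
print(result)
```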