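How to merge common rows in a DataFrame

Time: 2020-07-11 19:23:07

Tags: python python-3.x pandas pandas-groupby

I am analyzing a bank statement (CSV). Some merchants, such as McDonalds, appear on several rows because each location has a different address.

I am trying to combine these rows by a common phrase. In this example, the obvious phrase or string is "McDonalds". I think this will be an if statement.

Also, the column's dtype is "object". Do I need to convert it to a string format?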
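On the dtype question, here is a minimal sketch, assuming the column holds ordinary Python strings (the usual case after reading a CSV). The sample values and the variable names `items` and `is_mcd` are made up for illustration. It suggests that no conversion is needed before using the `.str` methods:

```python
import pandas as pd

# Hypothetical sample values; dtype=object mirrors the dtype reported in the question.
items = pd.Series(
    ["McDonalds Boulder CO", "McDonalds Denver CO", "Restaurant Boulder CO"],
    dtype=object,
)

# The .str accessor works directly on an object column that holds strings.
is_mcd = items.str.contains("McDonalds")

print(items.dtype)
print(is_mcd.tolist())
```

If the column also contained non-string values (numbers, missing entries), `items.astype(str)` would be one way to force everything to text first, at the cost of turning missing values into the literal string "nan".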

Here is sample output from my code, printing totali = df.Item.value_counts().

Ideally, I would like McDonalds to be output as a single row. In the CSV they are 2 separate rows.

foo                                   14
Restaurant Boulder CO                  8
McDonalds Boulder CO                   5
McDonalds Denver CO                    5
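One possible way to get a single McDonalds line is sketched below. It assumes a plain substring match is good enough. The data is synthetic, built only to reproduce the counts shown above, and `merged` and `counts` are illustrative names:

```python
import pandas as pd

# Synthetic data reproducing the counts shown above.
items = pd.Series(
    ["foo"] * 14
    + ["Restaurant Boulder CO"] * 8
    + ["McDonalds Boulder CO"] * 5
    + ["McDonalds Denver CO"] * 5
)

# Replace every value that mentions McDonalds with one shared label,
# leaving all other values unchanged.
merged = items.mask(items.str.contains("McDonalds"), "McDonalds")

counts = merged.value_counts()
print(counts)
```

With this data the two McDonalds rows collapse into one entry with a count of 10, while `foo` and `Restaurant Boulder CO` keep their counts of 14 and 8.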

Here is part of the column data:

'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO'  'Chambers Place Liquors 303-3731100 CO'  "Mcdonald's F26593 Fort Collins CO"  'Suh Sushi Korean Bbq Fort Collins CO'  'Conoco - Sei 26927 Fort Collins CO'
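Note that the sample spells the merchant "Mcdonald's", with different capitalization and an apostrophe, so an exact match on "McDonalds" would miss it. A minimal sketch of a more tolerant match follows. The regular expression is an assumption about which spellings occur in the data, not something taken from the statement itself:

```python
import pandas as pd

# Values copied from the column sample above.
items = pd.Series([
    "Sukiya Greenwood Vil CO",
    "Sei 34179 Denver CO",
    "Chambers Place Liquors 303-3731100 CO",
    "Mcdonald's F26593 Fort Collins CO",
    "Suh Sushi Korean Bbq Fort Collins CO",
    "Conoco - Sei 26927 Fort Collins CO",
])

# Optional apostrophe, case-insensitive: matches "McDonalds" and "Mcdonald's".
pattern = r"mcdonald'?s"
matches = items[items.str.contains(pattern, case=False, regex=True)]

print(matches.tolist())
```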

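1 answer:

Answer 0: (score: 0)

OK. I think I found something useful. Be aware that inferring a category or name from a text string can be quite hard, depending on the level of detail you are after. You could get into regex or learned models. People do this for a living! Evidently your bank categorizes these transactions when it produces your year-end summary.

Anyway, here is a simple way to generate some categories and use them as the basis for your grouping.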

import pandas as pd


item=['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)

# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)

# we'll use the "contains" function in pandas as a simple converter...  quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that containt the string "McDonalds"')
print(temp)

# let's build a simple conversion table in a dictionary
conversions = { 'McDonalds': 'McDonalds - any',
                'Sonoco': 'gas',
                'Arco': 'gas'}

# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key, label in conversions.items():
    # assign through df.loc to avoid chained assignment, which may write to a copy
    df.loc[df['item_orig'].str.contains(key), 'item'] = label

# see how we did...
print('converted...')
print(df)

# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df['item'] = df['item'].fillna(df['item_orig'])


# now we have decent labels to support grouping!
print('\n\n  *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())

Output:

             item_orig charge
0     McDonalds Denver  12.44
1               Sonoco      4
2              ATM Fee      3
3  Sonoco, Ft. Collins  14.99
4   McDonalds, Boulder   19.1
5         Arco Boulder  52.99

items that containt the string "McDonalds"
            item_orig charge item
0    McDonalds Denver  12.44  NaN
4  McDonalds, Boulder   19.1  NaN
converted...
             item_orig charge             item
0     McDonalds Denver  12.44  McDonalds - any
1               Sonoco      4              gas
2              ATM Fee      3              NaN
3  Sonoco, Ft. Collins  14.99              gas
4   McDonalds, Boulder   19.1  McDonalds - any
5         Arco Boulder  52.99              gas


  *** sum of charges by group ***
item
ATM Fee             3.00
McDonalds - any    31.54
gas                71.98
Name: charge, dtype: float64
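The answer's code notes that there is probably a faster way than looping over the dictionary. One possible vectorized variant is sketched here, using the same sample data and conversion table. Building the frame from a dict and the names `pattern`, `matched_key`, and `totals` are choices made for this sketch, not part of the original answer:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "item_orig": ["McDonalds Denver", "Sonoco", "ATM Fee",
                  "Sonoco, Ft. Collins", "McDonalds, Boulder", "Arco Boulder"],
    "charge": [12.44, 4.00, 3.00, 14.99, 19.10, 52.99],
})
conversions = {"McDonalds": "McDonalds - any", "Sonoco": "gas", "Arco": "gas"}

# One alternation pattern with a single capture group: (McDonalds|Sonoco|Arco)
pattern = "(" + "|".join(re.escape(key) for key in conversions) + ")"

# extract: first matching key per row (NaN if none)
# map:     key -> category label
# fillna:  rows with no match keep their original text
matched_key = df["item_orig"].str.extract(pattern, expand=False)
df["item"] = matched_key.map(conversions).fillna(df["item_orig"])

totals = df.groupby("item")["charge"].sum()
print(totals)
```

One difference from the loop: if a row contained more than one key, this version uses the key that appears first in the text, whereas the loop keeps whichever key comes last in the dictionary. For the sample data the two give the same groups.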