我正在对银行对帐单(csv)进行分析。诸如麦当劳之类的某些商品各有一行(由于地址不同)。
我正在尝试通过一个常用短语来组合这些行。因此,在此示例中,显而易见的短语或字符串为“ McDonalds”。我认为这将是一个if语句。
此外,该列的dtype为“对象”。我需要将其转换为字符串格式吗?
这是我的代码中打印结果totali = df.Item.value_counts()
的示例输出。
理想情况下,我希望该行将McDonalds输出为单行。 在csv中,它们是2个单独的行。
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
这是列数据的组成部分
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
答案 0 :(得分:0)
好。我想我发现了一些有用的东西。意识到从文本字符串推断类别或名称的任务可能非常艰巨,这取决于您要获得的详细程度。您可以进入regex
或其他学习模型。人们以此为职业!显然,当您获得年末摘要时,您的银行正在对它们进行分类。
无论如何,这是一种生成某些类别并将其用作您要进行分组的基础的简单方法。
import pandas as pd
item=['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)
# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)
# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that containt the string "McDonalds"')
print(temp)
# let's build a simple conversion table in a dictionary
conversions = { 'McDonalds': 'McDonalds - any',
'Sonoco': 'gas',
'Arco': 'gas'}
# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key in conversions:
df['item'].loc[df['item_orig'].str.contains(key)] = conversions[key]
# see how we did...
print('converted...')
print(df)
# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df['item'].loc[df['item'].isnull()] = df['item_orig']
# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that containt the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64