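I am creating a new string column from the values of other columns that meet a condition.
My goal is to scale this up to read 12 fields / 30,000 rows of misclassified data.
Sample data: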
import pandas as pd

# column names match the table below and the code later in the post
df = pd.DataFrame({'sparkle': ['furry boots', 'weird boots', 'furry gloves', 'weird gloves', 'furry coat', 'weird coat'],
                   'misty': ['furry animal', 'big animal', 'furry fence', 'old fence', 'furry door', 'old door'],
                   'crazy': ['heckin food', 'furry food', 'furry toes', 'old toes', 'furry hat', 'crazy cat']})
df
+---+--------------+--------------+-------------+
| | sparkle | misty | crazy |
+---+--------------+--------------+-------------+
| 0 | furry boots | furry animal | heckin food |
| 1 | weird boots | big animal | furry food |
| 2 | furry gloves | furry fence | furry toes |
| 3 | weird gloves | old fence | old toes |
| 4 | furry coat | furry door | furry hat |
| 5 | weird coat | old door | crazy cat |
+---+--------------+--------------+-------------+
Desired output:
+---+--------------+--------------+-------------+---------------------------------------+
| | sparkle | misty | crazy | furry |
+---+--------------+--------------+-------------+---------------------------------------+
| 0 | furry boots | furry animal | heckin food | furry boots, furry animal |
| 1 | weird boots | big animal | furry food | furry food |
| 2 | furry gloves | furry fence | furry toes | furry gloves, furry fence, furry toes |
| 3 | weird gloves | old fence | old toes | |
| 4 | furry coat | furry door | furry hat | furry coat, furry door, furry hat |
| 5 | weird coat | old door | crazy cat | |
+---+--------------+--------------+-------------+---------------------------------------+
My current solution:
df['furry'] = ''
df
df.loc[df['sparkle'].str.contains('furry'), 'furry'] = df['sparkle']
df.loc[df['misty'].str.contains('furry'), 'furry'] = df['furry'] + ', ' + df['misty']
df.loc[df['crazy'].str.contains('furry'), 'furry'] = df[['furry', 'crazy']].apply(lambda x: ', '.join(x), axis=1)
df
+---+--------------+--------------+-------------+---------------------------------------+
| | sparkle | misty | crazy | furry |
+---+--------------+--------------+-------------+---------------------------------------+
| 0 | furry boots | furry animal | heckin food | furry boots, furry animal |
| 1 | weird boots | big animal | furry food | , furry food |
| 2 | furry gloves | furry fence | furry toes | furry gloves, furry fence, furry toes |
| 3 | weird gloves | old fence | old toes | |
| 4 | furry coat | furry door | furry hat | furry coat, furry door, furry hat |
| 5 | weird coat | old door | crazy cat | |
+---+--------------+--------------+-------------+---------------------------------------+
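This "works", and I could clean up the result afterwards, but it feels clunky. I'm hoping to learn a better way here.
What I have tried and am struggling with:
As mentioned above, I want to reduce this to something that reads 12 columns, many rows, and a list of words. I feel like I'm close... I've looked at ''.join() and skimmed the docs for concat() and merge()... I'm just confused.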
df = pd.DataFrame({'sparkle': ['furry boots', 'weird boots', 'furry gloves', 'weird gloves', 'furry coat', 'weird coat'],
'misty': ['furry animal', 'big animal', 'furry fence', 'old fence', 'furry door', 'old door'],
'crazy': ['heckin food', 'furry food', 'furry toes', 'old toes', 'furry hat', 'crazy cat']})
df['furry'] = ''
words = ['furry', 'old'] # added another word to demonstrate intent with real data
for key, value in df.items():
    df.loc[df[key].str.contains('|'.join(words)), 'furry'] = df['furry'] + ', ' + df[key]
df
+---+--------------+--------------+-------------+----------------------------------------------------------------------------------+
| | sparkle | misty | crazy | furry |
+---+--------------+--------------+-------------+----------------------------------------------------------------------------------+
| 0 | furry boots | furry animal | heckin food | , furry boots, furry animal, , furry boots, furry animal |
| 1 | weird boots | big animal | furry food | , furry food, , furry food |
| 2 | furry gloves | furry fence | furry toes | , furry gloves, furry fence, furry toes, , furry gloves, furry fence, furry toes |
| 3 | weird gloves | old fence | old toes | , old fence, old toes, , old fence, old toes |
| 4 | furry coat | furry door | furry hat | , furry coat, furry door, furry hat, , furry coat, furry door, furry hat |
| 5 | weird coat | old door | crazy cat | , old door, , old door |
+---+--------------+--------------+-------------+----------------------------------------------------------------------------------+
Does anyone have any pointers or tips? Thanks for reading.
Answer 0 (score: 3)
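Use the apply function:
words = ['furry', 'old']
for word in words:
    # one new column per word, holding the matching cells of each row
    df[word] = df.apply(lambda x: ', '.join([str(c) for c in x if word in str(c)]), axis=1)
# join the per-word columns, skipping empty strings so no stray separators appear
df['all_combined'] = df[words].apply(lambda x: ', '.join(s for s in x if s), axis=1)
# drop the intermediate per-word columns
df = df.drop(words, axis=1)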
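Update: you can loop over several words and create a new column for each one.
Update 2: likewise, you can use apply to combine them into a single column.
Solution 2: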
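One thing to watch with the per-word loop above: df.apply scans every column in the frame, including the columns created for earlier words. A minimal sketch of one way to avoid that, assuming the columns to search are known up front (the name source_cols is hypothetical, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({
    'sparkle': ['furry boots', 'weird boots', 'furry gloves', 'weird gloves', 'furry coat', 'weird coat'],
    'misty': ['furry animal', 'big animal', 'furry fence', 'old fence', 'furry door', 'old door'],
    'crazy': ['heckin food', 'furry food', 'furry toes', 'old toes', 'furry hat', 'crazy cat'],
})

words = ['furry', 'old']
source_cols = ['sparkle', 'misty', 'crazy']  # hypothetical name: the columns to search

for word in words:
    # Search only the original columns, so a column created for an
    # earlier word is never scanned again.
    df[word] = df[source_cols].apply(
        lambda row, w=word: ', '.join(str(c) for c in row if w in str(c)),
        axis=1,
    )
```

With 12 real columns, source_cols would list those 12 names; the loop itself should not need to change.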
words = ['furry', 'old']
df['all_combined'] = df.apply(lambda x: ', '.join([str(c) for c in x if any([w in str(c) for w in words])]), axis=1)