Question

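I am working with Twitter hashtags and have already counted how many times each one appears, with the counts stored in a CSV file. My CSV file looks like this: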

GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10

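Now I would like to use the fuzzywuzzy library to group terms that are close to each other, such as "GilletsJaunes" and "gilletsjaune". If the similarity score between two terms is greater than 80, their values should be summed under one of the two terms and the other term deleted. This would give: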

GilletsJaunes, 120
Macron, 50
tax, 10

To use fuzzywuzzy:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output

Answer 1

First, copy these two functions so that you can compute the argmax:

# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))

Second, load the contents of the CSV into a Python dictionary and merge the close terms, which gives:

{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
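
One possible way to get from the CSV to a merged dictionary like the one above is sketched here. It is not the answerer's code. The helper names `load_counts`, `similarity` and `merge_close_terms` are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for `fuzz.ratio` so the sketch runs without third-party packages. Comparing in lowercase and keeping the most frequent spelling of each group are assumptions.

```python
import csv
import io
from difflib import SequenceMatcher

# The sample data from the question, held in memory instead of a file.
CSV_TEXT = """GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10
"""


def load_counts(text):
    # skipinitialspace drops the blank that follows each comma
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return {row[0]: int(row[1]) for row in reader if row}


def similarity(a, b):
    # Stand-in for fuzz.ratio: a 0-100 score, compared case-insensitively (assumption)
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def merge_close_terms(counts, threshold=80):
    merged = {}
    # Visit terms from most to least frequent, so each group keeps
    # the spelling with the highest count as its key.
    for term, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        match = next((k for k in merged if similarity(k, term) > threshold), None)
        if match is None:
            merged[term] = count
        else:
            merged[match] += count
    return merged


counts = load_counts(CSV_TEXT)
result = merge_close_terms(counts)
print(result)
```

The grouping is greedy: each term is merged into the first earlier key that it matches. A different visiting order can therefore produce different groups when several terms are close to one another.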

Answer 2

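This can solve your problem. You can reduce the input sample by first converting the tags to lowercase. I am not sure exactly how fuzzywuzzy works, but I would suspect that "Helol", "hello" and "HELLO" always score above 80 against one another, and they represent the same word.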
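
The lowercase step could look something like the sketch below, which sums the counts of tags that differ only in case before any fuzzy comparison is run. The function name `collapse_case` and the sample counts are made up for illustration.

```python
from collections import Counter


def collapse_case(counts):
    # Sum the counts of tags that are identical once lowercased.
    collapsed = Counter()
    for term, count in counts.items():
        collapsed[term.lower()] += count
    return dict(collapsed)


# Hypothetical counts, for illustration only
sample = {"hello": 5, "HELLO": 3, "Helol": 1}
collapsed = collapse_case(sample)
print(collapsed)  # {'hello': 8, 'helol': 1}
```

Exact case variants disappear at this stage, so the fuzzy matcher only has to deal with genuine misspellings such as "helol". The trade-off is that the original capitalization of the tag is lost.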

Group strings by value in Python
