Question

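I have text data that I want to classify. Using a for loop in which I specify individual strings, I can check whether a particular word or phrase is present in the rows of another column. If it is, the loop appends a specific value to a new list. The new list is then added to the DataFrame. However, this approach is too unwieldy for my real data, because I would need to spell out many strings for many tests.

Is there a way to group the individual strings in a single data structure that the loop can search through? That is, each test in the loop would refer to just one data structure rather than to individual strings spelled out inside the loop. Can this be done?

Below is a reproducible example of what I have done so far, which highlights the problem.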

    import pandas as pd

    data = {
        'opinion': ['He said it was too expensive',
                    'She said it was too costly',
                    'He thought it was not fast enough',
                    'They thought is was not right and too much money',
                    'Her view was that it was too small and too slow',
                   ]}

    df = pd.DataFrame(data, columns=['opinion'])
    df

This creates:

    opinion
0   He said it was too expensive
1   She said it was too costly
2   He thought it was not fast enough
3   They thought is was not right and too much money
4   Her view was that it was too small and too slow

This for loop then performs the classification:

new_col=[]

for row in df['opinion']:
    if 'too expensive' in row or 'too costly' in row or 'too much money' in row:
        new_col.append('Too Expensive')
    elif 'not fast enough' in row or 'too slow' in row:
        new_col.append('Too Slow')

df['reason'] = new_col
df

    opinion                                           reason
0   He said it was too expensive                      Too Expensive
1   She said it was too costly                        Too Expensive
2   He thought it was not fast enough                 Too Slow
3   They thought is was not right and too much money  Too Expensive
4   Her view was that it was too small and too slow   Too Slow

In my real data, however, I can't write out multiple individual strings inside the loop for each test; there are simply too many.

Answer 1

You can keep a list of dictionaries in which the keys are the replacement values and the lists contain the to_replace words:

    words = [{'Too Expensive': ['too expensive', 'too costly', 'too much money'],
              'Too Slow': ['not fast enough', 'too slow']}]

Then loop over words and use str.contains with a regex that checks all of the to_replace strings at once, and .loc[] to identify the matching rows and assign the replacement:

    for word in words:
        for replacement, to_replace in word.items():
            df.loc[df.opinion.str.contains('|'.join(to_replace)), 'reason'] = replacement

This gives:

        opinion                                           reason
    0   He said it was too expensive                      Too Expensive
    1   She said it was too costly                        Too Expensive
    2   He thought it was not fast enough                 Too Slow
    3   They thought is was not right and too much money  Too Expensive
    4   Her view was that it was too small and too slow   Too Slow
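As a hedged sketch of the same idea, the list wrapper can be dropped and a plain dictionary used directly. The names `categories`, `label`, and `phrases` are illustrative and not taken from the answer above. `re.escape` is added on the assumption that the phrases are literal text rather than regex patterns, and `case=False` and `na=False` are optional choices for mixed-case text and missing values.

```python
import re
import pandas as pd

df = pd.DataFrame({'opinion': [
    'He said it was too expensive',
    'She said it was too costly',
    'He thought it was not fast enough',
    'They thought is was not right and too much money',
    'Her view was that it was too small and too slow',
]})

# Hypothetical mapping: category label -> phrases that trigger it
categories = {
    'Too Expensive': ['too expensive', 'too costly', 'too much money'],
    'Too Slow': ['not fast enough', 'too slow'],
}

# Start with an empty column so unmatched rows stay as None
df['reason'] = None

for label, phrases in categories.items():
    # Escape each phrase so regex metacharacters are treated literally
    pattern = '|'.join(re.escape(p) for p in phrases)
    mask = df['opinion'].str.contains(pattern, case=False, na=False)
    df.loc[mask, 'reason'] = label

print(df)
```

When a row matches more than one category, the later dictionary entry overwrites the earlier one, whereas the if/elif loop in the question keeps the first match.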


Answer 2

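This should work:

# All phrases that should be labelled "Too Expensive"
test_strings = ['too expensive', 'too costly', 'too much money']
new_col = []
for row in df['opinion']:
    for tester in test_strings:
        if tester in row:
            new_col.append("Too Expensive")
            # Stop at the first hit so each row is appended at most once
            break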
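To cover more than one category in plain Python, a minimal sketch could look like the following, assuming the first matching category should win (as with if/elif in the question). The helper `classify` and its `default` parameter are hypothetical names introduced for illustration.

```python
# Hypothetical mapping: category label -> phrases that trigger it
categories = {
    'Too Expensive': ['too expensive', 'too costly', 'too much money'],
    'Too Slow': ['not fast enough', 'too slow'],
}

def classify(text, categories, default=None):
    """Return the first label whose phrase list has a phrase contained in text."""
    for label, phrases in categories.items():
        if any(phrase in text for phrase in phrases):
            return label
    # No phrase matched: return the default so every row still gets a value
    return default

opinions = [
    'He said it was too expensive',
    'She said it was too costly',
    'He thought it was not fast enough',
    'They thought is was not right and too much money',
    'Her view was that it was too small and too slow',
]

new_col = [classify(row, categories) for row in opinions]
print(new_col)
```

With the DataFrame from the question, the same helper can fill the column via `df['reason'] = [classify(row, categories) for row in df['opinion']]`. Returning a default for unmatched rows keeps the list the same length as the DataFrame.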

Answer 3

I think using RegEx would be more convenient in this case:

df['reason'] = ''

# .ix was removed in pandas 1.0; .loc is the label-based equivalent
df.loc[df.opinion.str.lower().str.contains(r'too\s+(?:expensive|costly|much money)'), 'reason'] = 'Too Expensive'

df.loc[df.opinion.str.lower().str.contains(r'(?:not fast enough|too slow)'), 'reason'] = 'Too Slow'

In [309]: df
Out[309]:
                                            opinion         reason
0                      He said it was too expensive  Too Expensive
1                        She said it was too costly  Too Expensive
2                 He thought it was not fast enough       Too Slow
3  They thought is was not right and too much money  Too Expensive
4   Her view was that it was too small and too slow       Too Slow
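If several patterns need to be tested, one possible vectorized sketch uses `numpy.select`, which picks, for each row, the first label whose condition is true. The `patterns` dictionary and the use of `case=False` in place of `str.lower()` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'opinion': [
    'He said it was too expensive',
    'She said it was too costly',
    'He thought it was not fast enough',
    'They thought is was not right and too much money',
    'Her view was that it was too small and too slow',
]})

# Hypothetical mapping: category label -> regex that identifies it
patterns = {
    'Too Expensive': r'too\s+(?:expensive|costly|much money)',
    'Too Slow': r'not fast enough|too slow',
}

# One boolean Series per pattern, in the same order as the labels
conditions = [
    df['opinion'].str.contains(pat, case=False, na=False)
    for pat in patterns.values()
]

# The first true condition wins; rows with no match get the empty string
df['reason'] = np.select(conditions, list(patterns.keys()), default='')

print(df)
```

Non-capturing groups `(?:...)` are used so that `str.contains` does not warn about unused match groups.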

Answer 4

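Pandas has a quick solution for applying a function to rows, and .apply is designed for exactly that. Ideally a vectorized approach would be fastest, but I can't think of a way to do that here. .apply comes next, and iterating over rows is the slowest, so it's best to avoid it whenever possible.

Also, you may want to use a dictionary for the keyword list, so that the list of potential keywords is easy to extend.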

def categorizer(x):
    # Map each keyword to the category it should produce
    main_dict = {"too much money":"too expensive", "too expensive":"too expensive", "too costly":"too expensive", "too slow":"too slow", "not fast enough": "not fast enough"}
    for key in main_dict:
        if key in x:
            return main_dict[key]
    # Falls through and returns None when no keyword is found

df["Category"] = df["opinion"].apply(categorizer)
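One possible vectorized alternative, offered as a sketch and not as a benchmarked recommendation: combine all keywords into a single regex, pull out the first keyword found in each row with `str.extract`, and translate it through the dictionary with `map`. The `keyword_to_category` mapping below is an assumption that uses the question's two labels; it is not the `main_dict` from the answer.

```python
import re
import pandas as pd

df = pd.DataFrame({'opinion': [
    'He said it was too expensive',
    'She said it was too costly',
    'He thought it was not fast enough',
    'They thought is was not right and too much money',
    'Her view was that it was too small and too slow',
]})

# Hypothetical mapping: lowercase keyword -> category label
keyword_to_category = {
    'too much money': 'Too Expensive',
    'too expensive': 'Too Expensive',
    'too costly': 'Too Expensive',
    'too slow': 'Too Slow',
    'not fast enough': 'Too Slow',
}

# One capture group containing every keyword, escaped as literal text
pattern = '(' + '|'.join(re.escape(k) for k in keyword_to_category) + ')'

# Lowercase first so the extracted text matches the dictionary keys exactly
matched = df['opinion'].str.lower().str.extract(pattern, expand=False)

# Translate the extracted keyword into its category
df['Category'] = matched.map(keyword_to_category)

print(df)
```

Rows that contain none of the keywords end up as NaN in `Category`, and only the first keyword found in each row is used.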

Searching for strings in a for loop without referencing each string individually inside the loop
