Question

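I have two dataframe columns, A and B. For every row [i], all of B is contained in A. I am trying to test A against B and return 1 for every word in the matched phrase and 0 for every other word in A that falls outside phrase B, creating a new dataframe of 0s and 1s.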

    Why would it be competitive, so it's wond...        if the teabaggers hadn't ousted Sen
    Had he refused to attempt something so partisa...   Had he refused to attempt something so partisa...
    "This study would then have to be conducted an...   This study would then have to be conducted and

Expected dataframe:

['0', '0', '0', '0' , '0', '1', '1', '1', '1', '1', '1'........]

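I have mainly tried two approaches. The first, which I found on stackoverflow, tests against the individual words in B rather than the whole phrase in column B, so I get a result like this: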

['0', '1', '0', '0' , '0', '1', '1', '1', '1', '1', '1', ........]

Words in B such as "is" or "and" tend to also occur outside the phrase, so they return false matches.
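
A minimal reproduction of that false-match problem may make it easier to see. The sentence and phrase below are made-up stand-ins for one row of A and B (they are not from the real data), and whitespace tokenization is assumed:

```python
# Hypothetical sentence/phrase pair standing in for one row of A and B
sentence = "it is late and this study is long and hard"
phrase = "this study is long and"

# Per-word membership test against the words of the phrase
phrase_words = set(phrase.split())

# Every occurrence of a phrase word is flagged, including the
# "is" and "and" that appear before the phrase actually starts
labels = ['1' if w in phrase_words else '0' for w in sentence.split()]
print(labels)
# ['0', '1', '0', '1', '1', '1', '1', '1', '1', '0']
```

The second and fourth labels are the unwanted matches: those words belong to the sentence but not to the contiguous phrase.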

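I also tried a regular expression, which works well for a single instance, but I could not apply it to the dataframe successfully. It is a bit tricky: it either returns endless rows of 1s or runs out of memory.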

import re

# A and B are the two dataframe columns; rep_lace (defined elsewhere) is a run of 1's, one for each word in B
# Single alternation pattern built from every phrase in B
rx = '({})'.format('|'.join(re.escape(el) for el in B))
# Generator to yield replaced sentences
it = (re.sub(rx, rep_lace, sentence) for sentence in A)
# Build list of paired new sentences and old to filter out where not the same
results = []
results.append([new_sentence for old_sentence, new_sentence in zip(A, it) if old_sentence != new_sentence])
nw_results = ' '.join([str(elem) for elem in results])
ew_results = nw_results.split(" ")
# Compare by value: `is not '1'` tests object identity, not equality
new_results = ['0' if i != '1' else i for i in ew_results]
labels = [int(e) for e in new_results]

I hope I have explained this clearly enough.

Answer 1

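I don't fully understand what you mean about "is" and "and" and why they produce errors. In general, though, if you are trying to construct a column C from the values in columns A and B, the best approach is to use a lambda function.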

def word_match(col_1, col_2):
    # Gather all words in column B to check column A against
    targets = set(col_2.split())
    # For each word in A, if it's in B then 1, else 0
    output = [1 if x in targets else 0 for x in col_1.split()]
    return output

# Create new column, C, whose value on each row is word_match(A, B) on each row
df['C'] = df.apply(lambda x: word_match(x.A, x.B), axis=1)
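
Note that word_match tests each word of A against the set of words in B, so a stray "is" or "and" outside the phrase would still be marked 1. If the goal is to mark only the contiguous phrase, a minimal sketch along the following lines might work. It assumes whitespace tokenization and exact token equality (so a leading quote or trailing punctuation in A would prevent a match), and phrase_mask is a hypothetical helper name, not part of the answer above:

```python
def phrase_mask(sentence, phrase):
    """Return one 0/1 flag per word of `sentence`.

    Words inside the first contiguous occurrence of `phrase` get 1,
    every other word gets 0. Assumes whitespace tokenization.
    """
    words = sentence.split()
    target = phrase.split()
    mask = [0] * len(words)
    n = len(target)
    if n == 0:
        return mask
    # Slide a window of len(target) words across the sentence
    for start in range(len(words) - n + 1):
        if words[start:start + n] == target:
            mask[start:start + n] = [1] * n
            break  # only the first occurrence is marked
    return mask


# Made-up example row, not taken from the question's data
example = phrase_mask("it is late and this study is long and hard",
                      "this study is long and")
print(example)
# [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
```

It should slot into the same call pattern as above, for example df['C'] = df.apply(lambda x: phrase_mask(x.A, x.B), axis=1). If the phrase is not found as a contiguous run, the row comes back as all zeros.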

Hope this helps!

Compare two dataframe columns of sentence strings and create new values for a third frame
