Python text processing (str.contains)

Time: 2017-12-30 16:07:55

Tags: python string pandas contains

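I am using str.contains in pandas for text analysis. For the sentence "My latest Data job was an Analyst", I want to match the word combination "Data" & "Analyst", but I also want to specify the number of words allowed between the two words of the combination (here there are 2 words between "Data" and "Analyst"). Currently I am using (DataFile.XXX.str.contains(' job') & DataFile.XXX.str.contains(' Analyst')) to get the count for "job Analyst". How can I specify the number of words between the 2 words in the str.contains syntax? Thanks in advance.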

1 Answer:

Answer 0: (score: 0)

You can't. At least, not in a simple or standardized way.

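Even the basics, like how you define a "word", are a lot more complicated than you might imagine. Word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.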
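
To see why even defining a "word" is slippery, here is a small sketch comparing two naive tokenizers on the same strings. The function names are made up for illustration, and neither one is a recommended tokenizer:

```python
import re

def split_on_whitespace(text):
    # A "word" is any run of non-space characters, punctuation included.
    return text.split()

def split_on_word_chars(text):
    # A "word" is any run of letters, digits, or underscores.
    return re.findall(r"\w+", text)

for text in ["job! rainbow! analyst.", "a state-of-the-art parser", "don't stop"]:
    print(split_on_whitespace(text), split_on_word_chars(text))
```

The two disagree on punctuation, hyphens, and apostrophes, so the "distance" between two words depends on which definition you pick.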

Let's look at a simple approach. First, you need a way to parse a string into words. The following is crude by NLP standards, but works for simpler cases:

import re

def parse_words(s):
    """
    Simple parser to grab English words from string.
    CAUTION: A simplistic solution to a hard problem. 
             Many possibly-important edge- and corner-cases 
             not handled. Just one example: Hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)

E.g.:

>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']

Then you need a way to find all the indices in a list at which a target word is found:

def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
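
If the explicit cursor loop feels heavy, the same result can be sketched with `enumerate`. The name `list_indices_enum` is made up here so it does not shadow the function above:

```python
def list_indices_enum(target, seq):
    """Return all indices in seq at which target is found (enumerate-based sketch)."""
    return [i for i, item in enumerate(seq) if item == target]

words = ['and', "don't", 'think', 'this', "day's", 'last',
         'moment', "won't", 'come', 'at', 'last']
print(list_indices_enum('last', words))     # [5, 10]
print(list_indices_enum('missing', words))  # []
```

Both versions return an empty list when the target is absent, which is what the wrapper relies on.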

And finally, a decision-making wrapper:

def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')

    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`

    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]

    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False

    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s);
    # abs() makes the distance independent of which word comes first
    actual_distances = [abs(i2 - i1) for i2 in target_indices[1] for i1 in target_indices[0]]

    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance

Then:

>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True

>>> words_within(["think", 'moment'], s, 2)
False

The only thing left to do is map this back to Pandas:

import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})

df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
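
Since the question asks for a count, the boolean columns can simply be summed. The sketch below is self-contained, so it uses a compact stand-in called `within` (a made-up name, not the wrapper above) that applies the same idea of comparing word positions:

```python
import re
import pandas as pd

def within(first, second, text, max_distance):
    """True if the two words occur within max_distance positions of each other."""
    words = re.findall(r"\w+(?:'[st])?", text.casefold())
    pos1 = [i for i, w in enumerate(words) if w == first.casefold()]
    pos2 = [i for i, w in enumerate(words) if w == second.casefold()]
    # any() over an empty product is False, which covers "word not present"
    return any(abs(i2 - i1) <= max_distance for i1 in pos1 for i2 in pos2)

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})

df['ja2'] = df.desc.apply(lambda x: within("job", "analyst", x, 2))
df['ja3'] = df.desc.apply(lambda x: within("job", "analyst", x, 3))

# True counts as 1, so summing a boolean column gives the number of matching rows
print(df[['ja2', 'ja3']].sum())
```

In the first row "job" and "analyst" are 3 positions apart, so it matches only at distance 3. The last row matches at both thresholds.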

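That's basically how you'd solve this problem. Keep in mind, it's a rough and simple solution. Some simply-posed questions are not simply answered. NLP questions are often among them.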
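
If you would rather stay inside str.contains, a regular expression can approximate the proximity test for simple cases. This is only a sketch under assumptions: `proximity_pattern` is a made-up helper, a "word" is whatever `\w+` matches (so "don't" counts as two words, unlike the parser above), and the limit is the number of words between the targets, not their positional distance:

```python
import re
import pandas as pd

def proximity_pattern(first, second, max_between):
    """Regex matching the two words, in either order, with at most
    max_between other words between them."""
    a, b = re.escape(first), re.escape(second)
    gap = r"\W+(?:\w+\W+){0,%d}" % max_between
    return r"\b(?:%s%s%s|%s%s%s)\b" % (a, gap, b, b, gap, a)

desc = pd.Series([
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
])

# case=False makes the match case-insensitive; the pattern has only
# non-capturing groups, so it can be passed to str.contains directly
hits1 = desc.str.contains(proximity_pattern("job", "analyst", 1), case=False)
hits2 = desc.str.contains(proximity_pattern("job", "analyst", 2), case=False)
print(hits1.sum(), hits2.sum())
```

It shares the same caveats as everything above: it is fine for quick counts on clean text, but not a substitute for real tokenization.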