我在pandas中使用str.contains进行文本分析。如果对于句子"我的最新数据工作是分析师" ,我想要一个词组合"数据" &安培; "分析"但同时我想指定用于组合的两个单词之间的单词数量(这里是"数据"和#34;分析师"之间的2个单词。目前我正在使用(DataFile.XXX.str.contains(' job')& DataFile.XXX.str.contains(' Analyst')获取"工作分析师&#34的计数;。 如何指定str.contains语法中2个单词之间的单词数。 提前致谢
答案 0 :(得分:0)
你做不到。至少,不是以简单或标准化的方式。
即使是基础知识,比如你如何定义“单词”,也比你想象的更复杂很多。单词解析和词汇接近(例如“在句子s中彼此距离D内的两个单词?”)是natural language processing (NLP)的领域。 NLP和邻近搜索不是基本Pandas的一部分,也不是Python标准字符串处理的一部分。您可以导入类似NLTK, the Natural Language Toolkit之类的内容以一般方式解决此问题,但这是一个完整的'其他故事。
让我们看一个简单的方法。首先,您需要一种方法将字符串解析为单词。以下是NLP标准的粗略内容,但适用于更简单的情况:
def parse_words(s):
"""
Simple parser to grab English words from string.
CAUTION: A simplistic solution to a hard problem.
Many possibly-important edge- and corner-cases
not handled. Just one example: Hyphenated words.
"""
return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
然后,您需要一种方法来查找列表中找到目标词的所有索引:
def list_indices(target, seq):
"""
Return all indices in seq at which the target is found.
"""
indices = []
cursor = 0
while True:
try:
index = seq.index(target, cursor)
except ValueError:
return indices
else:
indices.append(index)
cursor = index + 1
最后决策制作包装:
def words_within(target_words, s, max_distance, case_insensitive=True):
"""
Determine if the two target words are within max_distance positiones of one
another in the string s.
"""
if len(target_words) != 2:
raise ValueError('must provide 2 target words')
# fold case for case insensitivity
if case_insensitive:
s = s.casefold()
target_words = [tw.casefold() for tw in target_words]
# for Python 2, replace `casefold` with `lower`
# parse words and establish their logical positions in the string
words = parse_words(s)
target_indices = [list_indices(t, words) for t in target_words]
# words not present
if not target_indices[0] or not target_indices[1]:
return False
# compute all combinations of distance for the two words
# (there may be more than one occurance of a word in s)
actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]
# answer whether the minimum observed distance is <= our specified threshold
return min(actual_distances) <= max_distance
那么:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
唯一要做的就是将其映射回Pandas:
df = pd.DataFrame({'desc': [
'My latest Data job was an Analyst',
'some day my prince will come',
'Oh, somewhere over the rainbow bluebirds fly',
"Won't you share a common disaster?",
'job! rainbow! analyst.'
]})
df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
这基本上就是你如何解决这个问题。请记住,这是一个粗略而简单的解决方案。一些简单提出的问题不是简单的回答。 NLP问题经常出现在其中。