I have a list of bigrams.
I have a pandas DataFrame containing one row for each document in my corpus. What I want to do is put the bigrams from each document that match my list into a new column in the DataFrame.
What is the best way to accomplish this task? I have been searching for answers on Stack Overflow but haven't found anything that gives me a specific answer. I need the new column to contain every bigram found from my bigram list.
Any help would be greatly appreciated!
The output below is what I am after, although in my actual example I have used stop words, so exact bigrams like the ones below would not be found. Is there a way to match on strings that might contain them?
import pandas as pd
data = [['help me with my python pandas please'], ['machine learning is fun using svd with sklearn']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Message'])
import numpy as np
bigrams = [('python', 'pandas'),
           ('function', 'input'),
           ('help', 'jupyter'),
           ('sklearn', 'svd')]

def matcher(x):
    for i in bigrams:
        # join the tuple into a single string before the substring test
        if ' '.join(i) in x.lower():
            return i
    return np.nan

df['Match'] = df['Message'].apply(matcher)
df
Answer 0 (score: 2)
This is what I would do:
# a sample, which you should've given
df = pd.DataFrame({'sentences': ['I like python pandas',
                                 'find all function input from help jupyter',
                                 'this has no bigrams']})
# the bigrams
bigrams = [('python', 'pandas'),
           ('function', 'input'),
           ('help', 'jupyter'),
           ('sklearn', 'svd')]

# create one big regex pattern:
pat = '|'.join(' '.join(x) for x in bigrams)
new_df = df.sentences.str.findall(pat)
which gives you:
0                   [python pandas]
1    [function input, help jupyter]
2                                []
Name: sentences, dtype: object
You can then choose to unnest the list in each row.
Or you can use get_dummies:
new_df.str.join(',').str.get_dummies(sep=',')
which gives you:
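As a sketch of that unnesting step (assuming pandas 0.25+, which added `Series.explode`; the sample data and pattern are the same as above):

```python
import pandas as pd

df = pd.DataFrame({'sentences': ['I like python pandas',
                                 'find all function input from help jupyter',
                                 'this has no bigrams']})
bigrams = [('python', 'pandas'),
           ('function', 'input'),
           ('help', 'jupyter'),
           ('sklearn', 'svd')]
pat = '|'.join(' '.join(x) for x in bigrams)
new_df = df.sentences.str.findall(pat)

# one row per matched bigram; empty lists become NaN
exploded = new_df.explode()
print(exploded)
```

Rows that matched nothing surface as NaN, which you can drop or fill as needed.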
   function input  help jupyter  python pandas
0               0             0              1
1               1             1              0
2               0             0              0
Answer 1 (score: 1)
OK, here is my solution, whose feature is detecting bigrams in a cleaned-up utterance (sentence).
It also generalizes easily to n-grams, and it takes stop words into account, so you can tune it to your needs.
Note that this implementation is recursive.
import pandas as pd
import re
from nltk.corpus import stopwords

data = [
    ['help me with my python pandas please'],
    ['machine learning is fun using svd with sklearn'],
    ['please use |svd| with sklearn, get help on JupyteR!']
]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Message'])

bigrams = [
    ('python', 'pandas'),
    ('function', 'input'),
    ('help', 'jupyter'),
    ('svd', 'sklearn')
]
stop_words = set(stopwords.words('english'))
sep = ' '

def _cleanup_token(w):
    """ Cleanup a token by stripping special chars """
    return re.sub('[^A-Za-z0-9]+', '', w)

def _preprocessed_tokens(x):
    """ Preprocess a sentence. """
    return list(map(lambda w: _cleanup_token(w), x.lower().split(sep)))

def _match_bg_term_in_sentence(bg, x, depth, target_depth=2):
    """ Recursively match bigram bg in sentence x, one term per level of depth. """
    if depth == target_depth:
        return True  # the whole bigram was matched
    term = bg[depth].lower()
    pp_tokens = _preprocessed_tokens(x)
    if term in pp_tokens:
        bg_idx = pp_tokens.index(term)
        if depth > 0 and any(token not in stop_words for token in pp_tokens[0:bg_idx]):
            return False  # a non-stop word separates the terms: no bigram
        x = sep.join(pp_tokens[bg_idx+1:])
        return _match_bg_term_in_sentence(bg, x, depth+1, target_depth=target_depth)
    else:
        return False
def matcher(x):
    """ Return list of bigrams matched in sentence x """
    matchs = []
    for bg in bigrams:
        if _match_bg_term_in_sentence(bg, x, depth=0, target_depth=2):
            matchs.append(bg)
    return matchs

df['Match'] = df['Message'].apply(matcher)
print(df.head())
We actually get the following result:
                               Match
0                 [(python, pandas)]
1                   [(svd, sklearn)]
2  [(help, jupyter), (svd, sklearn)]
Hope this helps!
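To illustrate the n-gram generalization this answer mentions, here is a minimal standalone sketch of the same recursive idea with the depth driven by the n-gram's length; the stop-word set is a small hard-coded stand-in for NLTK's (so the snippet has no download dependency), and the trigram is hypothetical:

```python
import re

# small stand-in for stopwords.words('english')
stop_words = {'me', 'with', 'my', 'is', 'please', 'using'}
sep = ' '

def _tokens(x):
    """ Lowercase, split, and strip special chars from each token. """
    return [re.sub('[^A-Za-z0-9]+', '', w) for w in x.lower().split(sep)]

def match_ngram(ng, x, depth=0):
    """ Recursively match ng term by term, allowing only stop words between terms. """
    if depth == len(ng):              # every term matched in order
        return True
    toks = _tokens(x)
    term = ng[depth].lower()
    if term not in toks:
        return False
    idx = toks.index(term)
    if depth > 0 and any(t not in stop_words for t in toks[:idx]):
        return False                  # a non-stop word separates the terms
    return match_ngram(ng, sep.join(toks[idx + 1:]), depth + 1)

trigram = ('machine', 'learning', 'fun')  # hypothetical trigram
print(match_ngram(trigram, 'machine learning is fun using svd'))  # True
```

The match succeeds because only the stop word "is" sits between "learning" and "fun"; a content word in that gap would make it fail.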
Answer 2 (score: 1)
flashtext can also be used to solve this problem:
import pandas as pd
from flashtext import KeywordProcessor
from nltk.corpus import stopwords

stop = stopwords.words('english')
bigram_token = ['python pandas', 'function input', 'help jupyter', 'svd sklearn']
data = [['help me with my python pandas please'],
        ['machine learning is fun using svd with sklearn']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Message'])

kp = KeywordProcessor()
kp.add_keywords_from_list(bigram_token)

def bigram_finder(x, stop, kp):
    token = x.split()
    sent = ' '.join([x for x in token if x not in stop])
    return kp.extract_keywords(sent)

df['bigram_token'] = df['Message'].apply(lambda x: bigram_finder(x, stop, kp))
# output
0 [python pandas]
1 [svd sklearn]
Name: bigram_token, dtype: object
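If flashtext is not available, the same idea (drop stop words, then look up consecutive token pairs in the bigram list) can be sketched in plain Python; the stop-word set here is a small hard-coded stand-in for NLTK's list:

```python
# stand-in for stopwords.words('english')
stop = {'me', 'with', 'my', 'is', 'please', 'using'}
bigram_token = {'python pandas', 'function input', 'help jupyter', 'svd sklearn'}

def bigram_finder(msg):
    """ Return the listed bigrams that appear as adjacent tokens after stop-word removal. """
    tokens = [t for t in msg.split() if t not in stop]
    pairs = (' '.join(p) for p in zip(tokens, tokens[1:]))
    return [p for p in pairs if p in bigram_token]

print(bigram_finder('help me with my python pandas please'))            # ['python pandas']
print(bigram_finder('machine learning is fun using svd with sklearn'))  # ['svd sklearn']
```

This is O(number of tokens) per sentence with a set lookup per pair, which is the same trick flashtext generalizes to large keyword lists.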