Basically, I have a DataFrame in which one column holds names and another holds URLs that are related to each name in some way (example df):
Name Domain
'Apple Inc' 'https://mapquest.com/askjdnas387y1/apple-inc', 'https://linkedin.com/apple-inc/askjdnas387y1/', 'https://www.apple-inc.com/asdkjsad542/'
'Aperture Industries' 'https://www.cakewasdelicious.com/aperture/run-away/', 'https://aperture-incorporated.com/aperture/', 'https://www.buzzfeed.com/aperture/the-top-ten-most-evil-companies=will-shock-you/'
'Umbrella Corp' 'https://www.umbrella-corp.org/were-not-evil/', 'https://umbrella.org/experiment-death/', 'https://www.most-evil.org/umbrella-corps/'
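I am trying to find the URLs in which the keyword, or at least a partial match for it, appears directly after either of these prefixes: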
'https://NAME.whateverthispartdoesntmatter'
or
'https://www.NAME.whateverthispartdoesntmatter' <- not a real link
Right now I am using fuzzywuzzy to get partial matches:
fuzz.token_set_ratio(name, value)
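It works really well for partial matches, but the match does not depend on position, so I get a perfect keyword match that sits somewhere in the middle of the URL, which is not what I want: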
https://www.bloomberg.com/profiles/companies/aperture-inc/0117091D
Answer 0 (score: 1)
explode/unnest string, str.extract and fuzzywuzzy
First, we unnest your strings into rows with the explode_str function (defined at the end of this answer):
df = explode_str(df, 'Domain', ',').reset_index(drop=True)
Then we use a regular expression to find the two patterns, with or without www, and extract the name from them:
m = df['Domain'].str.extract(r'https://www\.(.*)\.|https://(.*)\.')
df['M'] = m[0].fillna(m[1])
print(df)
Name Domain M
0 Apple Inc https://mapquest.com/askjdnas387y1/apple-inc mapquest
1 Apple Inc https://linkedin.com/apple-inc/askjdnas387y1/ linkedin
2 Apple Inc https://www.apple-inc.com/asdkjsad542/ apple-inc
3 Aperture Industries https://www.cakewasdelicious.com/aperture/run-... cakewasdelicious
4 Aperture Industries https://aperture-incorporated.com/aperture/ aperture-incorporated
5 Aperture Industries https://www.buzzfeed.com/aperture/the-top-ten... buzzfeed
6 Umbrella Corp https://www.umbrella-corp.org/were-not-evil/ umbrella-corp
7 Umbrella Corp https://umbrella.org/experiment-death/ umbrella
8 Umbrella Corp https://www.most-evil.org/umbrella-corps/ most-evil
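The pattern above is greedy, so a dot further along the URL (for example in the path) would shift what gets captured. As a possible alternative, the standard library's urllib.parse can isolate the host first. This is only a sketch: host_label is a hypothetical helper name, and it assumes the first label after an optional www. is the part you care about, which would not hold for hosts such as shop.example.com.

```python
from urllib.parse import urlsplit

def host_label(url):
    # Hypothetical helper: return the first host label, ignoring a leading "www."
    # Stray whitespace and quotes left over from splitting are stripped first
    host = urlsplit(url.strip().strip("'")).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]

print(host_label("https://www.apple-inc.com/asdkjsad542/"))
print(host_label("https://www.bloomberg.com/profiles/companies/aperture-inc/0117091D"))
```

With pandas it could be applied as df['M'] = df['Domain'].map(host_label).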
Then we use fuzzywuzzy to keep only the rows whose match score is higher than 80:
from fuzzywuzzy import fuzz
m2 = df.apply(lambda x: fuzz.token_sort_ratio(x['Name'], x['M']), axis=1)
df[m2>80]
Name Domain M
2 Apple Inc https://www.apple-inc.com/asdkjsad542/ apple-inc
6 Umbrella Corp https://www.umbrella-corp.org/were-not-evil/ umbrella-corp
Note that I used token_sort_ratio instead of token_set_ratio so that the difference between umbrella and umbrella-corp is caught.
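The function used from the linked answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per item in its separated string
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join all strings, split them into single items, and assign them to the repeated rows
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})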
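On pandas 0.25 or later, the built-in DataFrame.explode can probably stand in for the helper. The sketch below assumes each Domain cell is a single comma-separated string, and it uses a shortened two-row version of the example data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Apple Inc", "Umbrella Corp"],
    "Domain": [
        "https://mapquest.com/askjdnas387y1/apple-inc, https://www.apple-inc.com/asdkjsad542/",
        "https://www.umbrella-corp.org/were-not-evil/, https://umbrella.org/experiment-death/",
    ],
})

# Split each cell into a list, then give every list item its own row
exploded = (
    df.assign(Domain=df["Domain"].str.split(","))
      .explode("Domain")
      .reset_index(drop=True)
)
# Remove the whitespace that followed each comma
exploded["Domain"] = exploded["Domain"].str.strip()
print(exploded)
```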