Suppose I have a dataset df like this:

 x  | y
----|---------
foo | 1.foo-ya
bar | 2.bar-ga
baz | 3.ha-baz
qux | None

I want to filter for the rows where x sits exactly in the middle of y (neither at the start nor at the end, i.e. matching the pattern '^.+\w+.+$', which hits rows 1 and 2), excluding None/NaN:

 x  | y
----|---------
foo | 1.foo-ya
bar | 2.bar-ga
This is a typical pairwise string comparison, which is easy in SQL:
select x, y from df where y like concat('^.+', x, '.+%');
or in R:
library(dplyr)
library(stringr)
library(glue)
df %>% filter(str_detect(y, glue('^.+{x}.+$')))
However, not being a pandas expert, it seems to me that pandas has no similarly simple "vectorized" regex-matching method? I resorted to the lambda approach:
import pandas as pd
import re

df.loc[df.apply(lambda row: bool(re.search(
    '^.+' + row.x + '.+$', row.y))
    if row.x and row.y else False, axis=1), :]
Is there a more elegant way to do this in pandas?
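For reference, one apply-free alternative pairs the two columns with zip and builds a boolean mask in a list comprehension (a minimal sketch, assuming a df constructed from the sample data above):

```python
import re

import pandas as pd

df = pd.DataFrame({'x': ['foo', 'bar', 'baz', 'qux'],
                   'y': ['1.foo-ya', '2.bar-ga', '3.ha-baz', None]})

# pair the columns row by row and guard against None/NaN before matching
mask = [isinstance(y, str) and bool(re.search(f'^.+{x}.+$', y))
        for x, y in zip(df['x'], df['y'])]
print(df[mask])  # rows foo and bar
```

This trades the row-wise apply for an explicit comprehension; it is not faster asymptotically, but the None guard is more visible.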
Additionally, from the matched records produced in the first part, I want to extract the leading number (1, 2, ...):

 x  | y        | z
----|----------|---
foo | 1.foo-ya | 1
bar | 2.bar-ga | 2
In R, I can do this with a simple piped wrangle:
df %>%
    filter(str_detect(y, glue('^.+{x}.+$'))) %>%
    mutate(z=str_replace(y, glue('^(\\d+)\\.{x}.+$'), '\\1') %>%
               as.numeric)
But in pandas, the lambda approach is the only one I know. Is there a better way?
a = df.loc[df.apply(lambda row: bool(
        re.search('^.+' + row.x + '.+$', row.y))
        if row.x and row.y else False, axis=1),
    ['x', 'y']]
a['z'] = a.apply(lambda row: re.sub(
    r'^(\d+)\.' + row.x + '.+$', r'\1', row.y), axis=1).astype('int')
a
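If the leading number always sits before a dot at the start of y, the second step on its own can use pandas' vectorized str.extract instead of a per-row re.sub (a sketch, assuming a already holds the filtered rows):

```python
import pandas as pd

# stand-in for the already-filtered frame from the first step
a = pd.DataFrame({'x': ['foo', 'bar'],
                  'y': ['1.foo-ya', '2.bar-ga']})

# str.extract applies one regex to the whole column at once;
# the captured group becomes the new column
a['z'] = a['y'].str.extract(r'^(\d+)\.', expand=False).astype(int)
print(a['z'].tolist())  # [1, 2]
```

The per-row value of x is only needed for the matching step, not for pulling out the leading digits, so this part vectorizes cleanly.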
By the way, the assign method does not work properly here:
df.loc[df.apply(lambda row: bool(re.search(
        '^.+' + row.x + '.+$', row.y))
        if row.x and row.y else False, axis=1),
    ['x', 'y']].assign(z=lambda row: re.sub(
        r'^(\d+)\.' + row.x + '.+$', r'\1', row.y))
Thanks!
Answer 0 (score: 1)
pandas string operations are built on top of Python's string and re modules. Try this and see if it is what you want:
import re

import numpy as np

# find out whether the value in column x occurs inside column y,
# according to the pattern given in the question;
# guard against the NaN produced by the None in column y
pattern = [re.match(fr'^.+{a}.+$', b) if isinstance(b, str) else None
           for a, b
           in zip(df.x.str.strip(),
                  df.y.str.strip())
           ]
match = [ent.group() if ent is not None else np.nan for ent in pattern]

# extract the digit immediately preceding the value in column x
ext = [re.search(fr'\d(?=\.{a})', b) if isinstance(b, str) else None
       for a, b in
       zip(df.x.str.strip(),
           df.y.str.strip())]
extract = [ent.group() if ent is not None else np.nan for ent in ext]

df['match'], df['extract'] = match, extract
x y match extract
1 foo 1.foo-ya 1.foo-ya 1
2 bar 2.bar-ga 2.bar-ga 2
3 baz 3.ha-baz NaN NaN
4 qux None NaN NaN
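To recover the filtered frame with a numeric z from these helper columns, the non-matches can then be dropped (a sketch, assuming the match and extract columns built above):

```python
import numpy as np
import pandas as pd

# stand-in for df after the match/extract columns were added
df = pd.DataFrame({
    'x': ['foo', 'bar', 'baz', 'qux'],
    'y': ['1.foo-ya', '2.bar-ga', '3.ha-baz', None],
    'match': ['1.foo-ya', '2.bar-ga', np.nan, np.nan],
    'extract': ['1', '2', np.nan, np.nan],
})

# keep only the matched rows and cast the captured digits to int
out = (df.dropna(subset=['match'])
         .assign(z=lambda d: d['extract'].astype(int))
         [['x', 'y', 'z']])
print(out['z'].tolist())  # [1, 2]
```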
Answer 1 (score: 0)
Thanks for all the inspiring replies. I have to say that although Python excels in many areas, for this kind of vectorized operation I still prefer R. So I reinvented the wheel for this case.
import re
from typing import List

import numpy as np
import pandas as pd


def str_detect(string: pd.Series, pattern: pd.Series) -> List[bool]:
    """mimic str_detect in R
    """
    string, pattern = list(string), list(pattern)
    if len(string) > len(pattern):
        # recycle the last pattern to match the length of string
        pattern.extend([pattern[-1]] * (len(string) - len(pattern)))
    elif len(string) < len(pattern):
        pattern = pattern[:len(string)]
    return [bool(re.match(y, x)) if x and y else False
            for x, y in zip(string, pattern)]


def str_extract(string: pd.Series, pattern: pd.Series) -> List[str]:
    """mimic str_extract in R
    """
    string, pattern = list(string), list(pattern)
    if len(string) > len(pattern):
        pattern.extend([pattern[-1]] * (len(string) - len(pattern)))
    elif len(string) < len(pattern):
        pattern = pattern[:len(string)]
    o = [re.search(y, x) if x and y else None
         for x, y in zip(string, pattern)]
    return [x.group() if x else np.nan for x in o]
Then:

df.loc[str_detect(
    df['y'], '^.+' + df['x'] + '.+$'), ['x', 'y']]

(df
 .assign(z=str_extract(df['y'], r'^(\d+)(?=\.' + df['x'] + ')'))
 .dropna(subset=['z'])
 .loc[:, ['x', 'y', 'z']])
Answer 2 (score: 0)
Is this the way you want it? It almost replicates what you did in R:
>>> from numpy import vectorize
>>> from pipda import register_func
>>> from datar.all import f, tribble, filter, grepl, paste0, mutate, sub, as_numeric
[2021-06-24 17:27:16][datar][WARNING] Builtin name "filter" has been overriden by datar.
>>>
>>> df = tribble(
... f.x, f.y,
... "foo", "1.foo-ya",
... "bar", "2.bar-ga",
... "baz", "3.ha-baz",
... "qux", None
... )
>>>
>>> @register_func(None)
... @vectorize
... def str_detect(text, pattern):
... return grepl(pattern, text)
...
>>> @register_func(None)
... @vectorize
... def str_replace(text, pattern, replacement):
... return sub(pattern, replacement, text)
...
>>> df >> \
... filter(str_detect(f.y, paste0('^.+', f.x, '.+$'))) >> \
... mutate(z=as_numeric(str_replace(f.y, paste0(r'^(\d+)\.', f.x, '.+$'), r'\1')))
x y z
<object> <object> <float64>
0 foo 1.foo-ya 1.0
1 bar 2.bar-ga 2.0
Disclaimer: I am the author of the datar package.