Question

Find the target word and the word before it in col_a, and append the matched strings in the col_b_PY and col_c_LG columns

I have tried the code below to achieve this, but I am not able to get the expected output. Any help is appreciated.
Here is my approach with regular expressions:

df['col_b_PY'] = df.col_a.str.contains(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY")

df.col_a.str.extract(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}PY",expand=True)

The dataframe looks like this:

col_a

Python PY is a general-purpose language LG

Programming language LG in Python PY 

Its easier LG to understand  PY

The syntax of the language LG is clean PY

Desired output:

col_a                                       col_b_PY       col_c_LG
Python PY is a general-purpose language LG  Python PY      language LG
Programming language LG in Python PY        Python PY      language LG
Its easier LG to understand  PY             understand PY  easier LG
The syntax of the language LG is clean PY   clean  PY      language LG

Answer 1

You can use

df['col_b_PY'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+PY)\b")
df['col_c_LG'] = df['col_a'].str.extract(r"([a-zA-Z'-]+\s+LG)\b")

Or, to extract all matches and join them with a space:

df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
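
As a side note, here is a minimal sketch of a different route to the same joined string, using str.findall followed by str.join instead of extractall/unstack. This is not the approach above, just an alternative under the assumption that an empty string is acceptable for rows with no match (the extractall version leaves NaN there). The sample rows and the name pattern_py are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col_a": [
    "Python PY is a general purpose PY language LG",
    "no tags here",
]})

# Same pattern as above: one word, whitespace, then PY as a whole word.
pattern_py = r"([a-zA-Z'-]+\s+PY)\b"

# findall returns a list of captured strings per row;
# str.join glues each list together with a single space.
df["col_b_PY"] = df["col_a"].str.findall(pattern_py).str.join(" ")

print(df)
```

Rows without any match end up as an empty string here, so replace those with NaN afterwards if you need the same result as the extractall version.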

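Note that you need to use a capturing group in the regex pattern so that extract can actually extract the text:

Extract capture groups in the regex pat as columns in a DataFrame.

Note that the \b word boundary is necessary to match PY / LG as a whole word.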
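
To make the two notes above concrete, here is a small sketch on a made-up Series (the sample strings and variable names are only for illustration). It shows what the trailing \b changes and what happens when the pattern has no capturing group.

```python
import pandas as pd

s = pd.Series(["Python PY is fun", "Python PYTHON again"])

# With a capturing group, extract returns the captured text.
# The trailing \b stops "PY" from matching inside "PYTHON".
with_boundary = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

# Without \b, the second row matches the "PY" prefix of "PYTHON".
without_boundary = s.str.extract(r"([a-zA-Z'-]+\s+PY)", expand=False)

# A pattern with no capturing group makes extract raise ValueError.
try:
    s.str.extract(r"[a-zA-Z'-]+\s+PY\b")
    no_group_error = None
except ValueError as exc:
    no_group_error = str(exc)

print(with_boundary.tolist())
print(without_boundary.tolist())
print(no_group_error)
```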

Also, if you only want a match to start with a letter, you may revise the pattern to

r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b"
r"([a-zA-Z][a-zA-Z'-]*\s+LG)\b"
   ^^^^^^^^          ^

where [a-zA-Z] will match a letter and [a-zA-Z'-]* will match 0 or more letters, apostrophes or hyphens.
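
A quick sketch of the difference between the two patterns, on made-up strings (the names loose and strict are only for illustration). The original pattern accepts a "word" made of hyphens or a leading apostrophe, while the revised one needs a letter first.

```python
import pandas as pd

s = pd.Series(["well-known PY", "'quoted' PY", "value -- PY"])

# Original pattern: the word may start with a letter, apostrophe or hyphen.
loose = s.str.extract(r"([a-zA-Z'-]+\s+PY)\b", expand=False)

# Revised pattern: the word must start with a letter.
strict = s.str.extract(r"([a-zA-Z][a-zA-Z'-]*\s+PY)\b", expand=False)

print(loose.tolist())   # third row matches "-- PY"
print(strict.tolist())  # third row has no match (NaN)
```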

Python 3.7 and Pandas 0.24.2 test:

import pandas as pd

# Widen the console display so all columns are printed
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 500)

df = pd.DataFrame({
    'col_a': ['Python PY is a general-purpose language LG',
             'Programming language LG in Python PY',
             'Its easier LG to understand  PY',
             'The syntax of the language LG is clean PY',
             'Python PY is a general purpose PY language LG']
    })
df['col_b_PY'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+PY)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)
df['col_c_LG'] = df['col_a'].str.extractall(r"([a-zA-Z'-]+\s+LG)\b").unstack().apply(lambda x:' '.join(x.dropna()), axis=1)

Output:

                                           col_a              col_b_PY     col_c_LG
0     Python PY is a general-purpose language LG             Python PY  language LG
1           Programming language LG in Python PY             Python PY  language LG
2                Its easier LG to understand  PY        understand  PY    easier LG
3      The syntax of the language LG is clean PY              clean PY  language LG
4  Python PY is a general purpose PY language LG  Python PY purpose PY  language LG

Answer 2

Check

df['col_c_LG'],df['col_c_PY']=df['col_a'].str.extract(r"(\w+\s+LG)"),df['col_a'].str.extract(r"(\w+\s+PY)")
df
Out[474]: 
                                        col_a       ...              col_c_PY
0  Python PY is a general-purpose language LG       ...             Python PY
1       Programming language LG in Python PY        ...             Python PY
2             Its easier LG to understand  PY       ...        understand  PY
3   The syntax of the language LG is clean PY       ...              clean PY
[4 rows x 3 columns]
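
One thing to keep in mind with \w+ here: it matches letters, digits and underscores, but not hyphens or apostrophes, and this pattern has no trailing \b. A small sketch on made-up strings (variable names are only for illustration) comparing it with the character class from Answer 1:

```python
import pandas as pd

s = pd.Series(["a general-purpose LG", "version2 LG"])

# \w+ stops at the hyphen, so only "purpose" is kept; digits are allowed.
word_class = s.str.extract(r"(\w+\s+LG)", expand=False)

# The explicit class keeps hyphenated words but rejects digits.
letter_class = s.str.extract(r"([a-zA-Z'-]+\s+LG)\b", expand=False)

print(word_class.tolist())
print(letter_class.tolist())
```

Which one fits depends on the data: for the sample rows in the question both give the same result.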

PANDAS: finding an exact word and the word before it in a string column, and appending the result as a new column in Python (pandas)
