How do I split a string in one column into two columns that match a list?

Time: 2020-02-04 17:18:25

Tags: python pandas numpy

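How can I split the string in one column into two columns, matching exactly against a list and starting from the right? If there is no match, leave the "model" column empty.

List to compare against: ['PILOT', 'SRP637', '103', 'Mako', 'Kontiki', 'SKX007', 'Odyssey','Octo', 'Royal Oak Offshore']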

DataFrame

  brand
0 ARCHIMEDE PILOT
1 Seiko SRP637
2 Sinn 103
3 Orient Mako
4 Eterna Kontiki
5 Seiko SKX007
6 Boldr Odyssey
7 Bvlgari Octo
8 Aegir
9 Audemars Piguet Royal Oak Offshore

Split into this DataFrame:

  brand           model
0 ARCHIMEDE       PILOT
1 Seiko           SRP637
2 Sinn            103
3 Orient          Mako
4 Eterna          Kontiki
5 Seiko           SKX007
6 Boldr           Odyssey
7 Bvlgari         Octo
8 Aegir
9 Audemars Piguet Royal Oak Offshore

2 answers:

Answer 0 (score: 3)

MCVE

import pandas as pd
from io import StringIO

textfile = StringIO("""
   brand
0  ARCHIMEDE PILOT
1  Seiko SRP637
2  Sinn 103
3  Orient Mako
4  Eterna Kontiki
5  Seiko SKX007
6  Boldr Odyssey
7  Bvlgari Octo
8  Aegir
9  Audemars Piguet Royal Oak Offshore""")

df = pd.read_csv(textfile, sep=r'\s\s+', engine='python')

print("Input dataframe...\n")
print(df.to_markdown())

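# Models to look for, joined into a single regex alternation.
listcomp = ['PILOT', 'SRP637', '103', 'Mako', 'Kontiki', 'SKX007', 'Odyssey','Octo', 'Royal Oak Offshore']
regex = '|'.join(listcomp)
# Pull the matching model out into its own column (NaN when nothing matches).
df['model'] = df['brand'].str.extract(f'(?P<model>{regex})')
# Remove the model from brand; the default became regex=False in pandas 2.0.
df['brand'] = df['brand'].str.replace(regex, '', regex=True).str.strip()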
print("Output dataframe...\n")
print(df.to_markdown())

Output:

Input dataframe...

|    | brand                              |
|---:|:-----------------------------------|
|  0 | ARCHIMEDE PILOT                    |
|  1 | Seiko SRP637                       |
|  2 | Sinn 103                           |
|  3 | Orient Mako                        |
|  4 | Eterna Kontiki                     |
|  5 | Seiko SKX007                       |
|  6 | Boldr Odyssey                      |
|  7 | Bvlgari Octo                       |
|  8 | Aegir                              |
|  9 | Audemars Piguet Royal Oak Offshore |
Output dataframe...

|    | brand           | model              |
|---:|:----------------|:-------------------|
|  0 | ARCHIMEDE       | PILOT              |
|  1 | Seiko           | SRP637             |
|  2 | Sinn            | 103                |
|  3 | Orient          | Mako               |
|  4 | Eterna          | Kontiki            |
|  5 | Seiko           | SKX007             |
|  6 | Boldr           | Odyssey            |
|  7 | Bvlgari         | Octo               |
|  8 | Aegir           | nan                |
|  9 | Audemars Piguet | Royal Oak Offshore |

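Option 1:

Using pandas, first split on the space with .str.split, then use where and isin. Note that this assumes each row contains at most two space-separated words, as in the original data.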

listcomp = ['PILOT', 'SRP637', '103', 'Mako', 'Kontiki', 'SKX007', 'Odyssey','Octo']
df_out = df['brand'].str.split(' ', expand=True).set_axis(['brand', 'model'], axis=1)
df_out['model'] = df_out['model'].where(df_out['model'].isin(listcomp))
df_out

Output:

|    | brand     | model   |
|---:|:----------|:--------|
|  0 | ARCHIMEDE | PILOT   |
|  1 | Seiko     | SRP637  |
|  2 | Sinn      | 103     |
|  3 | Orient    | Mako    |
|  4 | Eterna    | Kontiki |
|  5 | Seiko     | SKX007  |
|  6 | Boldr     | Odyssey |
|  7 | Bvlgari   | Octo    |
|  8 | Aegir     | nan     |
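
The where/isin step in Option 1 can be looked at in isolation. The following is a minimal sketch, assuming a hypothetical Series named candidates that stands in for the second column produced by the split; the values and the allowed list are made up for illustration.

```python
import pandas as pd

# Hypothetical second-column values, as they might look after the split.
candidates = pd.Series(['PILOT', 'SRP637', None, 'Unknown'])
allowed = ['PILOT', 'SRP637', '103']

# isin() builds a boolean mask that is True only for values found in the list.
mask = candidates.isin(allowed)

# where() keeps values where the mask is True and puts NaN everywhere else.
model = candidates.where(mask)

print(mask.tolist())
print(model.tolist())
```

Values that are missing or not in the list end up as NaN, which is how the model column stays empty for a row such as Aegir.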

Option 2

Use .str.extract with named groups.

listcomp = ['PILOT', 'SRP637', '103', 'Mako', 'Kontiki', 'SKX007', 'Odyssey','Octo']
regex = f'{"|".join(listcomp)}'
df['brand'].str.extract(rf'(?P<brand>\w+)\s?(?P<model>{regex})?')

Output:

|    | brand     | model   |
|---:|:----------|:--------|
|  0 | ARCHIMEDE | PILOT   |
|  1 | Seiko     | SRP637  |
|  2 | Sinn      | 103     |
|  3 | Orient    | Mako    |
|  4 | Eterna    | Kontiki |
|  5 | Seiko     | SKX007  |
|  6 | Boldr     | Odyssey |
|  7 | Bvlgari   | Octo    |
|  8 | Aegir     | nan     |
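
To see what the named groups in Option 2 capture, the same pattern can be tried with Python's re module on single strings. This is only a sketch: the helper parse and the shortened model list are hypothetical, and re.escape is added here as a precaution that the original pattern does not use.

```python
import re

# Shortened, illustrative model list (not the full list from the question).
models = ['PILOT', 'SRP637', '103', 'Mako']
alternation = '|'.join(re.escape(m) for m in models)

# Same shape as the Option 2 pattern: one word for the brand,
# an optional space, then an optional model from the list.
pattern = re.compile(rf'(?P<brand>\w+)\s?(?P<model>{alternation})?')

def parse(text):
    """Return the named groups of the first match as a dict."""
    match = pattern.search(text)
    return match.groupdict() if match else {'brand': None, 'model': None}

print(parse('Seiko SRP637'))
print(parse('Aegir'))
print(parse('Audemars Piguet Mako'))
```

Because the brand group is a single \w+ word, a multi-word brand such as the hypothetical 'Audemars Piguet Mako' yields only 'Audemars' with no model, which is why this pattern suits the original two-word data rather than the updated data.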

Option 3 (updated for the changed question and data)

listcomp = ['PILOT', 'SRP637', '103', 'Mako', 'Kontiki', 'SKX007', 'Odyssey','Octo', 'Royal Oak Offshore']
regex = f'{"|".join(listcomp)}'
df['model'] = df['brand'].str.extract(f'(?P<model>{regex})')
df['brand'] = df['brand'].str.replace(regex, '', regex=True).str.strip()
df

Output:

|    | brand           | model              |
|---:|:----------------|:-------------------|
|  0 | ARCHIMEDE       | PILOT              |
|  1 | Seiko           | SRP637             |
|  2 | Sinn            | 103                |
|  3 | Orient          | Mako               |
|  4 | Eterna          | Kontiki            |
|  5 | Seiko           | SKX007             |
|  6 | Boldr           | Odyssey            |
|  7 | Bvlgari         | Octo               |
|  8 | Aegir           | nan                |
|  9 | Audemars Piguet | Royal Oak Offshore |
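
The question asks for an exact match starting from the right, while the Option 3 pattern matches a model anywhere in the string. One possible variation, sketched here as an assumption rather than as part of the original answer, anchors the alternation to the end of the string and escapes each list entry. The row 'Octo Finissimo' is a hypothetical input added to show the difference.

```python
import re
import pandas as pd

df = pd.DataFrame({'brand': ['Seiko SRP637', 'Aegir',
                             'Audemars Piguet Royal Oak Offshore',
                             'Octo Finissimo']})  # last row is hypothetical
models = ['SRP637', 'Octo', 'Royal Oak Offshore']

# Escape each entry so regex metacharacters in a model name are taken literally.
alternation = '|'.join(re.escape(m) for m in models)

# The model must sit at the right end of the string, preceded either by
# the start of the string or by whitespace.
pattern = rf'(?:^|\s+)({alternation})$'

# Extract first, then strip the matched tail (and its leading space) from brand.
df['model'] = df['brand'].str.extract(pattern, expand=False)
df['brand'] = df['brand'].str.replace(pattern, '', regex=True)
print(df)
```

With this pattern 'Octo Finissimo' is left untouched, because 'Octo' does not appear at the right end, whereas the unanchored pattern would pull 'Octo' out of the middle of the string.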

Answer 1 (score: 0)

If I understand correctly, you want to do this:

df['model'] = df['brand'].apply(lambda x: x.split(' ')[1])

This takes each row of brand, splits it on spaces, and uses the second element as the new column.
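
Indexing with [1] raises an IndexError for a single-word value such as Aegir, and the result is not checked against the list. A guarded variant might look like the sketch below; the helper name second_token is hypothetical.

```python
def second_token(text):
    """Return the second space-separated token, or None if there is none."""
    parts = text.split(' ')
    return parts[1] if len(parts) > 1 else None

print(second_token('Seiko SRP637'))
print(second_token('Aegir'))
print(second_token('Audemars Piguet Royal Oak Offshore'))
```

It could be applied with df['brand'].apply(second_token). For the last row it returns 'Piguet', so on its own this approach does not handle multi-word brands or models.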
