String pattern matching and indexing between two columns - pandas

Time: 2020-07-08 15:19:19

Tags: python pandas

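I have a data frame with two text columns. The values in one column (say Col B) are substrings of the full string in the other column (say Col A). I want to find the pattern in each row and check for trends in the position of the substring or in the leading characters of the Col A string. So I want to generate three columns: one with the position of the substring, and two with the preceding and following characters.

Here is what the data frame looks like: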

| Col A       | Col B |
|-------------|-------|
| AGHXXXJ002  | XXX   |
| AGHGHJJ002  | GHJ   |
| ABCRTGHP001 | RTGH  |
| ABCDFFP01   | DFF   |
| ABCXGHJD09  | XGH   |

Based on the above pattern, I want to generate three columns:

| Col A       | Col B | Position | Preceding Chars | Following Chars |
|-------------|-------|----------|-----------------|-----------------|
| AGHXXXJ002  | XXX   | [3, 5]   | AGH             | J002            |
| AGHGHJJ002  | GHJ   | [3, 5]   | AGH             | J002            |
| ABCRTGHP001 | RTGH  | [3, 6]   | ABC             | P001            |
| ABCDFFP01   | DFFP  | [3, 5]   | ABC             | 01              |
| ABCXGHJD09  | XGH   | [3, 5]   | ABC             | D09             |
| HGMQQUTV01  | HGM   | [0, 2]   | NaN             | QQUTV01         |
| GBHUJJS099  | BHU   | [1, 3]   | G               | JJS099          |

(Position is [3, 5] in the first row because XXX starts at index 3 and ends at index 5.)

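This is my desired output. I tried using a for loop and slicing out the substring, but it never ran properly, so I deleted the code. So far I have been doing this manually, but there are more than 50,000 rows, so that is not feasible. Also, the Position column can be split into two separate columns, start position and end position.

3 answers:

Answer 0: (score: 1)

This may help you: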

>>> import re
>>> import pandas

>>> df = pandas.DataFrame([["AGHXXXJ002", "XXX"], ["AGHGHJJ002", "GHJ"], ["ABCRTGHP001", "RTGH"], ["ABCDFFP01", "DFF"], ["ABCXGHJD09", "XGH"]], columns=["Col A", "Col B"])
>>> df
         Col A Col B
0   AGHXXXJ002   XXX
1   AGHGHJJ002   GHJ
2  ABCRTGHP001  RTGH
3    ABCDFFP01   DFF
4   ABCXGHJD09   XGH

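>>> def get_position(row):
...     # re.escape treats Col B as literal text, not as a regex pattern
...     match = re.search(re.escape(row["Col B"]), row["Col A"])
...     if match:
...         return match.span()  # (start, end), where end is exclusive
...     return (-1, -1)  # substring not found
... 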
>>> df["Position"] = df.apply(get_position, axis=1)
>>> df
         Col A Col B Position
0   AGHXXXJ002   XXX   (3, 6)
1   AGHGHJJ002   GHJ   (3, 6)
2  ABCRTGHP001  RTGH   (3, 7)
3    ABCDFFP01   DFF   (3, 6)
4   ABCXGHJD09   XGH   (3, 6)

>>> def get_preceding(row):
...     if row["Position"][0] == -1:
...         return ""  # no match, so nothing precedes
...     return row["Col A"][:row["Position"][0]]
... 
>>> df["Preceding Chars"] = df.apply(get_preceding, axis=1)
>>> df
         Col A Col B Position Preceding Chars
0   AGHXXXJ002   XXX   (3, 6)             AGH
1   AGHGHJJ002   GHJ   (3, 6)             AGH
2  ABCRTGHP001  RTGH   (3, 7)             ABC
3    ABCDFFP01   DFF   (3, 6)             ABC
4   ABCXGHJD09   XGH   (3, 6)             ABC

>>> def get_following(row):
...     if row["Position"][1] == -1:
...         return ""  # no match, so nothing follows
...     return row["Col A"][row["Position"][1]:]
... 
>>> df["Following Chars"] = df.apply(get_following, axis=1)
>>> df
         Col A Col B Position Preceding Chars Following Chars
0   AGHXXXJ002   XXX   (3, 6)             AGH            J002
1   AGHGHJJ002   GHJ   (3, 6)             AGH            J002
2  ABCRTGHP001  RTGH   (3, 7)             ABC            P001
3    ABCDFFP01   DFF   (3, 6)             ABC             P01
4   ABCXGHJD09   XGH   (3, 6)             ABC            JD09
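
Note that match.span() returns an end-exclusive pair, so (3, 6) above corresponds to the [3, 5] asked for in the question. The following is a minimal sketch of the same idea with inclusive end positions, split into separate start and end columns as the question allows. It assumes Col B should be matched as literal text (hence re.escape); the helper name locate, the column names Start and End, and the extra sample rows are hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame(
    {
        "Col A": ["AGHXXXJ002", "HGMQQUTV01", "A.B+C", "ABC"],
        "Col B": ["XXX", "HGM", ".B+", "ZZ"],
    }
)

def locate(text, sub):
    """Hypothetical helper: inclusive (start, end) of sub in text, or (-1, -1)."""
    match = re.search(re.escape(sub), text)  # escape: treat sub as literal text
    if match is None:
        return (-1, -1)
    start, end = match.span()  # span() end is exclusive
    return (start, end - 1)

spans = [locate(a, b) for a, b in zip(df["Col A"], df["Col B"])]
df["Start"] = [s for s, _ in spans]
df["End"] = [e for _, e in spans]
print(df)
```

Iterating over zip of the two columns avoids a row-wise apply, which may matter with more than 50,000 rows.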

Answer 1: (score: 0)

# Prepare test data
import pandas as pd

dct = {'Col A': {0: 'AGHXXXJ002',
  1: 'AGHGHJJ002',
  2: 'ABCRTGHP001',
  3: 'ABCDFFP01',
  4: 'ABCXGHJD09'},
 'Col B': {0: 'XXX', 1: 'GHJ', 2: 'RTGH', 3: 'DFF', 4: 'XGH'}}

df = pd.DataFrame.from_dict(dct)

# Split each Col A value on its Col B substring (first occurrence only),
# e.g. 'AGHXXXJ002'.split('XXX', 1) -> ['AGH', 'J002']
tmp_lst = [a.split(b, 1) for a, b in zip(df['Col A'], df['Col B'])]
df['Preceding Chars'] = [c[0] for c in tmp_lst]   # text before the substring
df['Following Chars'] = [c[1] for c in tmp_lst]   # text after the substring
# Inclusive [start, end]: start = len(preceding), end = start + len(Col B) - 1
df['Position'] = [[len(p), len(p) + len(b) - 1] for p, b in zip(df['Preceding Chars'], df['Col B'])]

df
Out[1]:

    Col A       Col B   Preceding Chars Following Chars Position
0   AGHXXXJ002  XXX     AGH             J002            [3, 5]
1   AGHGHJJ002  GHJ     AGH             J002            [3, 5]
2   ABCRTGHP001 RTGH    ABC             P001            [3, 6]
3   ABCDFFP01   DFF     ABC             P01             [3, 5]
4   ABCXGHJD09  XGH     ABC             JD09            [3, 5]
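
One caveat with splitting: if a Col B value does not occur in its Col A string, split returns a single-element list and indexing c[1] raises an IndexError. A possible alternative, sketched here under the assumption that unmatched rows should get empty strings and -1 positions, is str.partition, which always returns three parts. The Start and End column names and the extra sample rows are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Col A": ["AGHXXXJ002", "ABCDFFP01", "XXAXXB", "ABC"],
        "Col B": ["XXX", "DFF", "XX", "ZZ"],
    }
)

# str.partition splits at the first occurrence only and always returns
# a 3-tuple (before, separator, after); separator is "" when not found.
parts = [a.partition(b) for a, b in zip(df["Col A"], df["Col B"])]

df["Preceding Chars"] = [before if sep else "" for before, sep, _ in parts]
df["Following Chars"] = [after for _, _, after in parts]
# Inclusive end index; -1 marks rows where Col B does not occur in Col A.
df["Start"] = [len(before) if sep else -1 for before, sep, _ in parts]
df["End"] = [len(before) + len(sep) - 1 if sep else -1 for before, sep, _ in parts]
print(df)
```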

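Answer 2: (score: 0)

When you are dealing with row-level operations and strings, there is no vectorized way to do this.

Let's use np.char.find (NumPy's counterpart to str.find) to create your data frame.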

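# Note: I've removed the spaces in your column names, so df is assumed
# to hold the question's data in columns named ColA and ColB.
import numpy as np
import pandas as pd

# Split each ColA value on its ColB substring: column 0 is the text before, column 1 the text after
s = pd.DataFrame(df.apply(lambda x: x['ColA'].split(x['ColB']), axis=1).tolist())
# Start index of ColB inside ColA, row by row
idx = df.apply(lambda x: np.char.find(x['ColA'], x['ColB']), 1)

# Pair each start index with the inclusive end index (start + length - 1)
pos = zip(idx.values, (idx - 1 + df["ColB"].str.len()).values)

df["Position"] = list(pos)
df['Preceding Chars'], df['Following Chars'] = s[0], s[1]

print(df)

          ColA  ColB Position Preceding Chars Following Chars
0   AGHXXXJ002   XXX   (3, 5)             AGH            J002
1   AGHGHJJ002   GHJ   (3, 5)             AGH            J002
2  ABCRTGHP001  RTGH   (3, 6)             ABC            P001
3    ABCDFFP01   DFF   (3, 5)             ABC             P01
4   ABCXGHJD09   XGH   (3, 5)             ABC            JD09
5   HGMQQUTV01   HGM   (0, 2)                         QQUTV01
6   GBHUJJS099   BHU   (1, 3)               G          JJS099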
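
In the output above, row 5 has an empty string for the preceding characters, whereas the question's desired output shows NaN there. A minimal sketch of one way to get NaN, assuming plain str.find per row is acceptable in place of np.char.find and using only two sample rows for brevity:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ColA": ["HGMQQUTV01", "GBHUJJS099"],
        "ColB": ["HGM", "BHU"],
    }
)

# Plain str.find per row; it returns -1 when ColB is absent from ColA.
start = [a.find(b) for a, b in zip(df["ColA"], df["ColB"])]
df["Preceding Chars"] = [a[:i] if i >= 0 else "" for a, i in zip(df["ColA"], start)]

# Series.mask replaces values where the condition is True with NaN by default,
# so empty strings become missing values.
df["Preceding Chars"] = df["Preceding Chars"].mask(df["Preceding Chars"] == "")
print(df)
```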