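I have a dataframe with two text columns. The values in one column (say Col B) are substrings of the full strings in the other column (say Col A). I want to look for patterns in each row and check for trends in where the substring sits within the Col A string and which letters it starts with. To do that, I want to generate three columns: one for the position of the substring, and two more for the preceding and following characters.
Here is what the dataframe looks like: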
| Col A | Col B |
----------------------
AGHXXXJ002 | XXX |
AGHGHJJ002 | GHJ |
ABCRTGHP001 | RTGH |
ABCDFFP01 | DFF |
ABCXGHJD09 | XGH |
Now, based on the pattern above, I want to generate three columns:
| Col A | Col B | Position | Preceding Chars | Following Chars |
-------------------------------------------------------------------------------------
AGHXXXJ002 | XXX | [3, 5] | AGH | J002 |
(Because XXX starts at index 3 and ends at 5) | | |
AGHGHJJ002 | GHJ | [3, 5] | AGH | J002 |
ABCRTGHP001 | RTGH | [3, 6] | ABC | P001 |
ABCDFFP01 | DFF | [3, 5] | ABC | P01 |
ABCXGHJD09 | XGH | [3, 5] | ABC | JD09 |
HGMQQUTV01 | HGM | [0, 2] | NaN | QQUTV01 |
GBHUJJS099 | BHU | [1, 3] | G | JJS099 |
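This is the output I want. I tried using a for loop and stripping out the substring, but it never worked, so I deleted the code. So far I have been doing this manually, but there are more than 50,000 rows, so that is not feasible. Also, the Position column could be split into two separate columns, a start position and an end position.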
Answer 0 (score: 1)
This may help you:
>>> import re
>>> import pandas
>>> df = pandas.DataFrame([["AGHXXXJ002", "XXX"], ["AGHGHJJ002", "GHJ"], ["ABCRTGHP001", "RTGH"], ["ABCDFFP01", "DFF"], ["ABCXGHJD09", "XGH"]], columns=["Col A", "Col B"])
>>> df
Col A Col B
0 AGHXXXJ002 XXX
1 AGHGHJJ002 GHJ
2 ABCRTGHP001 RTGH
3 ABCDFFP01 DFF
4 ABCXGHJD09 XGH
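>>> def get_position(row):
...     # Escape Col B so that regex metacharacters in it are matched literally.
...     match = re.search(re.escape(row["Col B"]), row["Col A"])
...     if match:
...         # span() returns (start, end), where end is exclusive.
...         return match.span()
...     else:
...         return (-1, -1)
...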
>>> df["Position"] = df.apply(get_position, axis=1)
>>> df
Col A Col B Position
0 AGHXXXJ002 XXX (3, 6)
1 AGHGHJJ002 GHJ (3, 6)
2 ABCRTGHP001 RTGH (3, 7)
3 ABCDFFP01 DFF (3, 6)
4 ABCXGHJD09 XGH (3, 6)
>>> def get_preceding(row):
...     if row["Position"][0] == -1:
...         return ""
...     return row["Col A"][:row["Position"][0]]
...
>>> df["Preceding Chars"] = df.apply(get_preceding, axis=1)
>>> df
Col A Col B Position Preceding Chars
0 AGHXXXJ002 XXX (3, 6) AGH
1 AGHGHJJ002 GHJ (3, 6) AGH
2 ABCRTGHP001 RTGH (3, 7) ABC
3 ABCDFFP01 DFF (3, 6) ABC
4 ABCXGHJD09 XGH (3, 6) ABC
>>> def get_following(row):
...     if row["Position"][1] == -1:
...         return ""
...     return row["Col A"][row["Position"][1]:]
...
>>> df["Following Chars"] = df.apply(get_following, axis=1)
>>> df
Col A Col B Position Preceding Chars Following Chars
0 AGHXXXJ002 XXX (3, 6) AGH J002
1 AGHGHJJ002 GHJ (3, 6) AGH J002
2 ABCRTGHP001 RTGH (3, 7) ABC P001
3 ABCDFFP01 DFF (3, 6) ABC P01
4 ABCXGHJD09 XGH (3, 6) ABC JD09
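Note that `match.span()` reports an exclusive end index, so the positions above read (3, 6) where the question expects [3, 5]. A minimal sketch of one way to get inclusive ends in separate Start and End columns follows. The helper name `inclusive_span`, the extra sample rows (one containing regex metacharacters, one with no match) and the -1 sentinel for "not found" are assumptions made for illustration, not part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "Col A": ["AGHXXXJ002", "ABCRTGHP001", "AB.C+D01", "NOMATCH01"],
        "Col B": ["XXX", "RTGH", ".C+", "ZZZ"],
    }
)


def inclusive_span(row):
    """Return (start, inclusive_end) of Col B inside Col A, or (-1, -1)."""
    # re.escape makes characters such as "." and "+" match literally.
    match = re.search(re.escape(row["Col B"]), row["Col A"])
    if match is None:
        return (-1, -1)
    start, end = match.span()  # end is exclusive
    return (start, end - 1)    # convert to an inclusive end


spans = df.apply(inclusive_span, axis=1)
df["Start"] = [start for start, _ in spans]
df["End"] = [end for _, end in spans]
print(df)
```

Without `re.escape`, the pattern `.C+` would be read as "any character followed by one or more C" and would report a different end position for the third row.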
Answer 1 (score: 0)
import pandas as pd

# Prepare test data
dct = {'Col A': {0: 'AGHXXXJ002',
                 1: 'AGHGHJJ002',
                 2: 'ABCRTGHP001',
                 3: 'ABCDFFP01',
                 4: 'ABCXGHJD09'},
       'Col B': {0: 'XXX', 1: 'GHJ', 2: 'RTGH', 3: 'DFF', 4: 'XGH'}}
df = pd.DataFrame.from_dict(dct)
tmp_lst = [x[0].split(x[1]) for x in zip(df['Col A'],df['Col B'])] # prepare temporary list with items: 'AGHXXXJ002'.split('XXX') -> [['AGH','J002'],.....]
df['Preceding Chars'] = [c[0] for c in tmp_lst] # get first element ['AGH','J002'][0] -> 'AGH'
df['Following Chars'] = [c[1] for c in tmp_lst] # get second element ['AGH','J002'][1] -> 'J002'
df['Position'] = [[len(i[0]), len(i[0])+ len(i[1])-1] for i in zip(df['Preceding Chars'], df['Col B'])]
df
Out[1]:
Col A Col B Preceding Chars Following Chars Position
0 AGHXXXJ002 XXX AGH J002 [3, 5]
1 AGHGHJJ002 GHJ AGH J002 [3, 5]
2 ABCRTGHP001 RTGH ABC P001 [3, 6]
3 ABCDFFP01 DFF ABC P01 [3, 5]
4 ABCXGHJD09 XGH ABC JD09 [3, 5]
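One caveat with `split`: if the substring occurs more than once in Col A, the result has more than two pieces, and if it does not occur at all, indexing `c[1]` raises an `IndexError`. A hedged sketch using `str.partition`, which always returns exactly three parts and splits on the first occurrence only, is shown below. The sample rows and the `Found` column are assumptions added for illustration, and the sketch assumes Col B is never an empty string (`partition` raises `ValueError` for an empty separator).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Col A": ["AGHXXXJ002", "ABCXGHJD09", "GHJAGHJ01", "NOMATCH01"],
        "Col B": ["XXX", "XGH", "GHJ", "ZZZ"],
    }
)

# partition() returns (before, separator, after) for the FIRST occurrence.
# When the separator is absent it returns (text, "", "").
parts = [a.partition(b) for a, b in zip(df["Col A"], df["Col B"])]

# Flag whether the substring was found, so that an empty prefix
# (match at index 0) can be told apart from "no match".
df["Found"] = [bool(sep) for _, sep, _ in parts]
df["Preceding Chars"] = [before if sep else "" for before, sep, _ in parts]
df["Following Chars"] = [after for _, _, after in parts]
print(df)
```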
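Answer 2 (score: 0)
Because you are dealing with row-level operations on strings, there is no vectorized way to do this. Let's use str.find and np.char.find to create the dataframe.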
import numpy as np
import pandas as pd

# Note: I've removed the spaces from your column names.
s = pd.DataFrame(df.apply(lambda x : x['ColA'].split(x['ColB']),axis=1).tolist())
idx = df.apply(lambda x : np.char.find(x['ColA'],x['ColB']),1)
pos = zip(idx.values, (idx - 1 + df["ColB"].str.len()).values)
df["Position"] = list(pos)
df['Preceding Chars'], df['Following Chars'] = s[0], s[1]
print(df)
ColA ColB Position Preceding Chars Following Chars
0 AGHXXXJ002 XXX (3, 5) AGH J002
1 AGHGHJJ002 GHJ (3, 5) AGH J002
2 ABCRTGHP001 RTGH (3, 6) ABC P001
3 ABCDFFP01 DFF (3, 5) ABC P01
4 ABCXGHJD09 XGH (3, 5) ABC JD09
5 HGMQQUTV01 HGM (0, 2) QQUTV01
6 GBHUJJS099 BHU (1, 3) G JJS099
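The question's desired output shows NaN rather than an empty string when nothing precedes the substring, and mentions splitting the position into start and end columns. A minimal sketch of doing both in a single pass over the rows is given below. It uses plain `str.find` in a list comprehension rather than `np.char.find`, and the helper name `describe` and the `result` frame are hypothetical names chosen for illustration. If some rows had no match, the Start and End columns would contain NaN and therefore become floats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ColA": ["AGHXXXJ002", "HGMQQUTV01", "GBHUJJS099"],
        "ColB": ["XXX", "HGM", "BHU"],
    }
)


def describe(text, sub):
    """Return (start, inclusive_end, preceding, following) for sub in text."""
    start = text.find(sub)
    if start == -1:
        # Substring not found: leave every derived field missing.
        return (np.nan, np.nan, np.nan, np.nan)
    end = start + len(sub) - 1
    return (start, end, text[:start], text[end + 1:])


rows = [describe(a, b) for a, b in zip(df["ColA"], df["ColB"])]
extra = pd.DataFrame(
    rows,
    columns=["Start", "End", "Preceding Chars", "Following Chars"],
    index=df.index,
)
result = pd.concat([df, extra], axis=1)

# Show an empty prefix as NaN, as in the question's desired output.
result["Preceding Chars"] = result["Preceding Chars"].mask(
    result["Preceding Chars"] == ""
)
print(result)
```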