For example, given these two columns:
target read
AATGGCATC AATGGCATG
AATGATATA AAGGATATA
AATGATGTA CATGATGTA
I want to add a column:
target read differences
AATGGCATC AATGGCATG (C,G,8)
AATGATATA AAGGATATA (T,G,2)
AATGATGTA CATGATGTA (A,C,0)
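Answer 0 (score: 2)
Let's split each word into single characters (dropping the blank entries this produces) and build a stacked data frame. There we can use a cumulative count to number each character by position, remove every character that is duplicated between the two columns, and finally build our tuples.
The key functions here are explode, str.split, stack, and duplicated (the boolean-mask counterpart of drop_duplicates).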
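import numpy as np
import pandas as pd

# one row per character, indexed by (row label, column name)
s = (
    df.stack()
    .str.split("")
    .explode()
    .to_frame("words")
    .replace("", np.nan)  # blank entries produced by the split
    .dropna()
)
# position of each character within its own string
s['enum'] = s.groupby(level=[0, 1]).cumcount()

# keep=False drops every character that target and read share at the same position
df["diff"] = (
    s.reset_index(0)[
        ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)
    ]
    .groupby("level_0")
    .agg(words=("words", ",".join), pos=("enum", "first"))
    .agg(tuple, axis=1)
)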
print(df)
target read diff
0 AATGGCATC AATGGCATG (C,G, 8)
1 AATGATATA AAGGATATA (T,G, 2)
2 AATGATGTA CATGATGTA (A,C, 0)
print(s.reset_index(0)[
~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)])
level_0 words enum
target 0 C 8
read 0 G 8
target 1 T 2
read 1 G 2
target 2 A 0
read 2 C 0
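For anyone who wants to run the steps end to end, here is a minimal self-contained sketch of the same idea. It builds the example frame from the question and splits the strings with list() rather than an empty-pattern split; the names chars, flat, mismatches, row and col are only illustrative and do not come from the answer above. It assumes both strings in a row have the same length.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame(
    {
        "target": ["AATGGCATC", "AATGATATA", "AATGATGTA"],
        "read": ["AATGGCATG", "AAGGATATA", "CATGATGTA"],
    }
)

# One row per (row label, column name, character).
chars = (
    df[["target", "read"]]
    .stack()
    .apply(lambda text: list(text))  # "AAT" -> ["A", "A", "T"]
    .explode()
    .to_frame("words")
)
# Position of each character within its own string.
chars["enum"] = chars.groupby(level=[0, 1]).cumcount()

# Flatten the index so the row label becomes an ordinary column.
flat = chars.reset_index()
flat.columns = ["row", "col", "words", "enum"]

# keep=False flags every member of a duplicate group, so only characters
# that differ between target and read at a given position remain.
mismatches = flat[~flat.duplicated(subset=["row", "words", "enum"], keep=False)]

# Join the differing characters per row and keep the first position.
diff = mismatches.groupby("row").agg(
    words=("words", ",".join), pos=("enum", "first")
)
df["diff"] = diff.apply(tuple, axis=1)
print(df)
```

Rows without any mismatch do not appear in diff, so they end up as NaN in the new column after index alignment.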
Answer 1 (score: 1)
I think this simple function may help you (keep in mind that this is not a vectorized approach):
import difflib as dl

import pandas as pd

# create a data frame
# pass the columns as arguments to the function below
# df refers to the data frame
def differences(a, b):
    result = []
    for s, t in zip(a, b):
        # ndiff yields entries such as '  A', '- C', '+ G'
        diff = list(dl.ndiff(s.strip(), t.strip()))
        # characters from every entry that is not marked as unchanged
        temp = [x[2] for x in diff if x[0] != ' ']
        for x in diff:
            if x[0] == '-' or x[0] == '+':
                # index within the diff output; it equals the string
                # position only when nothing before it differs
                temp.append(diff.index(x))
        result.append(tuple(temp[:3]))
    return result

df['differences'] = differences(df['target'], df['read'])
print(df)
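If the two strings always have the same length and differ only by substitutions, as in the question's examples, a plain zip/enumerate loop may be enough and reports the position in the original string directly. This is a sketch under that assumption, not a replacement for difflib when insertions or deletions can occur; the function name first_substitution is hypothetical.

```python
def first_substitution(target, read):
    """Return (target_char, read_char, index) for the first mismatch, or None."""
    target, read = target.strip(), read.strip()
    # zip pairs characters by position and stops at the shorter string.
    for i, (t, r) in enumerate(zip(target, read)):
        if t != r:
            return (t, r, i)
    return None


# Example pairs copied from the question.
pairs = [
    ("AATGGCATC", "AATGGCATG"),
    ("AATGATATA", "AAGGATATA"),
    ("AATGATGTA", "CATGATGTA"),
]
result = [first_substitution(t, r) for t, r in pairs]
print(result)  # [('C', 'G', 8), ('T', 'G', 2), ('A', 'C', 0)]
```

With a data frame, the same function could be applied as df['differences'] = [first_substitution(t, r) for t, r in zip(df['target'], df['read'])].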