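For the dataframe s6 shown below, I need to:
1. Replace every cell in s6.iloc[:,4:] that contains a 0 with np.nan.
2. Replace every cell in s6.iloc[:,4:] whose trailing number is less than 5 with np.nan, where the value in each cell ends in _Q followed by a number.
So for this example dataframe: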
col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 |
f1 f2 f3 f4 110_q9 111_q4 110_q8 111_q9
The desired output looks like this:
col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 |
f1 f2 f3 f4 NaN NaN NaN 111_q9
I have tried many variations of the following, without success:
s6.iloc[:,4:][s6.iloc[:,4:].str.contains('0')] <- np.nan
s6.iloc[:,4:] = s6.iloc[:,4:].replace('*0*', np.nan)
s6.iloc[:,4:] = s6.iloc[:,4:].replace('0',np.nan)
s6.iloc[:,4:] = s6.iloc[:,4:].replace(0,np.nan)
s6 = [out[out[f].str.split('_Q', expand=True)[1].astype(int) > 5] for f in out.columns if f not in col_list]
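Solution: For the final solution to replacing cells that contain 0, I adapted the answer so that it also clears cells that have no quality score (uncalled bases). I am posting it here as an example of how to apply multiple filters to a subset of a pandas dataframe.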
import numpy as np
import pandas as pd
# self.input_csv and chunksize are defined elsewhere (not shown)
chunks = []
for chunk in pd.read_csv(self.input_csv, sep=',', header=0, chunksize=chunksize):
    # create id column from the values at positions 1, 2 and 3 of each row
    chunk["id"] = chunk.apply(lambda x: '{}_{}_{}'.format(x.iloc[1], x.iloc[2], x.iloc[3]), axis=1)
    chunk.set_index("id", drop=True, inplace=True)
    chunk.drop(["Features", "fov", "x", "y"], axis=1, inplace=True)
    # count and remove uncalled bases
    cols = list(chunk.columns)
    # coerce np array of strings to search
    A = chunk[cols].values.astype(str)
    # masks for uncalled bases on the vectorized array
    # (np.char.find is the public alias of np.core.defchararray.find)
    m1 = np.char.find(A, '0') != -1   # cell contains '0'
    m2 = np.char.find(A, '_Q') == -1  # cell has no '_Q' quality score
    # apply mask and return filtered columns to df
    chunk[cols] = np.where(m1 | m2, '', chunk[cols])
    chunks.append(chunk)
# merge chunks into one dataframe
csv = pd.concat(chunks, axis=0)
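The two masks can be tried in isolation on a small in-memory frame. The following is a minimal sketch, not the original pipeline: the column names, index labels and cell values are made up for illustration, and it uses np.char.find, the public alias of np.core.defchararray.find.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "0" marks an uncalled base,
# "_Q<n>" carries the quality score.
chunk = pd.DataFrame(
    {"c1": ["A_Q9", "0", "C_Q7"], "c2": ["G", "T_Q8", "A0_Q9"]},
    index=["id1", "id2", "id3"],
)

A = chunk.values.astype(str)
m1 = np.char.find(A, "0") != -1    # True where the cell contains "0"
m2 = np.char.find(A, "_Q") == -1   # True where the cell has no "_Q" score

# Blank out every cell that fails either filter.
filtered = pd.DataFrame(
    np.where(m1 | m2, "", A), index=chunk.index, columns=chunk.columns
)
print(filtered)
```

Only "A_Q9", "T_Q8" and "C_Q7" survive here; every other cell either contains a 0 or lacks a quality score and is replaced by an empty string.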
Answer 0 (score: 1)
Here is one way to do it, by looping over the columns of interest:
import io
import numpy as np
import pandas as pd
data = '''\
col1 col2 col3 col4 col5 col6 col7 col8
f1 f2 f3 f4 110_q9 111_q4 110_q8 111_q9'''
# pd.compat.StringIO was removed from pandas; io.StringIO replaces it
s6 = pd.read_csv(io.StringIO(data), sep=r'\s+')
for col in s6.columns[4:]:
    m1 = s6[col].str.contains('0')  # first mask: cell contains a 0
    m2 = s6[col].str[-3:].str.match('_q[0-4]')  # second mask: score below 5
    s6.loc[m1 | m2, col] = np.nan  # m1 or m2 --> np.nan
print(s6)
Returns:
col1 col2 col3 col4 col5 col6 col7 col8
0 f1 f2 f3 f4 NaN NaN NaN 111_q9
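If the cutoff needs to be adjustable rather than fixed in a character class, the score can be parsed as a number instead. This is a sketch under assumptions, not part of the original answer: the value 112_q12 in col7 and the `threshold` variable are hypothetical, and the pattern assumes the cell ends in `_q` or `_Q` followed by digits.

```python
import io
import pandas as pd

data = """\
col1 col2 col3 col4 col5 col6 col7 col8
f1 f2 f3 f4 110_q9 111_q4 112_q12 111_q9
"""
s6 = pd.read_csv(io.StringIO(data), sep=r"\s+")

threshold = 5                      # hypothetical cutoff for the quality score
cols = s6.columns[4:]
block = s6[cols].astype(str)

# Mask 1: the cell contains the character "0".
has_zero = block.apply(lambda c: c.str.contains("0", regex=False))

# Mask 2: the trailing number after _q/_Q is below the threshold.
score = block.apply(
    lambda c: pd.to_numeric(
        c.str.extract(r"_[qQ](\d+)$", expand=False), errors="coerce"
    )
)
low = score < threshold

# DataFrame.mask replaces cells where the condition is True with NaN.
s6[cols] = block.mask(has_zero | low)
print(s6)
```

Here col5 is dropped because it contains a 0, col6 because its score of 4 is below the cutoff, and col7 and col8 are kept.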
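Answer 1 (score: 1)
You can use numpy for a vectorized approach. Below is a minimal example. numpy.core.defchararray.find (also available as np.char.find) returns -1 if the specified substring is not found.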
import numpy as np
import pandas as pd
df = pd.DataFrame([['ASFA', 'ASFA0341', '34120'],
                   ['32432', 'SDAF', 'ADS0ADSF'],
                   ['DJKFA', '0SADFSA', 'DAFADF']])
cols = [1, 2]
A = df[cols].values.astype(str)
# np.char.find is the public alias of np.core.defchararray.find
mask = np.char.find(A, '0') != -1  # True where the cell contains '0'
df[cols] = np.where(mask, np.nan, df[cols])
print(df)
0 1 2
0 ASFA NaN NaN
1 32432 SDAF NaN
2 DJKFA NaN DAFADF
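To see why the mask compares against -1 rather than testing truthiness, it may help to look at the raw return values. A small sketch using np.char.find on three of the strings from the example above:

```python
import numpy as np

A = np.array(["ASFA0341", "SDAF", "0SADFSA"])

# find returns the index of the first match, or -1 when there is none.
idx = np.char.find(A, "0")
print(idx.tolist())          # [4, -1, 0]

# A match at position 0 is falsy as an integer, so compare with -1 explicitly.
mask = idx != -1
print(mask.tolist())         # [True, False, True]
```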