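I am currently converting my data analysis pipeline from a wide format to a tidy/long format, and I have run into a filtering problem that I cannot get my head around.
My (simplified) data looks like this (microscopy intensity data): for each measurement in a group I have several regions of interest (= roi), whose intensity (= value) I follow over multiple time points.
A roi is basically a single cell in a microscopy image. I track its intensity (= value) over time (= timepoint). I repeat the experiment several times (= measurement), each time looking at several cells (= roi).
My goal is to filter out, for all time points, the rois whose intensity at timepoint 0 is above a threshold I set (I consider these rois to be pre-activated).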
import pandas as pd

data = { "timepoint": [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
"measurement": [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3],
"roi":[1,1,1,1,2,2,2,2,3,3,3,3,1,1,1,1,1,1,1,1,2,2,2,2],
"value":[0.1,0.2,0.3,0.4,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,0.1,0.2,0.3,0.4],
"group": "control"
}
df = pd.DataFrame(data)
df
returns
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
8 0 1 3 0.5 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Now I can select the rows of the rois whose value at timepoint 0 is above my threshold
threshold = 0.4
pre_activated = df.loc[(df['timepoint'] == 0) & (df['value'] > threshold)]
pre_activated
returns
timepoint measurement roi value group
8 0 1 3 0.5 control
16 0 3 1 0.5 control
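Now I would like to filter those cells (e.g. measurement 1, roi 3) out of the original dataframe df for all time points 0 to 3, and this is where I am stuck.
If I use .isin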
df.loc[~(df['measurement'].isin(pre_activated["measurement"]) & df['roi'].isin(pre_activated["roi"]))]
I get close, but everything for the measurement 1 / roi 1 pair is missing as well (so I assume the problem is in the conditional expression)
timepoint measurement roi value group
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
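The two separate .isin calls test each column independently, so every row whose measurement is in {1, 3} and whose roi is in {1, 3} is dropped, including the pair (1, 1), which was never pre-activated. A minimal sketch of a pair-wise test, assuming the column names from the example above; the reduced frame and the names pairs, keys and is_pre_activated are illustrative and not from the original post:

```python
import pandas as pd

# Reduced version of the example: one timepoint-0 row per (measurement, roi).
df = pd.DataFrame({
    "timepoint":   [0, 0, 0, 0, 0, 0],
    "measurement": [1, 1, 1, 2, 3, 3],
    "roi":         [1, 2, 3, 1, 1, 2],
    "value":       [0.1, 0.1, 0.5, 0.1, 0.5, 0.1],
})
threshold = 0.4
pre_activated = df[(df["timepoint"] == 0) & (df["value"] > threshold)]

# Column-wise test: matches every combination of the listed values.
column_wise = (df["measurement"].isin(pre_activated["measurement"])
               & df["roi"].isin(pre_activated["roi"]))

# Pair-wise test: compare (measurement, roi) tuples as a unit.
pairs = list(zip(pre_activated["measurement"], pre_activated["roi"]))
keys = pd.MultiIndex.from_frame(df[["measurement", "roi"]])
is_pre_activated = keys.isin(pairs)

df_filtered = df[~is_pre_activated]
```

Here column_wise flags the pair (1, 1) although only (1, 3) and (3, 1) exceed the threshold, while is_pre_activated flags exactly those two pairs.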
I know I can use .query for at least one measurement/roi pair
df[~df.isin(df.query('measurement == 1 & roi == 3'))]
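, which gets a bit closer, although all integers are converted to floats. Also, the group column now contains NaN, and this becomes difficult once each dataframe has multiple groups with several measurements and rois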
timepoint measurement roi value group
0 0.0 1.0 1.0 0.1 control
1 1.0 1.0 1.0 0.2 control
2 2.0 1.0 1.0 0.3 control
3 3.0 1.0 1.0 0.4 control
4 0.0 1.0 2.0 0.1 control
5 1.0 1.0 2.0 0.2 control
6 2.0 1.0 2.0 0.3 control
7 3.0 1.0 2.0 0.4 control
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 0.0 2.0 1.0 0.1 control
13 1.0 2.0 1.0 0.2 control
14 2.0 2.0 1.0 0.3 control
15 3.0 2.0 1.0 0.4 control
16 0.0 3.0 1.0 0.5 control
17 1.0 3.0 1.0 0.6 control
18 2.0 3.0 1.0 0.8 control
19 3.0 3.0 1.0 0.9 control
20 0.0 3.0 2.0 0.1 control
21 1.0 3.0 2.0 0.2 control
22 2.0 3.0 2.0 0.3 control
23 3.0 3.0 2.0 0.4 control
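Passing a DataFrame to DataFrame.isin compares cell by cell at matching index and column labels, and indexing with the resulting boolean frame masks cells instead of dropping rows, which would explain the NaN rows and the int-to-float conversion above. A small sketch with made-up data (the frame demo and the names masked and dropped are illustrative only), next to a row-level alternative based on the index:

```python
import pandas as pd

demo = pd.DataFrame({"roi": [1, 2, 3], "group": ["control"] * 3})
subset = demo.query("roi == 2")

# Cell-wise masking: the matching row is kept but filled with NaN,
# which forces the integer column to float.
masked = demo[~demo.isin(subset)]

# Row-wise filtering on the index labels: the row is removed, dtypes stay.
dropped = demo[~demo.index.isin(subset.index)]
```

The index-based version only works for hand-picked subsets of the same frame; it does not by itself find all time points of a pre-activated cell.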
I tried using a dict that stores measurement:roi pairs to avoid any mix-up, but I do not really know whether this is useful:
msmt_list = pre_activated["measurement"].values
roi_list = pre_activated["roi"].values
mydict={}
for i in range(len(msmt_list)):
mydict[msmt_list[i]]=roi_list[i]
Output:
mydict
{1: 3, 3: 1}
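One caveat with a plain dict keyed by measurement: if a measurement has more than one pre-activated roi, later entries overwrite earlier ones. A sketch of the difference, using hypothetical pairs that do not occur in the example data; a set of (measurement, roi) tuples keeps every pair:

```python
# Hypothetical pre-activated pairs: measurement 1 has two such rois.
msmt_list = [1, 1, 3]
roi_list = [3, 4, 1]

# Dict keyed by measurement: (1, 3) is overwritten by (1, 4).
mydict = {}
for m, r in zip(msmt_list, roi_list):
    mydict[m] = r

# Set of tuples: every (measurement, roi) pair survives.
pair_set = set(zip(msmt_list, roi_list))

print(mydict)            # {1: 4, 3: 1}
print(sorted(pair_set))  # [(1, 3), (1, 4), (3, 1)]
```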
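What is the best way to achieve what I want? I would appreciate any input, also regarding efficiency, since I usually work with 3-4 groups with 4-8 measurements, each with up to 200 rois and usually 360 time points.
Thanks!
Edit: just to clarify what my desired output dataframes should look like:
df_pre_activated (i.e. the rois whose value at timepoint 0 is above my threshold):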
timepoint measurement roi value group
8 0 1 3 0.5 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
df_filtered (basically the initial df without the data in df_pre_activated shown above):
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Answer 0 (score: 2)
The solution is as follows:
First, we compute df_pre_activated_t0 by filtering df with the condition:
threshold = 0.4
df_pre_activated_t0 = df[(df['timepoint'] == 0) & (df['value'] > threshold)]
df_pre_activated_t0
looks like this:
timepoint measurement roi value group
8 0 1 3 0.5 control
16 0 3 1 0.5 control
We compute df_pre_activated by merging df and df_pre_activated_t0 (inner merge):
df_pre_activated = df.merge(
df_pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"]
)
df_pre_activated
looks like this:
timepoint measurement roi value group
0 0 1 3 0.5 control
1 1 1 3 0.6 control
2 2 1 3 0.8 control
3 3 1 3 0.9 control
4 0 3 1 0.5 control
5 1 3 1 0.6 control
6 2 3 1 0.8 control
7 3 3 1 0.9 control
To compute df_filtered (df without the rows of df_pre_activated), we do a left merge between df and df_pre_activated and keep the rows where the values coming from df_pre_activated are null:
df_filtered = df.merge(
df_pre_activated,
how="left",
on=["timepoint", "measurement", "roi", "value"]
)
df_filtered = df_filtered[pd.isna(df_filtered["group_y"])]
df_filtered looks like this:
timepoint measurement roi value group_x group_y
0 0 1 1 0.1 control NaN
1 1 1 1 0.2 control NaN
2 2 1 1 0.3 control NaN
3 3 1 1 0.4 control NaN
4 0 1 2 0.1 control NaN
5 1 1 2 0.2 control NaN
6 2 1 2 0.3 control NaN
7 3 1 2 0.4 control NaN
12 0 2 1 0.1 control NaN
13 1 2 1 0.2 control NaN
14 2 2 1 0.3 control NaN
15 3 2 1 0.4 control NaN
20 0 3 2 0.1 control NaN
21 1 3 2 0.2 control NaN
22 2 3 2 0.3 control NaN
23 3 3 2 0.4 control NaN
Finally, we drop the group_y column and set the column names back to their original values:
# Reassign instead of inplace=True, since df_filtered is a filtered selection
df_filtered = df_filtered.drop("group_y", axis=1)
df_filtered.columns = list(df.columns)
df_filtered now has the same columns as df again.
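A possible variant of the same idea, not part of the original answer: pandas' merge accepts indicator=True, which adds a _merge column saying where each row came from, so unmatched rows can be kept without merging on value or renaming columns afterwards. A sketch on a reduced frame, assuming each (measurement, roi) pair has exactly one timepoint-0 row; the name keys is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "timepoint":   [0, 1, 0, 1, 0, 1],
    "measurement": [1, 1, 1, 1, 3, 3],
    "roi":         [1, 1, 3, 3, 1, 1],
    "value":       [0.1, 0.2, 0.5, 0.6, 0.5, 0.6],
    "group":       "control",
})
threshold = 0.4

# One row per pre-activated (measurement, roi) pair.
keys = df.loc[(df["timepoint"] == 0) & (df["value"] > threshold),
              ["measurement", "roi"]].drop_duplicates()

# Left merge on the key columns only; _merge marks matched rows as "both".
merged = df.merge(keys, how="left", on=["measurement", "roi"], indicator=True)

df_pre_activated = merged[merged["_merge"] == "both"].drop(columns="_merge")
df_filtered = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Because keys only carries the two key columns, no group_x/group_y suffixes appear and the float value column is never used as a join key.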
Answer 1 (score: 0)
It is just this:
In:
df[(df["measurement"] != 1) | (df["roi"] != 3)]
Out:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
16 0 3 1 0.5 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
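This comes down to Boolean logic. You are assuming that "show me the rows where a is not 1 and b is not 3" is the same as "show me the rows where not (a is 1 and b is 3)", but the first version removes every row with a = 1 and every row with b = 3.
What you need is "a is not 1 or b is not 3", which is the same as "not (a is 1 and b is 3)".
Hope this helps. It fits in one line.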
Edit: to remove both 1:3 and 3:1, combine the two OR conditions with AND:
df[((df["measurement"] != 1) | (df["roi"] != 3)) & ((df["measurement"] != 3) | (df["roi"] != 1))]
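The rule at work here is De Morgan's law: not (A and B) equals (not A) or (not B). A quick check on a few made-up (measurement, roi) rows, which also shows how the condition for one pair can be written with ~; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"measurement": [1, 1, 3, 3], "roi": [1, 3, 1, 3]})

# "Keep everything except the pair (1, 3)", written two equivalent ways.
keep_or = (df["measurement"] != 1) | (df["roi"] != 3)
keep_not_and = ~((df["measurement"] == 1) & (df["roi"] == 3))

# The tempting but wrong version also drops (1, 1) and (3, 3).
keep_and = (df["measurement"] != 1) & (df["roi"] != 3)

print(keep_or.equals(keep_not_and))  # True
print(keep_and.tolist())             # [False, False, True, False]
```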
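Edit 2: to drop the filtered rows directly, you can invert the filter condition instead of filtering first and removing afterwards.
In: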
threshold = 0.4
# Exact inverse of (timepoint == 0) & (value > threshold), hence <= rather than <
full_activated = df5[(df5['timepoint'] != 0) | (df5['value'] <= threshold)]
full_activated
Out:
timepoint measurement roi value group
0 0 1 1 0.1 control
1 1 1 1 0.2 control
2 2 1 1 0.3 control
3 3 1 1 0.4 control
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
9 1 1 3 0.6 control
10 2 1 3 0.8 control
11 3 1 3 0.9 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
17 1 3 1 0.6 control
18 2 3 1 0.8 control
19 3 3 1 0.9 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
Edit 3:
Multiple conditions
threshold = 0.4
full_activated = df5[
    ((df5['timepoint'] != 0) | (df5['value'] <= threshold))
    & ((df5["measurement"] != 1) | (df5["roi"] != 3))
    & ((df5["measurement"] != 3) | (df5["roi"] != 1))
    & ((df5["measurement"] != 1) | (df5["roi"] != 1))
]
full_activated
Output:
timepoint measurement roi value group
4 0 1 2 0.1 control
5 1 1 2 0.2 control
6 2 1 2 0.3 control
7 3 1 2 0.4 control
12 0 2 1 0.1 control
13 1 2 1 0.2 control
14 2 2 1 0.3 control
15 3 2 1 0.4 control
20 0 3 2 0.1 control
21 1 3 2 0.2 control
22 2 3 2 0.3 control
23 3 3 2 0.4 control
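Answer 2 (score: 0)
Thanks to @Jose A. Jimenez and @Vioxini for your answers. I went with Jose's suggestion, which gives me the desired output. I also used dask. The input has this shape: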
inputdf.shape
(73124, 5)
Using pandas only:
import pandas as pd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])
filtereddf = inputdf.merge(
pre_activated,
how="left",
on=["timepoint", "measurement", "roi", "value"],
)
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)
This takes 2 min 9 s.
Now with dask:
import dask.dataframe as dd
threshold = 0.4
pre_activated_t0 = inputdf[(inputdf['timepoint'] == 0) & (inputdf['value'] > threshold)]
pre_activated = inputdf.merge(pre_activated_t0[["measurement", "roi"]], how="inner", on=["measurement", "roi"])
input_dd = dd.from_pandas(inputdf, npartitions=3)
pre_dd = dd.from_pandas(pre_activated, npartitions=3)
merger = dd.merge(input_dd,pre_dd, how="left", on=["timepoint", "measurement", "roi", "value"])
filtereddf = merger.compute()
filtereddf = filtereddf[pd.isna(filtereddf["group_y"])]
filtereddf.drop("group_y", axis=1, inplace=True)
filtereddf.columns = list(inputdf.columns)
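Now it only takes 42.6 s :-)
This is my first time using dask, so there may be options I am not aware of that would speed it up further, but for now this is fine.
Thanks again for your help!
Edit:
When converting the pandas dataframe to a dask dataframe, I increased the npartitions option from 3 to npartitions=30, which improved performance further: it now takes only 9.87 s.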
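Since efficiency was part of the question, here is one merge-free alternative that may be worth timing. It is a sketch rather than a benchmarked result, and it assumes every (measurement, roi) pair has a single timepoint-0 row. The idea is to broadcast each pair's timepoint-0 value to all rows of that pair with groupby(...).transform and compare it against the threshold. The names t0_only, t0_value and is_pre_activated are illustrative:

```python
import pandas as pd

inputdf = pd.DataFrame({
    "timepoint":   [0, 1, 0, 1, 0, 1],
    "measurement": [1, 1, 1, 1, 3, 3],
    "roi":         [1, 1, 3, 3, 1, 1],
    "value":       [0.1, 0.9, 0.5, 0.6, 0.5, 0.2],
})
threshold = 0.4

# Keep only timepoint-0 values (NaN elsewhere), then spread each pair's
# timepoint-0 value over all rows of that pair.
t0_only = inputdf["value"].where(inputdf["timepoint"] == 0)
t0_value = t0_only.groupby([inputdf["measurement"], inputdf["roi"]]).transform("max")

is_pre_activated = t0_value > threshold
pre_activated = inputdf[is_pre_activated]
filtereddf = inputdf[~is_pre_activated]
```

If roi numbers repeat across groups, the group column would need to be added to the grouping keys as well.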