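I have a DataFrame with columns A, B and C. I want to compare columns B and C and create two new columns, A-1 (the current row's year minus 1) and A-2 (the current row's year minus 2), where A is the year column.
Example: in the DataFrame, for 1971 column B has apple and orange while column C has only apple; for 1970, B has banana and apple and C has apple. We now want to generate the values of the A-1 (1970) and A-2 (1969) columns for each 1971 row. In A-1 (1970), apple was captured in both columns B and C, so we mark the first two 1971 rows "yes" and the third row "no", because there is no orange in 1970.
So, for each row, we take the year (e.g. 1971) and check whether that row's B value was also captured in column C in year-1 (1970) and in year-2 (1969; here the result is nan, because the DataFrame has no records for 1969), and mark the row accordingly.
DataFrame: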
A B C D
1971 apple apple yes
1971 apple apple yes
1971 orange nan no
1970 banana nan no
1970 apple apple yes
1972 mango mango yes
1972 banana banana yes
1972 orange orange yes
1972 apple apple yes
1973 banana nan no
1973 mango mango yes
1973 apple nan no
1974 orange nan no
Output:
A B C A-1 A-2
1971 apple apple yes nan
1971 apple apple yes nan
1971 orange nan no nan
1970 banana nan nan nan
1970 apple apple nan nan
1972 mango mango no no
1972 banana banana no no
1972 orange orange no no
1972 apple apple yes yes
1973 banana nan yes no
1973 mango mango yes no
1973 apple nan yes yes
1974 orange nan no yes
I don't know how to approach this; please help.
Answer 0 (score: 1)
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'A': [1971, 1971, 1971, 1970, 1970, 1972, 1972, 1972, 1972, 1973, 1973, 1973, 1974], 'B': ['apple', 'apple', 'orange', 'banana', 'apple', 'mango', 'banana', 'orange', 'apple', 'banana', 'mango', 'apple', 'orange'], 'C': ['apple', 'apple', nan, nan, 'apple', 'mango', 'banana', 'orange', 'apple', nan, 'mango', nan, nan]})
# add an index column to the DataFrame
df = df.reset_index()
df['BC'] = np.where(df['B'] == df['C'], df['B'], nan)
A_min = df['A'].min()
for i in [1, 2]:
    col = 'A-{}'.format(i)
    col2 = 'Y+{}'.format(i)
    df[col2] = df['A']+i
    # fill with nans (object dtype, so that strings can be assigned below)
    df[col] = pd.Series(nan, index=df.index, dtype=object)
    # place 'no' except where there is no data for the year A-i
    mask = df['A']-i >= A_min
    df.loc[mask, col] = 'no'
    # place 'yes' where 'A','B' columns match 'Y+i','BC' columns
    match = pd.merge(df[['A','B','index']], df[[col2, 'BC']],
                     left_on=['A','B'], right_on=[col2,'BC'])
    df.loc[match['index'], col] = 'yes'
df = df.drop(['index', 'BC', 'Y+1', 'Y+2'], axis=1)
print(df)
yields
A B C A-1 A-2
0 1971 apple apple yes NaN
1 1971 apple apple yes NaN
2 1971 orange NaN no NaN
3 1970 banana NaN NaN NaN
4 1970 apple apple NaN NaN
5 1972 mango mango no no
6 1972 banana banana no no
7 1972 orange orange no no
8 1972 apple apple yes yes
9 1973 banana NaN yes no
10 1973 mango mango yes no
11 1973 apple NaN yes yes
12 1974 orange NaN no yes
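First, let's add an index column to the DataFrame. The purpose will become clearer later. (Note that I'm assuming here that the original index of your DataFrame has unique values. We will rely on that property later...)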
df = df.reset_index()
# index A B C
# 0 0 1971 apple apple
# 1 1 1971 apple apple
# 2 2 1971 orange NaN
# 3 3 1970 banana NaN
# 4 4 1970 apple apple
# 5 5 1972 mango mango
# 6 6 1972 banana banana
# 7 7 1972 orange orange
# 8 8 1972 apple apple
# 9 9 1973 banana NaN
# 10 10 1973 mango mango
# 11 11 1973 apple NaN
# 12 12 1974 orange NaN
Since we want to identify rows with the same value in columns B and C, we create a BC column which equals B when B and C are equal, and NaN when they are not:
In [123]: df['BC'] = np.where(df['B'] == df['C'], df['B'], nan)
In [124]: df
Out[124]:
index A B C BC
0 0 1971 apple apple apple
1 1 1971 apple apple apple
2 2 1971 orange NaN NaN
3 3 1970 banana NaN NaN
4 4 1970 apple apple apple
5 5 1972 mango mango mango
6 6 1972 banana banana banana
7 7 1972 orange orange orange
8 8 1972 apple apple apple
9 9 1973 banana NaN NaN
10 10 1973 mango mango mango
11 11 1973 apple NaN NaN
12 12 1974 orange NaN NaN
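One detail worth checking in this step: a comparison against NaN is always False, so rows where C is missing end up in the NaN branch of np.where. A minimal sketch on a made-up three-row frame (the name `demo` and its values are illustrative only, not part of the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, only to illustrate the comparison.
demo = pd.DataFrame({'B': ['apple', 'orange', 'kiwi'],
                     'C': ['apple', np.nan, 'plum']})

# Element-wise equality: a NaN on either side compares as False.
same = demo['B'] == demo['C']

# Keep B where the two columns agree, NaN everywhere else.
demo['BC'] = np.where(same, demo['B'], np.nan)

print(same.tolist())
print(demo['BC'].tolist())
```

Only the first row keeps its value; the row with a missing C and the row with a different C both become NaN in BC.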
Now we will be matching rows from different years, so let's add a column identifying the year we want to compare against. For example, we want to compare rows where A is 1971 with rows where Y+1 equals 1971:
In [125]: df['Y+1'] = df['A']+1; df
Out[125]:
index A B C BC Y+1
0 0 1971 apple apple apple 1972
1 1 1971 apple apple apple 1972
2 2 1971 orange NaN NaN 1972
3 3 1970 banana NaN NaN 1971
4 4 1970 apple apple apple 1971
5 5 1972 mango mango mango 1973
6 6 1972 banana banana banana 1973
7 7 1972 orange orange orange 1973
8 8 1972 apple apple apple 1973
9 9 1973 banana NaN NaN 1974
10 10 1973 mango mango mango 1974
11 11 1973 apple NaN NaN 1974
12 12 1974 orange NaN NaN 1975
With this setup, we can identify the rows which should be marked "yes" by merging df with itself, matching columns A and B against columns Y+1 and BC:
In [127]: pd.merge(df[['A','B','index']], df[['Y+1', 'BC']], left_on=['A','B'], right_on=['Y+1','BC'])
Out[127]:
A B index Y+1 BC
0 1971 apple 0 1971 apple
1 1971 apple 1 1971 apple
2 1972 apple 8 1972 apple
3 1972 apple 8 1972 apple
4 1973 banana 9 1973 banana
5 1973 mango 10 1973 mango
6 1973 apple 11 1973 apple
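The merge result lists index 8 twice because 1971 contains two rows whose BC is apple. That is harmless when the result is only used to set "yes", but the right-hand side can be de-duplicated first so that each left row matches at most once. A sketch of that variation, assuming the same sample data as in the question (the name `right` is just for illustration):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'A': [1971, 1971, 1971, 1970, 1970, 1972, 1972, 1972, 1972,
          1973, 1973, 1973, 1974],
    'B': ['apple', 'apple', 'orange', 'banana', 'apple', 'mango', 'banana',
          'orange', 'apple', 'banana', 'mango', 'apple', 'orange'],
    'C': ['apple', 'apple', nan, nan, 'apple', 'mango', 'banana', 'orange',
          'apple', nan, 'mango', nan, nan]}).reset_index()
df['BC'] = np.where(df['B'] == df['C'], df['B'], nan)
df['Y+1'] = df['A'] + 1

# One row per distinct (year, fruit) pair on the right-hand side;
# rows where B and C did not agree (BC is NaN) are dropped as well.
right = df[['Y+1', 'BC']].dropna().drop_duplicates()

match = pd.merge(df[['A', 'B', 'index']], right,
                 left_on=['A', 'B'], right_on=['Y+1', 'BC'])
print(match['index'].tolist())
```

With the sample data each of the indices 0, 1, 8, 9, 10 and 11 now appears exactly once.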
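Notice that the index column gives the row index of the rows that should contain yes in the A-1 column. This is the purpose of the df = df.reset_index() call above. Without it, we would lose the original index when merging.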
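To see why the explicit index column matters: pd.merge on columns returns a frame with a fresh 0..n-1 index, so the row labels of the left frame are gone afterwards unless they were carried along as an ordinary column. A small sketch with made-up frames (`left`, `right` and the `key` column are illustrative names, not from the question):

```python
import pandas as pd

# Hypothetical frames; the left one has non-default row labels.
left = pd.DataFrame({'key': ['a', 'b', 'c']}, index=[10, 20, 30])
right = pd.DataFrame({'key': ['b', 'c']})

# Merging on a column discards the left frame's row labels.
merged = pd.merge(left, right, on='key')
print(merged.index.tolist())    # fresh positional index

# Carrying the labels as a column keeps them available after the merge.
kept = pd.merge(left.reset_index(), right, on='key')
print(kept['index'].tolist())   # original labels of the matched rows
```

The second merge is the pattern used above: the labels in kept['index'] can be passed straight to df.loc.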
Answer 1 (score: 0)
Here is code for your problem:
import pandas as pd

colnames = ['A', 'B', 'C']
# sep=r"\s+" splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv("file1.txt", header=None, sep=r"\s+")
df.columns = colnames
# print(df.head())
# Take a copy of the columns of interest (df.ix has been removed from pandas).
df2 = df[['A', 'B', 'C']].copy()
# print("first df2 head :", df2.head())
list1 = []
list2 = []
for i in range(len(df['A'])):
    valoneless = df['A'][i] - 1
    valtwoless = df['A'][i] - 2
    if any(df['A'] == valoneless):
        # 'yes' only if this row's B value appears in column C for year A-1
        if any((df['A'] == valoneless) & (df['C'] == df['B'][i])):
            list1.append('yes')
        else:
            list1.append('no')
    else:
        # no records at all for year A-1
        list1.append('nan')
    if any(df['A'] == valtwoless):
        # same check against year A-2
        if any((df['A'] == valtwoless) & (df['C'] == df['B'][i])):
            list2.append('yes')
        else:
            list2.append('no')
    else:
        list2.append('nan')
df2['A-1'] = list1
df2['A-2'] = list2
print(df2.head())
df2.to_csv("firstcolumn.txt", index=False, sep=" ", na_rep='nan')
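The same check can also be written without a merge and without indexing into the DataFrame row by row: collect every (year, fruit) pair that was captured in column C, then look up (A - k, B) for each row. This is only a sketch of an alternative, not the method of either answer. The helper name `flag_previous_years` is hypothetical, and it assumes that "captured" means the value appears in column C for that year.

```python
import numpy as np
import pandas as pd


def flag_previous_years(df, lags=(1, 2)):
    """Add an 'A-k' column per lag: 'yes'/'no', or NaN if year A-k has no rows."""
    out = df.copy()
    years = set(out['A'])
    # (year, fruit) pairs that were captured in column C
    seen = out.dropna(subset=['C'])
    captured = set(zip(seen['A'], seen['C']))
    for k in lags:
        flags = []
        for year, fruit in zip(out['A'], out['B']):
            if year - k not in years:
                flags.append(np.nan)      # no records for that year
            elif (year - k, fruit) in captured:
                flags.append('yes')
            else:
                flags.append('no')
        out['A-{}'.format(k)] = flags
    return out


nan = np.nan
df = pd.DataFrame({
    'A': [1971, 1971, 1971, 1970, 1970, 1972, 1972, 1972, 1972,
          1973, 1973, 1973, 1974],
    'B': ['apple', 'apple', 'orange', 'banana', 'apple', 'mango', 'banana',
          'orange', 'apple', 'banana', 'mango', 'apple', 'orange'],
    'C': ['apple', 'apple', nan, nan, 'apple', 'mango', 'banana', 'orange',
          'apple', nan, 'mango', nan, nan]})
result = flag_previous_years(df)
print(result)
```

Unlike the A_min mask in Answer 0, the `years` set also handles gaps in the year sequence, since a missing year in the middle yields NaN rather than "no".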