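I am trying to remove near-duplicate rows from a DataFrame. There were some errors in the data collection for my file, so I ran into this problem: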
Dates Last Price Relative Share Price Momentum RSI 30 Day Relative 3 Month Eqty/Index Relative 1 Month Eqty/Index Sales/Diluted Sh Revenue Growth Year over Year
1/31/2018 3881.0 -2.132 51.4152 4.526 -0.989 5.7376 -32.4057 0.6103 8.723 ... 1.3726 2.0628 0.9059 16.7236 2.6494 2.7217 26.2718 9.9759 17.553 23.475
2/28/2018 3883.0 3.251 51.4332 10.254 4.225 5.7376 -32.4057 0.6103 8.803 ... 1.3726 2.0852 0.8181 16.7322 2.6507 2.7231 26.2718 9.9759 13.771 23.045
*3/1/2018* 3883.0 3.251 51.4332 10.254 4.225 8.8678 4.7481 -14.9557 8.803 ... 1.0180 2.0852 0.8181 16.7322 2.6507 2.7231 15.5694 9.1429 13.771 23.045
*3/30/2018* 3700.0 5.646 49.6923 0.773 -2.346 8.8678 4.7481 -14.9557 8.388 ... 1.0180 1.9431 0.8499 17.2796 2.4121 2.5267 15.5694 9.1429 15.880 22.033
4/30/2018 4281.0 6.475 54.7253 10.663 8.728 8.8678 4.7481 -14.9557 10.599 ... 1.0180 2.1033 1.1068 19.9930 2.7909 2.9234 15.5694 9.1429 28.096 21.213
5/31/2018 4215.0 13.367 54.0894 2.241 -3.708 8.8678
The data should be monthly, but for some reason a few months have two values.
I want this:
Dates Last Price Relative Share Price Momentum RSI 30 Day Relative 3 Month Eqty/Index Relative 1 Month Eqty/Index Sales/Diluted Sh Revenue Growth Year over Year
1/31/2018 3881.0 -2.132 51.4152 4.526 -0.989 5.7376 -32.4057 0.6103 8.723 ... 1.3726 2.0628 0.9059 16.7236 2.6494 2.7217 26.2718 9.9759 17.553 23.475
2/28/2018 3883.0 3.251 51.4332 10.254 4.225 5.7376 -32.4057 0.6103 8.803 ... 1.3726 2.0852 0.8181 16.7322 2.6507 2.7231 26.2718 9.9759 13.771 23.045
3/30/2018 3883.0 3.251 51.4332 10.254 4.225 8.8678 4.7481 -14.9557 8.803 ... 1.0180 2.0852 0.8181 16.7322 2.6507 2.7231 15.5694 9.1429 13.771 23.045
4/30/2018 4281.0 6.475 54.7253 10.663 8.728 8.8678 4.7481 -14.9557 10.599 ... 1.0180 2.1033 1.1068 19.9930 2.7909 2.9234 15.5694 9.1429 28.096 21.213
5/31/2018 4215.0 13.367 54.0894 2.241 -3.708 8.8678
My guess is that I should combine df.drop_duplicates with df.loc.
I need code that says: if the month in df['Dates'] is the same in two consecutive rows, delete one of them (tbh it doesn't matter which one).
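A minimal sketch of the rule described above, assuming the Dates column holds month/day/year strings as in the sample. The tiny frame here is a hypothetical stand-in with only two of the columns, and keeping the last row of each month is an arbitrary choice, since either row is acceptable:

```python
import pandas as pd

# Hypothetical stand-in for the real frame (only two of its columns)
df = pd.DataFrame({
    "Dates": ["1/31/2018", "2/28/2018", "3/1/2018", "3/30/2018", "4/30/2018"],
    "Last Price": [3881.0, 3883.0, 3883.0, 3700.0, 4281.0],
})

# Parse the strings, then reduce each date to its year-month period
dates = pd.to_datetime(df["Dates"], format="%m/%d/%Y")
month_key = dates.dt.to_period("M")

# Keep only the last row seen for each year-month; earlier repeats are dropped
deduped = df[~month_key.duplicated(keep="last")].reset_index(drop=True)
print(deduped)
```

Passing keep="first" instead would retain the 3/1/2018 row rather than the 3/30/2018 one.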
EDIT: Since nobody seems to know the answer, I changed the DataFrame again:
Month Day Year Price names Variable Variable Variable
1 31.0 1990.0 1.2143 AAPL 47.0287 -24.3754 3.5821
2 28.0 1990.0 1.2143 AAPL 47.0287 -19.8995 -0.8467 36.713 39.377
3 31.0 1990.0 1.4375 AAPL 49.7818 18.7056 15.5790 0.3787 14.7951 40.891 42.742
4 29.0 1990.0 1.4063 AAPL 49.4099 15.2067 0.5290 0.3787 ... 0.0371 0.7548 3.1297 14.7951 35.632 39.694
4 30.0 1990.0 1.4732 AAPL 50.2341 11.4693 -4.0632 0.3787 ... 0.0371 0.7459 3.2787 14.7951 32.273 37.271
5 31.0 1990.0 1.5982 AAPL 51.7520
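With this format, hopefully someone can solve it more easily. I want to delete rows where df['Month'], df['Year'] and df['names'] match, i.e. where the same month of the same year appears twice for the same name.
My idea was that something like this could work: Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError
I had no luck trying this: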
df = df.drop(df[(df.Month == df.Year) & (df.Month == df.names)].index)
EDIT2: I was able to do this:
df[~df.duplicated(['Month', 'Year', 'Name'], keep=False)]
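This removes the rows with duplicated months entirely: instead of keeping one row, it drops both, which is not what I want. Maybe someone can adjust it so that one of the rows is kept?
Thanks for all the help!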
Answer 0 (score: 0)
Try querying the DataFrame with df.query:
df = df.query("(month != year) & (month != names)")
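Since the same month occurs under different stock names, try grouping the data by name and flagging the duplicate rows:
# flag rows whose month repeats the previous row's month within each stock-name group
# (1 = keep, 0 = duplicate); .ne(0) also keeps year rollovers such as December -> January
df['duplicate_months'] = df.groupby('names')['Month'].diff().fillna(1).ne(0).astype(int)
# filtering on the flag removes the duplicates,
# keeping the first row of each run of repeated months
df = df.query('duplicate_months != 0')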
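As an alternative sketch, the duplicated(..., keep=False) attempt from the question can be adjusted so that one row per month survives. This assumes the columns are named Month, Day, Year and names as in the sample table; the small frame below is a hypothetical stand-in built from the first few columns:

```python
import pandas as pd

# Hypothetical stand-in using the first few columns of the sample table
df = pd.DataFrame({
    "Month": [1, 2, 3, 4, 4, 5],
    "Day": [31.0, 28.0, 31.0, 29.0, 30.0, 31.0],
    "Year": [1990.0] * 6,
    "Price": [1.2143, 1.2143, 1.4375, 1.4063, 1.4732, 1.5982],
    "names": ["AAPL"] * 6,
})

# Sort so that, within each name/year/month, the latest day comes last
df = df.sort_values(["names", "Year", "Month", "Day"])

# keep="last" retains one row per (Month, Year, names) instead of dropping all of them
deduped = df.drop_duplicates(subset=["Month", "Year", "names"], keep="last")
deduped = deduped.reset_index(drop=True)
print(deduped)
```

Unlike keep=False, which flags every member of a duplicate group, keep="first" or keep="last" leaves exactly one row per group.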