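Deleting "similar" rows in pandas

Time: 2019-07-11 13:32:30

Tags: python pandas csv rows

I am trying to delete similar rows in a data frame. There were some errors in the data collection for my file, so I ended up with this problem: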

Dates   Last Price  Relative Share Price Momentum   RSI 30 Day  Relative 3 Month Eqty/Index     Relative 1 Month Eqty/Index     Sales/Diluted Sh    Revenue Growth Year over Year
1/31/2018   3881.0  -2.132  51.4152     4.526   -0.989  5.7376  -32.4057    0.6103  8.723   ...     1.3726  2.0628  0.9059  16.7236     2.6494  2.7217  26.2718     9.9759  17.553  23.475
2/28/2018   3883.0  3.251   51.4332     10.254  4.225   5.7376  -32.4057    0.6103  8.803   ...     1.3726  2.0852  0.8181  16.7322     2.6507  2.7231  26.2718     9.9759  13.771  23.045
*3/1/2018*  3883.0  3.251   51.4332     10.254  4.225   8.8678  4.7481  -14.9557    8.803   ...     1.0180  2.0852  0.8181  16.7322     2.6507  2.7231  15.5694     9.1429  13.771  23.045
*3/30/2018* 3700.0  5.646   49.6923     0.773   -2.346  8.8678  4.7481  -14.9557    8.388   ...     1.0180  1.9431  0.8499  17.2796     2.4121  2.5267  15.5694     9.1429  15.880  22.033
4/30/2018   4281.0  6.475   54.7253     10.663  8.728   8.8678  4.7481  -14.9557    10.599  ...     1.0180  2.1033  1.1068  19.9930     2.7909  2.9234  15.5694     9.1429  28.096  21.213
5/31/2018   4215.0  13.367  54.0894     2.241   -3.708  8.8678

The data is supposed to be monthly, but for some reason there are points where the same month has two values.

I want this:

Dates   Last Price  Relative Share Price Momentum   RSI 30 Day  Relative 3 Month Eqty/Index     Relative 1 Month Eqty/Index     Sales/Diluted Sh    Revenue Growth Year over Year
1/31/2018   3881.0  -2.132  51.4152     4.526   -0.989  5.7376  -32.4057    0.6103  8.723   ...     1.3726  2.0628  0.9059  16.7236     2.6494  2.7217  26.2718     9.9759  17.553  23.475
2/28/2018   3883.0  3.251   51.4332     10.254  4.225   5.7376  -32.4057    0.6103  8.803   ...     1.3726  2.0852  0.8181  16.7322     2.6507  2.7231  26.2718     9.9759  13.771  23.045
3/30/2018   3883.0  3.251   51.4332     10.254  4.225   8.8678  4.7481  -14.9557    8.803   ...     1.0180  2.0852  0.8181  16.7322     2.6507  2.7231  15.5694     9.1429  13.771  23.045
4/30/2018   4281.0  6.475   54.7253     10.663  8.728   8.8678  4.7481  -14.9557    10.599  ...     1.0180  2.1033  1.1068  19.9930     2.7909  2.9234  15.5694     9.1429  28.096  21.213
5/31/2018   4215.0  13.367  54.0894     2.241   -3.708  8.8678

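My guess is that I should combine df.drop_duplicates with df.loc. I need to write code that says: if the month in df['Dates'] is the same in two consecutive rows, delete one of them (which one doesn't really matter, tbh).

EDIT2: Since nobody seems to know the answer, I changed the data frame again: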

  Month Day     Year    Price names     Variable   Variable   Variable
    1   31.0    1990.0  1.2143  AAPL    47.0287     -24.3754    3.5821  
    2   28.0    1990.0  1.2143  AAPL    47.0287     -19.8995    -0.8467     36.713  39.377
    3   31.0    1990.0  1.4375  AAPL    49.7818     18.7056     15.5790     0.3787  14.7951     40.891  42.742
    4   29.0    1990.0  1.4063  AAPL    49.4099     15.2067     0.5290  0.3787  ...     0.0371  0.7548  3.1297  14.7951     35.632  39.694
    4   30.0    1990.0  1.4732  AAPL    50.2341     11.4693     -4.0632     0.3787  ...     0.0371  0.7459  3.2787  14.7951     32.273  37.271
    5   31.0    1990.0  1.5982  AAPL    51.7520

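With this format, hopefully someone can solve it more easily. I want to delete rows if df['Month'] = df['Year'] = df['names'], that is, rows that share the same Month, Year and names with another row.

My thinking was that I could do something like this: Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError

I had no luck trying this: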

df = df.drop(df[(df.Month == df.Year) & (df.Month == df.names)].index)

EDIT2: I was able to do this:

df[~df.duplicated(['Month', 'Year', 'Name'], keep=False)]

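This removes the rows with duplicated months entirely: it does not keep one of them but drops both, which is not what I want. Maybe someone can tweak this so that one of the rows is kept?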
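For reference, a minimal sketch of that tweak. It assumes the column is spelled `names` as in the table above (the snippet above uses 'Name') and uses a few values copied from the sample rows. With `keep='first'`, `duplicated` flags only the second and later occurrences, so one row per (Month, Year, names) survives; `keep='last'` would keep the later row instead.

```python
import pandas as pd

# Small frame built from a few of the sample rows above (extra columns omitted)
df = pd.DataFrame({
    "Month": [3, 4, 4, 5],
    "Day": [31.0, 29.0, 30.0, 31.0],
    "Year": [1990.0, 1990.0, 1990.0, 1990.0],
    "Price": [1.4375, 1.4063, 1.4732, 1.5982],
    "names": ["AAPL", "AAPL", "AAPL", "AAPL"],
})

key = ["Month", "Year", "names"]

# keep="first" marks only repeats after the first occurrence,
# so exactly one row per key combination is retained
deduped = df[~df.duplicated(key, keep="first")]

# drop_duplicates is the equivalent shortcut
same = df.drop_duplicates(subset=key, keep="first")

print(deduped)
```

Here the 4/29 row is kept and the 4/30 row is dropped; switching to `keep="last"` reverses that choice.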

Thanks for all the help!

1 Answer:

Answer 0: (score: 0)

Try querying the data frame with df.query

df = df.query("(month != year) & (month != names)")

Since the same month occurs under different stock names, try grouping the data by name and flagging the duplicated rows

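# marks the rows with duplicate months within a stock name group:
# 0 = same month as the previous row of that stock, 1 = anything else
# (a negative diff, e.g. December to January, also counts as a new month)
df['duplicate_months'] = df.groupby('names')['Month'].diff().fillna(1).ne(0).astype(int)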

# querying the df would eliminate these duplicate rows
# keeps the month row which is marked as 1 in the df 
df = df.query('duplicate_months != 0')
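For the original layout with a single Dates column, the same idea can be sketched without splitting the date into parts. This is only an illustration under assumptions: the dates are taken to be month/day/year strings, the toy frame holds just the first two columns of the sample, and `month_key` and `monthly` are names introduced here. Each date is reduced to its year-month period and one row per period is kept.

```python
import pandas as pd

# First two columns of the sample data from the question
df = pd.DataFrame({
    "Dates": ["1/31/2018", "2/28/2018", "3/1/2018", "3/30/2018", "4/30/2018"],
    "Last Price": [3881.0, 3883.0, 3883.0, 3700.0, 4281.0],
})

# Assumption: Dates are month/day/year strings
dates = pd.to_datetime(df["Dates"], format="%m/%d/%Y")

# to_period("M") collapses every date to its year-month, e.g. 2018-03
month_key = dates.dt.to_period("M")

# Keep the last row seen for each year-month; use keep="first" for the earliest
monthly = df[~month_key.duplicated(keep="last")]

print(monthly)
```

If the frame holds several stocks, the stock name would also need to be part of the key, for example by calling `duplicated` on a frame made of the name column and `month_key`.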