How can I compare only part of a value using pandas?

Time: 2017-11-09 19:21:09

Tags: python pandas

I need to compare codes across two dataframes. I am using Python 3 and pandas.

In the first dataset, the codes are always 18 characters long:

import pandas as pd

dividas_dep = pd.read_csv("dividas_deputados_ajustado_csv.csv", sep=';', encoding='latin_1')

dividas_dep.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106 entries, 0 to 105
Data columns (total 10 columns):
CPF_Deputado                  106 non-null object
CPF_limpo                     106 non-null int64
Nome_Deputado                 106 non-null object
Vinculo                       106 non-null object
CNPJ_Devedor                  106 non-null object
CNPJ_limpo                    106 non-null int64
Nome_Devedor                  106 non-null object
Valores_situacao_Irregular    65 non-null object
Valores_situacao_Regular      52 non-null object
Total_Devido                  106 non-null object
dtypes: int64(2), object(8)
memory usage: 8.4+ KB

The column to compare in this first dataset ("CNPJ_Devedor") holds values such as: 17.080.201/0001-49, 76.205.723/0001-99, 04.885.828/0001-25 ...

In the second dataset, the codes are always 10 characters long:

funrural = pd.read_excel('DEVEDORES FUNRURAL ATUALIZADO PGFN.xlsx')

funrural.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8130 entries, 0 to 8129
Data columns (total 14 columns):
PSFN_PGFN               8129 non-null object
Regiao                  8129 non-null object
CNPJ_CEI_Tipo           8129 non-null object
CNPJ_Raiz               8129 non-null object
Razao_Social            8130 non-null object
Valor_principal         8130 non-null float64
Valor_TR_IPC_Poup       8130 non-null float64
Valor_Juros             8130 non-null float64
Valor_SELIC             8130 non-null float64
Valor_Encargo           8130 non-null float64
Valor_Multa_Oficio      8130 non-null float64
Valor_Selic_M_Oficio    8130 non-null float64
Vl_Multa_Mora           8130 non-null float64
Vl_Tot_Credito          8130 non-null float64
dtypes: float64(9), object(5)
memory usage: 889.3+ KB

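The column to compare in this second dataset ("CNPJ_Raiz") holds values such as: 04.244.173, 05.006.407, 03.632.132 ...

The "CNPJ_Devedor" and "CNPJ_Raiz" codes are related under tax law, but I cannot do a simple merge like this, because the full 18-character values never equal the 10-character ones: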

compara1 = pd.merge(dividas_dep, funrural, left_on='CNPJ_Devedor', right_on='CNPJ_Raiz')

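What I need is to compare only the first 10 characters of "CNPJ_Devedor" with the "CNPJ_Raiz" code (for example, use only "17.080.201" from "17.080.201/0001-49").

Is there a way to do this in Python? Or should I edit the source file for dividas_dep (dividas_deputados_ajustado_csv.csv) to create a new column with only the first 10 characters?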

1 Answer:

Answer 0: (score: 0)

You can take a slice of the first 10 characters of each string with .str.slice(None, 10) and compare that:

dividas_dep["CNPJ_Devedor"].str.slice(None, 10) == funrural["CNPJ_Raiz"]

Example:

>>> dividas_dep = pd.DataFrame({"CNPJ_Devedor": ['17.080.201/0001-49', '76.205.723/0001-99', '04.885.828/0001-25']})
>>> funrural = pd.DataFrame({"CNPJ_Raiz": ['17.080.201', '04.244.173', '05.006.407']})
>>> dividas_dep["CNPJ_Devedor"].str.slice(None, 10) == funrural["CNPJ_Raiz"]
0     True
1    False
2    False
dtype: bool
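
On the question of whether the CSV file itself has to be edited: the sliced values can also be stored as a new column on the dataframe in memory. The following is a minimal sketch on made-up sample rows in the format described in the question; the column name "CNPJ_Raiz_Devedor" is hypothetical, and the split-based variant assumes the root is always the part before the "/".

```python
import pandas as pd

# Sample rows in the format described in the question (not the real file).
dividas_dep = pd.DataFrame({
    "CNPJ_Devedor": ["17.080.201/0001-49", "76.205.723/0001-99", "04.885.828/0001-25"]
})

# Store the first 10 characters as a new column; the CSV on disk is untouched.
# .str[:10] is shorthand for .str.slice(None, 10).
# "CNPJ_Raiz_Devedor" is a hypothetical column name chosen for this sketch.
dividas_dep["CNPJ_Raiz_Devedor"] = dividas_dep["CNPJ_Devedor"].str[:10]

# Alternative that does not rely on a fixed width: keep the text before "/"
# and strip any stray surrounding whitespace.
raiz_by_split = dividas_dep["CNPJ_Devedor"].str.split("/").str[0].str.strip()

print(dividas_dep)
```

Both variants give the same result for well-formed codes; the split-based one may be more forgiving if some rows contain extra spaces around the slash.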

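You can use the result as a boolean mask to build a new dataframe (this assumes both columns have the same length and index, as in the example):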

res = dividas_dep["CNPJ_Devedor"].str.slice(None, 10) == funrural["CNPJ_Raiz"]
funrural[res]
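
Note that == between two Series compares them row by row and requires identical index labels; pandas raises a ValueError otherwise, which would be the case for the 106-row and 8130-row dataframes in the question. A possible alternative, sketched here on made-up sample rows, is to merge on the sliced key or to filter with isin. The column name "CNPJ_Raiz_Devedor" is hypothetical.

```python
import pandas as pd

# Sample data with different lengths, mimicking the situation in the question.
dividas_dep = pd.DataFrame({
    "CNPJ_Devedor": ["17.080.201/0001-49", "76.205.723/0001-99", "04.885.828/0001-25"]
})
funrural = pd.DataFrame({
    "CNPJ_Raiz": ["04.244.173", "17.080.201", "05.006.407", "03.632.132"]
})

# Derive the 10-character key ("CNPJ_Raiz_Devedor" is a hypothetical name).
dividas_dep["CNPJ_Raiz_Devedor"] = dividas_dep["CNPJ_Devedor"].str[:10]

# Inner merge: keeps only rows whose root appears in both dataframes,
# regardless of row order or dataframe length.
compara1 = pd.merge(
    dividas_dep, funrural,
    left_on="CNPJ_Raiz_Devedor", right_on="CNPJ_Raiz",
)

# Or filter funrural to the rows whose root occurs anywhere in dividas_dep.
in_dividas = funrural[funrural["CNPJ_Raiz"].isin(dividas_dep["CNPJ_Raiz_Devedor"])]

print(compara1)
print(in_dividas)
```

The merge returns the combined columns of both dataframes for each match, while isin only filters one dataframe; which one fits depends on whether the columns from the other dataframe are needed.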