How can I search for words and phrases in pandas to create a new DataFrame?

Time: 2018-05-22 18:30:01

Tags: python pandas

In Python 3 and pandas I have this DataFrame:

bens_gerais_candidatos_2014.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6400 entries, 0 to 6399
Data columns (total 12 columns):
uf_x               6400 non-null object
cargo              6400 non-null object
nome_completo      6400 non-null object
sequencial         6400 non-null object
cpf                6400 non-null object
nome_urna          6400 non-null object
partido_eleicao    6400 non-null object
situacao           6400 non-null object
uf_y               6400 non-null object
descricao          6400 non-null object
detalhe            6400 non-null object
valor              6400 non-null float64
dtypes: float64(1), object(11)
memory usage: 650.0+ KB

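I need to select the rows where the "detalhe" column contains any of these words or phrases: "LOTE RURAL", "FAZENDA", "IMOVEL RURAL", "GLEBA", "AREA RURAL" or "AREA NO LOTEAMENTO".

At first I thought about selecting each part separately: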

mask = bens_gerais_candidatos_2014['detalhe'].str.contains("LOTE RURAL", na=False)
parte1 = bens_gerais_candidatos_2014[mask]

mask = bens_gerais_candidatos_2014['detalhe'].str.contains("FAZENDA", na=False)
parte2 = bens_gerais_candidatos_2014[mask]

And so on. Then I would combine these rows with several merges:

areas1 = pd.merge(parte1, parte2, left_on='cpf', right_on='cpf', how='outer')
areas2 = pd.merge(areas1, parte3, left_on='cpf', right_on='cpf', how='outer')

...

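Is there a simpler way to search for words and phrases and create a new DataFrame?

The result should have no duplicated rows. For example, in some rows "LOTE RURAL" appears on its own, in others "LOTE RURAL" appears together with "FAZENDA", and in others only "FAZENDA" appears. Like this: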

"LOTE RURAL 42"
"LOTE RURAL 38, DENOMINADO FAZENDA CATARINA"
"FAZENDA ÁGUA VERMELHA"

2 answers:

Answer 0: (score: 2)

I think you can do:

str_choice = "LOTE RURAL|FAZENDA|IMOVEL RURAL" 
bens_gerais_candidatos_2014[bens_gerais_candidatos_2014['detalhe'].\
                               str.contains(str_choice, na=False)]

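The | symbol in str_choice means "or", so the pattern matches any of the terms you are looking for. Add as many terms as you need, separated by |.

Answer 1: (score: 2)

You can try the following code: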
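As a minimal sketch of why this also covers the "no duplicated rows" requirement: a boolean mask marks each row at most once, however many of the terms it contains. The small DataFrame below is made up for illustration (the first three strings come from the question, the other rows are assumptions), and the names `df` and `areas` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the 'detalhe' column from the question is modeled.
df = pd.DataFrame({
    "detalhe": [
        "LOTE RURAL 42",
        "LOTE RURAL 38, DENOMINADO FAZENDA CATARINA",  # contains two of the terms
        "FAZENDA ÁGUA VERMELHA",
        "APARTAMENTO 101",                             # made-up non-matching row
        None,                                          # missing value, handled by na=False
    ]
})

# '|' is regex alternation: a row matches if it contains ANY of the terms.
str_choice = "LOTE RURAL|FAZENDA|IMOVEL RURAL"

# One boolean per row, so a row that contains two terms is still selected only once.
mask = df["detalhe"].str.contains(str_choice, na=False)
areas = df[mask]

print(areas)
```

Because the rows are filtered rather than merged, the original index and all other columns are kept as they are.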

search_list = ["LOTE RURAL","FAZENDA","IMOVEL RURAL","GLEBA","AREA RURAL","AREA NO LOTEAMENTO"]

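# Join the terms with '|' (regex "or"); na=False treats missing values as non-matches
mask = bens_gerais_candidatos_2014['detalhe'].str.contains('|'.join(search_list), na=False)
areas = bens_gerais_candidatos_2014[mask]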
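One caveat with joining a list this way: `str.contains` treats the pattern as a regular expression by default, so a term containing characters such as `.` or `(` would be read as regex syntax. A hedged sketch of a safer variant is to pass each term through `re.escape` first. The sample DataFrame, the `cpf` values, and the names `df`, `pattern`, and `areas` are assumptions made for illustration.

```python
import re

import pandas as pd

search_list = ["LOTE RURAL", "FAZENDA", "IMOVEL RURAL", "GLEBA",
               "AREA RURAL", "AREA NO LOTEAMENTO"]

# Escape each term so it is matched literally, then join with '|' (regex "or").
pattern = "|".join(re.escape(term) for term in search_list)

# Hypothetical sample data standing in for bens_gerais_candidatos_2014.
df = pd.DataFrame({
    "cpf": ["001", "002", "003", "004"],
    "detalhe": [
        "GLEBA 7",
        "CASA NA RUA A",                       # no search term, should be dropped
        "AREA NO LOTEAMENTO X",
        "LOTE RURAL 38, DENOMINADO FAZENDA CATARINA",
    ],
})

# na=False makes rows with missing 'detalhe' count as non-matches.
mask = df["detalhe"].str.contains(pattern, na=False)

# Filtering with the mask creates the new DataFrame; no merges are needed.
areas = df[mask]

print(areas)
```

If the data mixes upper and lower case, `str.contains` also accepts `case=False` to make the match case-insensitive.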