Pandas DataFrame str.match and str.contains

Time: 2020-04-21 12:40:24

Tags: python json pandas dataframe

I have a JSON file:
{
    "London": "Location A",   
    "Berlin": "Location B"  
}

I also have a dataframe with 3 columns:

Canberra is the capital of Australia            AUS  1
Berlin is the capital of Germany                GER  1
London is the capital of United Kingdom         UK   1
Berlin is also the art capital of Germany       GER  1
There is a direct flight from berlin to london  OTH  1
Interstate train service are halted             OTH  0

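I am trying to iterate over the keys of the JSON and, for each key, select all rows whose text contains that key as a whole word (exact match).

This is what I have tried so far: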

import json
import pandas as pd

temp_df = pd.read_csv(fileName , header=None)
df = (temp_df[temp_df[2] == 1]).reset_index(drop=True)
print(df)

with open(jsonFileName, encoding='utf-8') as jsonFile:
    jsonData = json.load(jsonFile)

for key in jsonData.keys():
    print(key)
    df2 = (df[df[0].str.lower().str.match(r"\b{}\b".format(key), case=False)]).reset_index(drop=True)
    print(df2.head())

And when I try using contains:

 df2 = (df[df[0].str.lower().str.contains(r"\b{}\b".format(key), regex=True, case=False)]).reset_index(drop=True)
 print(df2.head())

Expected output, for key = London:

London is the capital of United Kingdom         UK   1
There is a direct flight from berlin to london  OTH  1

However, the result it produces is doubled:

London is the capital of United Kingdom         UK   1
London is the capital of United Kingdom         UK   1
There is a direct flight from berlin to london  OTH  1
There is a direct flight from berlin to london  OTH  1

Any pointers on this would be helpful.

1 Answer:

Answer 0: (score: 1)

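It is still not entirely clear to me what you are trying to do, but it seems you are looking for case-insensitive matches within a series of strings.

Here is one way to do that using Series.str.contains.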

with open(jsonFileName, encoding='utf-8') as jsonFile:
    jsonData = json.load(jsonFile)

# convert the series of strings into lower-case
haystack = df[0].str.lower()

for key in jsonData.keys():

    # convert the key to lower-case
    needle = key.lower()

    # create a boolean indexer of any records in the haystack containing the needle
    matches = haystack.str.contains(needle)

    # create a subset of the dataframe with only those rows
    df2 = df[matches]
    print(df2)
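
Note that the loop above does a plain substring search, while the question asks for an exact (whole-word) match. A minimal sketch of a whole-word variant follows, assuming the keys should be treated as literal text; the "Londonderry" row is a hypothetical addition that is not in the original data and is only there to show the difference.

```python
import re
import pandas as pd

# Sample rows based on the question; the last row is a hypothetical extra
df = pd.DataFrame({
    0: ["Canberra is the capital of Australia",
        "London is the capital of United Kingdom",
        "There is a direct flight from berlin to london",
        "Londonderry is a city in Northern Ireland"],
    1: ["AUS", "UK", "OTH", "OTH"],
})

key = "London"

# \b marks a word boundary; re.escape protects against regex
# metacharacters that a key might contain
pattern = r"\b{}\b".format(re.escape(key))

# case=False makes the regex search case-insensitive,
# so the series does not need to be lower-cased first
whole_word = df[0].str.contains(pattern, case=False, regex=True)

# plain substring search, as in the loop above, for comparison
substring = df[0].str.lower().str.contains(key.lower(), regex=False)

print(df[whole_word])
print(df[substring])
```

The whole-word mask skips the "Londonderry" row, whereas the substring mask includes it.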

You can also use Series.apply for more customization:

    matches = haystack.apply(lambda x: needle in x)
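
As an illustration of that customization, the lambda can hold any per-string test. The sketch below assumes a whole-word check is wanted and uses a pre-compiled regular expression; the variable name `word` is mine, not from the question.

```python
import re
import pandas as pd

# lower-cased sentences, standing in for the haystack series above
haystack = pd.Series([
    "berlin is the capital of germany",
    "there is a direct flight from berlin to london",
    "interstate train service are halted",
])
needle = "london"

# compile once, then reuse the pattern for every row
word = re.compile(r"\b{}\b".format(re.escape(needle)))

# True where the needle occurs as a whole word in the string
matches = haystack.apply(lambda text: word.search(text) is not None)
print(matches)
```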

Here is the complete code with the sample data provided:

import pandas as pd

# setup the sample data objects
jsonData = {
    "London": "Location A",
    "Berlin": "Location B"
}

temp_df = pd.DataFrame([
    {0: 'Canberra is the capital of Australia', 1: 'AUS', 2: 1},
    {0: 'Berlin is the capital of Germany', 1: 'GER', 2: 1},
    {0: 'London is the capital of United Kingdom', 1: 'UK', 2: 1},
    {0: 'Berlin is also the art capital of Germany', 1: 'GER', 2: 1},
    {0: 'There is a direct flight from berlin to london', 1: 'OTH', 2: 1},
    {0: 'Interstate train service are halted', 1: 'OTH', 2: 0}
])

df = (temp_df[temp_df[2] == 1]).reset_index(drop=True)


# convert the series of strings into lower-case
haystack = df[0].str.lower()

for key in jsonData.keys():

    # convert the key to lower-case
    needle = key.lower()

    # create a boolean indexer of any records in the haystack containing the needle
    matches = haystack.str.contains(needle)

    # create a subset of the dataframe with only those rows
    df2 = df[matches]
    print(df2)

Output:

                                                0    1  2
2         London is the capital of United Kingdom   UK  1
4  There is a direct flight from berlin to london  OTH  1

                                                0    1  2
1                Berlin is the capital of Germany  GER  1
3       Berlin is also the art capital of Germany  GER  1
4  There is a direct flight from berlin to london  OTH  1
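
As a side note on the two methods named in the title: Series.str.match only matches at the beginning of each string (like re.match), while Series.str.contains searches anywhere in the string (like re.search). That may be why the str.match attempt in the question misses rows where the key is not the first word. A small sketch using two of the sample sentences:

```python
import pandas as pd

s = pd.Series([
    "London is the capital of United Kingdom",
    "There is a direct flight from berlin to london",
])
pattern = r"\blondon\b"

# str.match anchors the pattern at the start of each string
at_start = s.str.match(pattern, case=False)

# str.contains looks for the pattern anywhere in each string
anywhere = s.str.contains(pattern, case=False)

print(at_start.tolist())
print(anywhere.tolist())
```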