I have a JSON file:
{
"London": "Location A",
"Berlin": "Location B"
}
I also have a dataframe with 3 columns (text, code, flag):
Canberra is the capital of Australia AUS 1
Berlin is the capital of Germany GER 1
London is the capital of United Kingdom UK 1
Berlin is also the art capital of Germany GER 1
There is a direct flight from berlin to london OTH 1
Interstate train service are halted OTH 0
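I am trying to iterate over the keys of the JSON and, for each key, select all rows whose text contains that key as a whole word (exact match).
So far I have tried: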
import json
import pandas as pd

# keep only the rows whose flag column (index 2) equals 1
temp_df = pd.read_csv(fileName, header=None)
df = (temp_df[temp_df[2] == 1]).reset_index(drop=True)
print(df)
with open(jsonFileName, encoding='utf-8') as jsonFile:
    jsonData = json.load(jsonFile)
for key in jsonData.keys():
    print(key)
    df2 = (df[df[0].str.lower().str.match(r"\b{}\b".format(key), case=False)]).reset_index(drop=True)
    print(df2.head())
When I try using contains instead:
df2 = (df[df[0].str.lower().str.contains(r"\b{}\b".format(key), regex=True, case=False)]).reset_index(drop=True)
print(df2.head())
Expected output, for key = London:
London is the capital of United Kingdom UK 1
There is a direct flight from berlin to london OTH 1
However, the result it produces is doubled:
London is the capital of United Kingdom UK 1
London is the capital of United Kingdom UK 1
There is a direct flight from berlin to london OTH 1
There is a direct flight from berlin to london OTH 1
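Any pointers on this would be helpful.
Answer 0 (score: 1)
I'm still not entirely clear on what you are trying to do, but it seems you are looking for case-insensitive matches of a set of strings within a series.
Here is one way to do it using Series.str.contains.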
with open(jsonFileName, encoding='utf-8') as jsonFile:
    jsonData = json.load(jsonFile)
# convert the series of strings into lower-case
haystack = df[0].str.lower()
for key in jsonData.keys():
    # convert the key to lower-case
    needle = key.lower()
    # create a boolean indexer of any records in the haystack containing the needle
    matches = haystack.str.contains(needle)
    # create a subset of the dataframe with only those rows
    df2 = df[matches]
    print(df2)
You can also use Series.apply for more customization:
matches = haystack.apply(lambda x: needle in x)
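As one hedged illustration of that kind of customization: assuming "exact match" means the key must appear as a whole word, the predicate passed to Series.apply can compare against whitespace-separated tokens rather than test for a substring. The helper name `contains_word` and the sample strings below are illustrative only and not part of the original post.

```python
import pandas as pd


def contains_word(text: str, word: str) -> bool:
    """Return True if `word` is one of the whitespace-separated tokens of `text`.

    The comparison is case-insensitive. Punctuation attached to a token
    (e.g. "london.") is NOT stripped in this sketch.
    """
    return word.lower() in text.lower().split()


# illustrative data (hypothetical third row added to show a non-match)
haystack = pd.Series([
    "Berlin is the capital of Germany",
    "There is a direct flight from berlin to london",
    "Berliners enjoy art",
])

# boolean mask: True where "Berlin" occurs as a whole token
matches = haystack.apply(lambda x: contains_word(x, "Berlin"))
print(matches.tolist())
```

A plain `needle in x` test would also flag the third string, because "berlin" is a substring of "berliners"; the token comparison does not.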
Here is the complete code with the sample data provided:
import pandas as pd

# setup the sample data objects
jsonData = {
    "London": "Location A",
    "Berlin": "Location B"
}
temp_df = pd.DataFrame([
    {0: 'Canberra is the capital of Australia', 1: 'AUS', 2: 1},
    {0: 'Berlin is the capital of Germany', 1: 'GER', 2: 1},
    {0: 'London is the capital of United Kingdom', 1: 'UK', 2: 1},
    {0: 'Berlin is also the art capital of Germany', 1: 'GER', 2: 1},
    {0: 'There is a direct flight from berlin to london', 1: 'OTH', 2: 1},
    {0: 'Interstate train service are halted', 1: 'OTH', 2: 0}
])
df = (temp_df[temp_df[2] == 1]).reset_index(drop=True)
# convert the series of strings into lower-case
haystack = df[0].str.lower()
for key in jsonData.keys():
    # convert the key to lower-case
    needle = key.lower()
    # create a boolean indexer of any records in the haystack containing the needle
    matches = haystack.str.contains(needle)
    # create a subset of the dataframe with only those rows
    df2 = df[matches]
    print(df2)
Output:
0 1 2
2 London is the capital of United Kingdom UK 1
4 There is a direct flight from berlin to london OTH 1
0 1 2
1 Berlin is the capital of Germany GER 1
3 Berlin is also the art capital of Germany GER 1
4 There is a direct flight from berlin to london OTH 1
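The question asked for whole-word matches, and a plain substring test would also match a key embedded in a longer word. A minimal sketch, assuming whole-word, case-insensitive matching is what is wanted: escape each key with `re.escape`, wrap it in `\b` word boundaries, and pass `case=False` to Series.str.contains. Note that Series.str.match only matches at the start of each string, so it cannot find a key in the middle of a sentence. The `results` dict and the "Londonderry" row are hypothetical additions for illustration.

```python
import re

import pandas as pd

json_data = {"London": "Location A", "Berlin": "Location B"}

df = pd.DataFrame({
    0: [
        "Berlin is the capital of Germany",
        "London is the capital of United Kingdom",
        "There is a direct flight from berlin to london",
        "Londonderry is a city in Northern Ireland",  # hypothetical row
    ],
    1: ["GER", "UK", "OTH", "OTH"],
})

# collect one filtered dataframe per key (hypothetical container)
results = {}
for key in json_data:
    # \b anchors the key at word boundaries; re.escape guards against
    # keys that happen to contain regex metacharacters
    pattern = rf"\b{re.escape(key)}\b"
    mask = df[0].str.contains(pattern, case=False, regex=True)
    results[key] = df[mask].reset_index(drop=True)

print(results["London"][0].tolist())
```

With `case=False` there is no need to lower-case the column or the key first, and the "Londonderry" row is excluded because no word boundary follows "London" inside it.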