Question

I have two dataframes and want to merge df1 with df2, where df1 contains a single URL and df2 contains a list of URLs.

The shapes of df1 and df2 differ.

Example:

df1 = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16']})
df2 = pd.DataFrame({'urls': ['[https://www.example.jp/pro/minoya, http://www.example.jp/pro/tokyo_kankan, http://www.example.jp/pro/briansawazakiphotography, http://www.example.jp/pro/r_masuda, http://www.example.jp/pro/sanada16, ......]']})

I want to join the dataframes when the http://www.example.jp/pro/sanada16 in df1.url is present in df2.urls.

I considered splitting the list into columns, but the number of URLs in df2.urls is not fixed.

I tried to put the substring of df2.urls that matches df1.url into a new column so that I could join on that column, but it doesn't work:

df2['match'] = df2['urls'].apply(lambda x: x if x in df1['url'])

Expected output:

new_df = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16'], 'urls': ['[https://www.example.jp/pro/minoya, http://www.example.jp/pro/tokyo_kankan, http://www.example.jp/pro/briansawazakiphotography, http://www.example.jp/pro/r_masuda, http://www.example.jp/pro/sanada16, ......]']})

I can do this in PostgreSQL.

Answer 1

If I understand correctly, here is one way to do it. You can loop over the patterns you want to search for and store the matches with df.at.

import pandas as pd

data_1 = pd.DataFrame(
    {
        'url': ['http://www.ex.jp', 'http://www.ex.com']
    }
)

data_2 = pd.DataFrame(
    {
        'url': ['http://www.ex.jp/pro', 'http://www.ex.jp/pro/test', 'http://www.ex.com/path', 'http://www.ex.com/home']
    }
)

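# Collect one row per pattern together with the list of URLs that contain it.
result = pd.DataFrame(columns=['pattern', 'matches'])

for i in range(data_1.shape[0]):
    # The pattern to search for.
    result.loc[i, 'pattern'] = data_1.loc[i, 'url']
    # .at sets a single cell, so the whole list is stored as one value.
    result.at[i, 'matches'] = [j for j in data_2['url'] if data_1.loc[i, 'url'] in j]

print(result)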

This gives:

             pattern                                            matches
0   http://www.ex.jp  [http://www.ex.jp/pro, http://www.ex.jp/pro/test]
1  http://www.ex.com   [http://www.ex.com/path, http://www.ex.com/home]
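The same substring test can also be applied to the layout in the question, where each df2.urls cell is one long string. The following is only a sketch under some assumptions: the df2 rows are shortened, made-up stand-ins for the question's data, and how='cross' requires pandas 1.2 or newer.

```python
import pandas as pd

df1 = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16']})
# Hypothetical, shortened df2: each cell is a single string holding several URLs.
df2 = pd.DataFrame({'urls': [
    '[https://www.example.jp/pro/minoya, http://www.example.jp/pro/sanada16]',
    '[http://www.example.jp/pro/tokyo_kankan]',
]})

# Pair every df1 row with every df2 row (needs pandas 1.2 or newer).
pairs = df1.merge(df2, how='cross')

# Keep only the pairs where the single URL occurs inside the URL string.
mask = [url in urls for url, urls in zip(pairs['url'], pairs['urls'])]
joined = pairs[mask].reset_index(drop=True)

print(joined)
```

A cross join builds len(df1) * len(df2) rows before filtering, so this is probably only practical for small to medium frames.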

Kudos for updating the question as requested.
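A plain substring test would also match a longer URL such as a hypothetical http://www.example.jp/pro/sanada160. If exact matches are wanted, one alternative is to split the string into individual URLs and use an ordinary merge. This sketch assumes the URLs are separated by a comma and a space and wrapped in square brackets, as in the question's example; the second df2 row is invented to show the difference.

```python
import pandas as pd

df1 = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16']})
# Hypothetical data: the second row contains a longer URL that must not match.
df2 = pd.DataFrame({'urls': [
    '[https://www.example.jp/pro/minoya, http://www.example.jp/pro/sanada16]',
    '[http://www.example.jp/pro/sanada160]',
]})

# Turn "[a, b, c]" into the Python list ['a', 'b', 'c'].
as_lists = df2['urls'].str.strip('[]').str.split(', ')

# One row per individual URL; the original string is kept alongside it.
exploded = df2.assign(url=as_lists).explode('url')

# An ordinary equality join now finds the rows whose list contains df1's URL.
joined = df1.merge(exploded, on='url', how='inner')

print(joined)
```

Because the join is on equality, only the first df2 row is returned here.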

Pandas: join two dataframes if a string in df2 contains a substring from df1
