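How do I iterate over a pandas DataFrame column and check whether each string contains any string from a column of a separate pandas DataFrame?

Asked: 2019-12-29 18:44:38

Tags: python string pandas dataframe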

I have two pandas DataFrames in Python. DF A contains a column of sentence-length strings.

|---------------------|------------------|
|        sentenceCol  |    other column  |
|---------------------|------------------|
|'this is from france'|         15       |
|---------------------|------------------|

DF B contains a column listing countries.

|---------------------|------------------|
|        country      |    other column  |
|---------------------|------------------|
|'france'             |         33       |
|---------------------|------------------|
|'spain'              |         34       |
|---------------------|------------------|

How can I iterate over DF A and assign the country that each string contains? This is how I imagine DF A would look after the assignment:

|---------------------|------------------|-----------|
|        sentenceCol  |    other column  | country   |
|---------------------|------------------|-----------|
|'this is from france'|         15       |  'france' |
|---------------------|------------------|-----------|

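An additional complication is that a sentence can contain more than one country, so ideally every applicable country would be assigned to that sentence (one row per match).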

|-------------------------------|------------------|-----------|
|        sentenceCol            |    other column  | country   |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'france' |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'spain'  |
|-------------------------------|------------------|-----------|

1 answer:

Answer 0 (score: 0)

You can iterate over a DataFrame with the iterrows() method. You can try the following:

import pandas as pd

# Dataframes definition
df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})


# Collect the matching rows in a list, then build the new dataframe once
rows = []

# Iterate through the dataframes, first through the country dataframe and inside through the sentence one.
for _, row in df_2.iterrows():
    country = row.country

    for _, row_2 in df_1.iterrows():
        # Plain substring test: keep the sentence if it mentions the country
        if country in row_2.sentence:
            rows.append((row_2.sentence, row_2.other, country))

df_3 = pd.DataFrame(rows, columns=["sentence", "other_column", "country"])

So the output is:

    sentence                        other_column    country
0   this is from france and spain   15              spain
1   this is from france and spain   15              france
2   this is from france             12              france
3   this is from germany            33              germany
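
The nested iterrows() loops above compare every country with every sentence in Python, which may become slow on larger frames. A possible alternative, offered as a sketch and not as the answer author's method, is to build one regular expression from the country column and let pandas find the matches. It assumes pandas 0.25 or later (for DataFrame.explode) and the same substring matching as the `in` test above; the names `pattern` and `result` are chosen here for illustration only.

```python
import re

import pandas as pd

# Same sample data as in the answer above.
df_1 = pd.DataFrame({
    "sentence": ["this is from france and spain", "this is from france", "this is from germany"],
    "other": [15, 12, 33],
})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})

# One alternation pattern built from the country column; re.escape guards
# against regex metacharacters in the names.
pattern = "|".join(re.escape(c) for c in df_2["country"])

# str.findall gives a list of matched countries per sentence; explode turns
# each list element into its own row. Sentences without any match end up
# with NaN in "country" and are dropped.
result = (
    df_1.assign(country=df_1["sentence"].str.findall(pattern))
    .explode("country")
    .dropna(subset=["country"])
    .reset_index(drop=True)
)

print(result)
```

Unlike the loop version, the rows come out grouped by sentence, with countries in the order they appear in the text. A country mentioned twice in one sentence would produce two rows, and country names that are prefixes of one another (for example "niger" and "nigeria") would need word boundaries or a different approach.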