Question

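DataFrame: click to view the screenshot (I'm new here and need 10 reputation to embed images).

The DataFrame is imported from a CSV file. 'type' and 'theme' are attributes of each item. 'tags' is a long string column containing each item's tags in mixed (random) order, separated by ','. Basically, I need to check whether the 'tags' column contains the correct theme tag (col_{theme}) and, if it does not, add it to the 'tags' column.

For example:

Item 8: there is a 'col_t3' in the 'tags' column and its theme is 't3'. That is correct, so we skip it.

Item 1: there is a 'col_t1' in the 'tags' column, but its actual theme is 't2', so I need to replace 'col_t1' with 'col_t2' and leave the other tags in the same column unchanged.

Items 2 and 5: there is no 'col_{theme}' tag in the 'tags' column, so I add 'col_t1' and 'col_t5' to their 'tags', respectively.

Please help!!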

Answer 1

This simulates the input shown in your screenshot:

import pandas as pd
import numpy as np

# Column order is fixed explicitly so the printout matches the output shown
df = pd.DataFrame({"type": ["a", "c", "d", "a", "b", "a", "a", "c"],
                   "tags": ["col_t1, col_red, large", np.nan, "col_t2, col_black, small",
                            "col_t4, large, col_yellow", "col_gold, col_fancy,", "col_t1, thick, col_k",
                            np.nan, "col_t3, fancy, red"],
                   "theme": ["t2", "t1", "t2", "t3", "t2", "t1", np.nan, "t3"]},
                  columns=["tags", "theme", "type"])

# Number the items from 1 instead of 0
df.set_index(np.arange(1, len(df) + 1), inplace=True)
print(df)

Output:

                      tags theme type
1     col_t1, col_red, large    t2    a
2                        NaN    t1    c
3   col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow    t3    a
5       col_gold, col_fancy,    t2    b
6       col_t1, thick, col_k    t1    a
7                        NaN   NaN    a
8         col_t3, fancy, red    t3    c

Code that produces the desired output:

prefix = "col_"

# Iterate over rows with non-empty theme
for row in df[df["theme"].notnull()].itertuples():

    if pd.isnull(row.tags):
        # Replace NaN in tags column with a single tag from theme column 
        df.loc[row.Index, "tags"] = prefix + row.theme
    else:
        # Extract existing prefixed tags, stripping whitespace and the prefix
        inferred_tags = [t.strip()[len(prefix):] for t in row.tags.split(",")
                         if t.strip().startswith(prefix)]

        if row.theme not in inferred_tags:
            # Append the missing theme tag, dropping any trailing comma first
            df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme

print(df)

Output:

                                tags theme type
1     col_t1, col_red, large, col_t2    t2    a
2                             col_t1    t1    c
3           col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow, col_t3    t3    a
5        col_gold, col_fancy, col_t2    t2    b
6               col_t1, thick, col_k    t1    a
7                                NaN   NaN    a
8                 col_t3, fancy, red    t3    c

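Hopefully this is what you were looking for. Reportedly, itertuples() is faster than iterrows() for iterating over rows. Also note that I used numpy, specifically np.nan, to simulate the NaN values in the input; if your data comes from a CSV file, you do not need numpy.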
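To illustrate the point about CSV input, here is a minimal sketch, assuming a small in-memory CSV (the `csv_text` contents are made up for illustration and are not the asker's file). Empty fields come back from `pd.read_csv` as NaN, so `pd.isnull()` and `notnull()` work without importing numpy:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the asker's file
csv_text = """type,tags,theme
a,"col_t1, col_red, large",t2
c,,t1
a,,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN automatically
print(df["tags"].isnull().tolist())   # [False, True, True]
print(df["theme"].isnull().tolist())  # [False, False, True]
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.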

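--- Update ---

As mentioned in the comments, the code should replace an existing theme tag that does not match the item's theme. Here is the updated solution: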

prefix = "col_"

# Find all unique themes (notnull() excludes nan from the list)
themes = df[df["theme"].notnull()]["theme"].unique()

# Add prefix to all themes for comparison with tags; build a set for fast lookup
prefixed_themes = {prefix + t for t in themes}

# Iterate over rows with non-empty theme
for row in df[df["theme"].notnull()].itertuples():

    if pd.isnull(row.tags):
        # Replace NaN in tags column with a single tag from theme column 
        df.loc[row.Index, "tags"] = prefix + row.theme
    else:
        # Extract existing tags with prefix (do not remove prefix; remove all spaces)
        inferred_tags = row.tags.replace(" ", "").split(",")

        # Use sets to check if there is any intersection between tags and themes
        if len(set(inferred_tags).intersection(prefixed_themes)) > 0:

            # Iterate over inferred_tags to find and replace matches with themes 
            for idx, t in enumerate(inferred_tags):
                if t in prefixed_themes:
                    inferred_tags[idx] = prefix + row.theme

            df.loc[row.Index, "tags"] = ", ".join(inferred_tags) 
        else:
            # In this case, add theme to tags (no replacement)
            df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme 

print(df)

Output:

                                tags theme type
1             col_t2, col_red, large    t2    a
2                             col_t1    t1    c
3           col_t2, col_black, small    t2    d
4  col_t4, large, col_yellow, col_t3    t3    a
5        col_gold, col_fancy, col_t2    t2    b
6               col_t1, thick, col_k    t1    a
7                                NaN   NaN    a
8                 col_t3, fancy, red    t3    c

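Note that the code checks tags against all values in the theme column (with the prefix added). If a value such as t4 does not appear in the theme column, it is not treated as a legitimate theme tag, so col_t4 in item 4 is not replaced during processing. If you need to replace every col_t*, you need to specify that. I hope this is a useful solution that you can build on.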
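If every col_t* tag should count as a theme tag, one possible variant is sketched below. It assumes that a theme tag is always `col_t` followed by digits; that pattern and the helper name `fix_theme_tag` are illustrative assumptions, not something stated in the question:

```python
import re

# Assumption: a theme tag is "col_t" followed by digits (e.g. col_t1, col_t42)
THEME_TAG = re.compile(r"col_t\d+")


def fix_theme_tag(tags, theme, prefix="col_"):
    """Return the tag string with the theme tag corrected or appended.

    `tags` is a comma-separated string (or None/NaN); `theme` is e.g. "t2".
    """
    wanted = prefix + theme
    # Treat None/NaN as "no tags yet"
    if not isinstance(tags, str):
        tags = ""
    # Split on commas and drop empty pieces left by trailing commas
    items = [t.strip() for t in tags.split(",") if t.strip()]
    if any(THEME_TAG.fullmatch(t) for t in items):
        # Replace every theme-like tag with the correct one, keeping positions
        items = [wanted if THEME_TAG.fullmatch(t) else t for t in items]
    else:
        # No theme tag present: append it at the end
        items.append(wanted)
    return ", ".join(items)


print(fix_theme_tag("col_t4, large, col_yellow", "t3"))  # col_t3, large, col_yellow
print(fix_theme_tag("col_gold, col_fancy,", "t2"))       # col_gold, col_fancy, col_t2
print(fix_theme_tag(None, "t1"))                         # col_t1
```

Unlike the loop above, this would rewrite col_t4 in item 4 to col_t3. On a DataFrame, the helper could be applied to the rows with a non-null theme, for example row by row inside the same itertuples() loop.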

Python DataFrame: replace part of a string in a column based on a condition in another column
