Pandas groupby apply/transform with a custom function that takes arguments

Date: 2021-05-19 16:38:25

Tags: python pandas dataframe pandas-groupby

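I'm doing some NLP work, and I'm trying to use groupby to run a POST request inside a lambda function. The request returns a JSON object, but unfortunately the result column ends up as NaN. I need to add the returned fields to the DataFrame as new columns after "exploding" them.

Custom function: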

def posTagger(text):
    post = { "text": title }
    endpoint = 'http://localhost:8001/api/postagger'
    r = requests.post(endpoint, json=post)
    r = r.json()
    time.sleep(1)
    return {"title": title, "result": r}


posTagger return value:

[
    {
        "text": "Contemporary Modern Soft Area Rugs Nonslip",
        "terms": [
            {
                "text": "Contemporary",
                "penn": "JJ",
                "tags": [
                    "Adjective"
                ]
            },
            {
                "text": "Modern",
                "penn": "NNP",
                "tags": [
                    "ProperNoun",
                    "Noun",
                    "Singular"
                ]
            },
            {
                "text": "Soft",
                "penn": "NNP",
                "tags": [
                    "ProperNoun",
                    "Noun",
                    "Singular"
                ]
            },
            {
                "text": "Area",
                "penn": "NN",
                "tags": [
                    "Singular",
                    "Noun",
                    "ProperNoun"
                ]
            },
            {
                "text": "Rugs",
                "penn": "NNP",
                "tags": [
                    "ProperNoun",
                    "Noun",
                    "Plural"
                ]
            },
            {
                "text": "Nonslip",
                "penn": "NNP",
                "tags": [
                    "ProperNoun",
                    "Noun",
                    "Singular"
                ]
            }
        ]
    }
]

DataFrame:

title = [
    'Contemporary Modern Soft Area Rugs Nonslip Velvet Home Room Carpet Floor Mat Rug', 
    'Traditional Distressed Area Rug 8x10 Large Rugs for Living Room 5x8 Gray Ivory', 
    'Shaggy Area Rugs Fluffy Tie-Dye Floor Soft Carpet Living Room Bedroom Large Rug'
    ]
df = pd.DataFrame(title, columns=['title'])
df

# Initial dataframe:

# title
# 0 Contemporary Modern Soft Area Rugs Nonslip...
# 1 Traditional Distressed Area Rug 8x10 Large...
# 2 Shaggy Area Rugs Fluffy Tie-Dye Floor Soft...

So, here is my groupby using .apply:

df['result'] = pd.DataFrame(df.groupby(['title']).apply(lambda x: posTagger(x)))
df

# Resulting DataFrame after **.apply**:

#   title   result
# 0 Contemporary Modern Soft Area Rugs Nonslip Vel...   NaN
# 1 Traditional Distressed Area Rug 8x10 Large Rug...   NaN
# 2 Shaggy Area Rugs Fluffy Tie-Dye Floor Soft Car...   NaN

And here is my groupby using .transform:

df['result'] = pd.DataFrame(df.groupby(['title']).transform(lambda x: posTagger(x)))
df

# Resulting DataFrame after **.transform**:

# title result
# 0 Contemporary Modern Soft Area Rugs Nonslip Vel...   {'title': ['Contemporary Modern Soft Area Rugs...
# 1 Traditional Distressed Area Rug 8x10 Large Rug...   {'title': ['Contemporary Modern Soft Area Rugs...
# 2 Shaggy Area Rugs Fluffy Tie-Dye Floor Soft Car...   {'title': ['Contemporary Modern Soft Area Rugs...

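Note that the .transform result contains the same value on every row. Why?

  1. How can I take the return value of the custom function (an object with nested arrays) and add it, in exploded form, to the same DataFrame as new columns?
  2. Is it better to use .apply or .transform to achieve this?

1 Answer:

Answer 0: (score: 1)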

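I'll discuss apply() here; there are a few caveats you need to consider.

To get the result your current function produces (i.e., a dictionary), you can keep the function essentially as written and change the code that calls it. Unless some titles are identical, you aren't really grouping by title, so just use apply() without groupby(). This won't explode the dictionary. There are many ways to think about this.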

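import time
import requests

def posTagger(text):
    # Build the payload from the argument; the global `title` list would send every title on each call
    post = { "text": text }
    endpoint = 'http://localhost:8001/api/postagger'
    r = requests.post(endpoint, json=post)
    r = r.json()
    time.sleep(1)
    return {"title": text, "result": r}

# Apply to the title column so the function is called once per title string
df['result'] = df['title'].apply(posTagger)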
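
Since the apply() call above leaves the nested response unexploded, here is a minimal sketch of one way to turn the "terms" array into one row per term. It assumes the response has the shape shown in the question (a list holding one object with "text" and "terms"). `fake_pos_tagger` is a hypothetical offline stand-in for the HTTP call, and the tag values it returns are placeholders, not real tagger output.

```python
import pandas as pd

def fake_pos_tagger(text):
    # Hypothetical stub replacing the POST request; mimics the response shape from the question
    return [{
        "text": text,
        "terms": [{"text": w, "penn": "NN", "tags": ["Noun"]} for w in text.split()],
    }]

df = pd.DataFrame({"title": ["Soft Area Rugs", "Large Gray Rug"]})

# One call per row, no groupby needed
df["result"] = df["title"].apply(fake_pos_tagger)

# The response is a one-element list; pull out its "terms" list
df["terms"] = df["result"].apply(lambda r: r[0]["terms"])

# explode() produces one row per term dict
exploded = df[["title", "terms"]].explode("terms").reset_index(drop=True)

# json_normalize() spreads each term dict into its own columns
term_cols = pd.json_normalize(exploded["terms"].tolist())
out = pd.concat([exploded[["title"]], term_cols], axis=1)
print(out)
```

The "tags" column still holds a list per term; a second explode("tags") would flatten it further if one row per tag is wanted.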

Now, if you really do want to use groupby().apply(), pass each DataFrame group in as x, operate on it, and return x. This is untested, but it is one way to think about the problem.

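def posTagger(x):
    # x is the sub-DataFrame for one group, so every row in it shares one title.
    # Assumes the 'title' column is still present in x (this depends on the pandas version).
    title = x['title'].iloc[0]
    post = { "text": title }
    endpoint = 'http://localhost:8001/api/postagger'
    r = requests.post(endpoint, json=post)
    r = r.json()
    time.sleep(1)
    x = x.copy()
    x['result'] = [{"title": title, "result": r}] * len(x)  # same payload on each row of the group
    # or you may be able to code in the explode here using something like
    # dftemp = pd.DataFrame({"title": title, "result": r})
    # then merging: x = x.merge(dftemp)
    # not tested at all, but this would return x to the original dataframe
    return x

df = df.groupby(['title'], group_keys=False).apply(lambda x: posTagger(x))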
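
If the only reason for the groupby is to avoid sending a duplicated title twice, a different technique may be simpler: call the tagger once per unique title and map the results back onto the rows. This is a sketch of that alternative, not the groupby().apply() approach above. `fake_pos_tagger` and the `calls` list are hypothetical names used so the example runs without the service.

```python
import pandas as pd

calls = []  # records each simulated request so the call count can be checked

def fake_pos_tagger(text):
    # Hypothetical stub replacing the POST request to the tagging endpoint
    calls.append(text)
    return {"title": text, "result": [{"text": text, "terms": []}]}

df = pd.DataFrame({"title": ["Soft Rug", "Gray Rug", "Soft Rug"]})

# One simulated request per distinct title
lookup = {t: fake_pos_tagger(t) for t in df["title"].unique()}

# Map each row's title back to its cached response
df["result"] = df["title"].map(lookup.get)

print(len(calls))
print(df)
```

The row order and index of df are untouched, which avoids the index misalignment that can leave a column full of NaN when a groupby result is assigned back.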