R unnest_tokens() vs. Python pandas str.split()

Date: 2020-06-22 10:00:46

Tags: python r regex pandas tidytext

I want to use Python pandas to reproduce a result like df_long below. Here is the R code:

library(dplyr)     # provides %>%
library(tidytext)  # provides unnest_tokens()

df <- data.frame("id" = 1, "author" = 'trump', "Tweet" = "RT @kin2souls: @KimStrassel Anyone that votes")

unnest_regex  <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

df_long <- df %>%
  unnest_tokens(
    word, Tweet, token = "regex", pattern = unnest_regex)

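If I understand correctly, unnest_regex is written so that it also matches digits (along with whitespace and a few punctuation marks). I don't understand why R treats a digit inside a string such as "@kin2souls" as not matching the pattern, so that df_long ends up with @kin2souls as a single token. However, when I try to replicate this in pandas: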

unnest_regex = r"([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

df = df.assign(word=df['Tweet'].str.split(unnest_regex)).explode('word')
df.drop("Tweet", axis=1, inplace=True)

it splits the string "@kin2souls" into "@kin" and "souls" as separate rows. Also, because unnest_regex uses capturing parentheses, in Python I changed it to:

unnest_regex = r"[^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@])"

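This avoids empty strings. I wondered whether that was also a contributing factor, but the split at "2" still happens. Can anyone suggest a solution in Python, and perhaps explain why R behaves this way? Thanks!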

Here is the data in Python:

import pandas as pd

data = {'id':[1], "author":["trump"], "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]}
df = pd.DataFrame.from_dict(data)

Expected result:

data_long = {'id':[1,1,1,1,1,1], "author":["trump","trump","trump","trump","trump","trump"], "word": ["rt", "@kin2souls", "@kimstrassel", "anyone", "that", "votes"]}
df_long = pd.DataFrame.from_dict(data_long)

1 Answer:

Answer 0: (score: 0)

A combination of str.split and explode should replicate your output:

(df
 .assign(Tweet=df.Tweet.str.lower().str.split(r"[:\s]"))
 .explode("Tweet")
 .query('Tweet != ""')
 .reset_index(drop=True)
)

   id author         Tweet
0   1  trump            rt
1   1  trump    @kin2souls
2   1  trump  @kimstrassel
3   1  trump        anyone
4   1  trump          that
5   1  trump         votes

I took advantage of the fact that the text is delimited by whitespace, with the occasional colon.

Alternatively, you could use str.extractall, although I find it a bit longer:

(df.set_index(["id", "author"])
 .Tweet.str.lower()
 .str.extractall(r"\s*([a-z@\d]+)[:\s]*")
 .droplevel(-1)
 .rename(columns={0: "Tweet"})
 .reset_index()
)

I'm not sure how unnest_tokens works with the regex; maybe someone else can address that.
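On the open question of why R and Python disagree, one likely explanation (not part of the original answer) is string escaping rather than a difference in the regex engines. In an ordinary R string, "\\d" is the regex token \d, so digits belong to the set of token characters. In a Python raw string, r"\\d" is an escaped backslash followed by a literal d, so digits are no longer excluded from the separator class and "2" becomes a split point. The sketch below uses only the standard `re` module to compare the two; the names `python_as_posted`, `r_equivalent`, and `tokenize` are made up for illustration.

```python
import re

# Pattern as posted in the question's Python code. Inside a raw string,
# "\\d" means "a literal backslash, then the letter d", so digits are
# treated as separators.
python_as_posted = r"[^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@])"

# What the R string literal denotes after R unescapes it: "\\d" in R
# source becomes the regex token \d, so digits are kept inside tokens.
r_equivalent = r"[^A-Za-z_\d#@']|'(?![A-Za-z_\d#@])"

tweet = "RT @kin2souls: @KimStrassel Anyone that votes"


def tokenize(text, pattern):
    """Lowercase the text, split on the separator pattern, drop empty strings."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]


# Splits "@kin2souls" at the digit, as observed in the question.
print(tokenize(tweet, python_as_posted))

# Keeps "@kin2souls" whole, matching the expected df_long words.
print(tokenize(tweet, r_equivalent))
```

If this reading is right, passing the single-backslash pattern to `str.split` after `str.lower()`, then calling `explode` and filtering out empty strings, should give the expected df_long in pandas as well.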