I want to use Python pandas to reproduce a result like df_long below. Here is the R code:
library(dplyr)
library(tidytext)

df <- data.frame("id" = 1, "author" = 'trump', "Tweet" = "RT @kin2souls: @KimStrassel Anyone that votes")
unnest_regex <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
df_long <- df %>%
  unnest_tokens(
    word, Tweet, token = "regex", pattern = unnest_regex)
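If I understand correctly, unnest_regex is written so that it also matches digits (in addition to whitespace and a few punctuation characters). I don't understand why R treats a digit inside a string such as "@kin2souls" as not matching the split pattern, so that @kin2souls ends up as a single row in df_long. However, when I try to replicate this in pandas: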
unnest_regex = r"([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
df_long = df.assign(word=df['Tweet'].str.split(unnest_regex)).explode('word')
df_long.drop("Tweet", axis=1, inplace=True)
it splits the "@kin2souls" string into "@kin" and "souls" as separate rows. In addition, because unnest_regex uses a capturing group, in Python I changed it to:
unnest_regex = r"[^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@])"
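This avoids empty strings in the result. I wondered whether the capturing group was a contributing factor as well, but the split at "2" still happens. Can anyone suggest a solution in Python, and possibly explain why R behaves this way? Thanks!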
Here is the data in Python:
import pandas as pd

data = {'id':[1], "author":["trump"], "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"]}
df = pd.DataFrame.from_dict(data)
Expected result:
data_long = {'id':[1,1,1,1,1,1], "author":["trump","trump","trump","trump","trump","trump"], "word": ["rt", "@kin2souls", "@kimstrassel", "anyone", "that", "votes"]}
df_long = pd.DataFrame.from_dict(data_long)
Answer 0 (score: 0)
I took advantage of the fact that the text is separated by whitespace, and occasionally by a colon:
(df
.assign(Tweet=df.Tweet.str.lower().str.split(r"[:\s]"))
.explode("Tweet")
.query('Tweet != ""')
.reset_index(drop=True)
)
   id author         Tweet
0   1  trump            rt
1   1  trump    @kin2souls
2   1  trump  @kimstrassel
3   1  trump        anyone
4   1  trump          that
5   1  trump         votes
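Alternatively, you can use str.extractall, although I find it a bit longer:
(
df.set_index(["id", "author"])
.Tweet.str.lower()
.str.extractall(r"\s*([a-z@\d]+)[:\s]*")
.droplevel(-1)
.rename(columns={0: "Tweet"})
.reset_index()
)
I am not sure how to make this work with the original regex; perhaps someone else can resolve that.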
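On the open question about the original regex, one likely explanation is string escaping rather than a difference between the regex engines. In an ordinary R string, "\\d" is an escaped backslash followed by d, so the regex engine receives \d (a digit). In a Python raw string, r"\\d" is passed through unchanged, and inside a character class it means a literal backslash plus the letter d, so "2" is no longer excluded and becomes a split point. Below is a minimal sketch assuming that is the only issue. The names broken and fixed are just illustrative. The lowercasing and the empty-string filter are added by hand to imitate what unnest_tokens appears to do by default.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1],
        "author": ["trump"],
        "Tweet": ["RT @kin2souls: @KimStrassel Anyone that votes"],
    }
)

# Pattern copied from the question: in a raw string, \\d inside [...] is a
# literal backslash and the letter "d", not the digit class.
broken = r"[^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@])"

# Same pattern with a single backslash, so \d means "any digit".
fixed = r"[^A-Za-z_\d#@']|'(?![A-Za-z_\d#@])"

broken_tokens = re.split(broken, "@kin2souls")  # splits at the "2"
fixed_tokens = re.split(fixed, "@kin2souls")    # keeps the handle whole

df_long = (
    df.assign(word=df["Tweet"].str.lower().str.split(fixed))
    .explode("word")
    # ": " produces an empty token between the colon and the space
    .loc[lambda d: d["word"] != ""]
    .drop(columns="Tweet")
    .reset_index(drop=True)
)

print(broken_tokens)
print(fixed_tokens)
print(df_long)
```

With the single backslash, the word column should match the expected result from the question. The capturing parentheses are left out here because a capturing group makes re.split return the separators as well.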