Question

I have the following string:

"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"

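I have collected many tweets like this one and stored them in a data frame. How can I clean those rows by removing the "hhhhhhhhhhhhhhhhhh" while keeping the rest of the string in each row?

I will also be using CountVectorizer later, and without this cleanup much of the vocabulary ends up containing 'hhhhhhhhhhhhhhhhhhhhhh'.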

Answer 1

Use a regular expression.

For example:

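import pandas as pd

df = pd.DataFrame({"Col": ["hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh", "Hello World"]})
# Alternative: remove a standalone run of one repeated character, leaving the space before it
#df["Col"] = df["Col"].str.replace(r"\b(.)\1+\b", "", regex=True)
# Remove whitespace followed by a run of one repeated character (regex=True is required in pandas 2.0+)
df["Col"] = df["Col"].str.replace(r"\s+(.)\1+\b", "", regex=True).str.strip()
print(df)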

Output:

                                            Col
0  hello, I'm going to eat to the fullest today
1                                   Hello World
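To see what the pattern in this answer does and does not catch, here is a minimal sketch using only the standard `re` module instead of pandas. The extra sample strings are made up for illustration. It suggests one caveat: any whitespace-preceded run of a single repeated character is removed, including short ones such as a two-digit number like "11".

```python
import re

# Same pattern as in the answer: whitespace, any one character, then that
# same character repeated one or more times, ending at a word boundary.
pattern = re.compile(r"\s+(.)\1+\b")

# The first two strings come from the answer; "room 11" is an assumed
# sample chosen to show a possible false positive.
samples = [
    "hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh",
    "Hello World",
    "room 11",
]

cleaned = [pattern.sub("", s).strip() for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

If short runs such as "11" should survive, one option is to require a minimum run length with a quantifier like `\1{3,}` in place of `\1+`.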

Answer 2

You can try this:

df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)

The number in the braces sets the minimum number of repeated characters to match; in my case it is 4.

Before:

                                        Col
0  hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1                               Hello World

After:

                     Col
0  hello, I'm today hh  
1            Hello World

Since you mentioned that you are working with tweets, I used a unicode string for the pattern.
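The pattern above only targets the letter h and leaves the surrounding spaces behind. As a rough sketch of one way to generalize it, the hypothetical helper below (`strip_long_runs` and its `min_run` parameter are names invented here, not part of the answer) removes any non-space character repeated at least `min_run` times and then collapses the leftover whitespace.

```python
import re

def strip_long_runs(text: str, min_run: int = 4) -> str:
    """Hypothetical helper: drop runs of one repeated non-space character."""
    # (\S) captures one non-space character; \1{n,} requires at least n more
    # copies, so n = min_run - 1 gives a total run length of at least min_run.
    run = re.compile(r"(\S)\1{%d,}" % (min_run - 1))
    without_runs = run.sub("", text)
    # Collapse the gaps left where runs were removed and trim the ends.
    return re.sub(r"\s+", " ", without_runs).strip()

result = strip_long_runs("hello, I'm today hh hhhh hhhhhhhhhhhhhhh")
print(repr(result))
```

Assuming the data frame from the answers, this could be applied per row with `df["Col"] = df["Col"].map(strip_long_runs)`. Note that it also removes long runs inside words, which may or may not be wanted before vectorizing.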

How do I remove repeated letters in a data frame?
