Question

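Hello dear programmers, social media comments contain a lot of casual language characterized by many repeated characters. An example is "Helloooooo!". For my analysis, I want to remove repeated letters beyond 2 and replace them with exactly 2 letters. Our example would become "Helloo!". I found a regular expression that does this. But it also reduces my row count from 500,000 to 450,000. Some lines now contain several tweets instead of just one.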

Example of a broken line (the following text should be 3 lines in the output file, not 1):

z .. :)"

"USERNAME Am Wochenende gabs das halt für 10 und das DLC für 2,50. Und da das Guthaben hier rumfliegt.. hab ich zugeschlagen :D"

"Wenn das keine #Leseempfehlung ist! Vielen Dank. :) #krimi #sauerland #lesen #lesetipp #rezension URL

Processing code:

# Repeating letters are limited to 2
# Error: the output file loses 50,000 rows. Why?
import re

with open("C:/Users/M/PycharmProjects/Bachelor_Thesis/test/data_feat2.csv", "r", encoding="utf-8") as oldfile1, \
        open("data_feat3.csv", "w", encoding="utf-8") as newfile1:
    for line in oldfile1:
        # Collapse any run of the same character to exactly two occurrences
        line = re.sub(r'(.)\1+', r'\1\1', line)
        newfile1.write(line)

Answer 1

There may be repeated commas in the data. Are they escaped? Try searching your CSV for them.
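
To see why repeated delimiters matter, here is a small sketch of what the pattern from the question does to characters that are not letters. It assumes the file uses the usual CSV convention where a literal quote inside a quoted field is written as `""`; the helper name `collapse` and the sample strings are made up for illustration. Because `(.)` matches any character, runs of commas and quotes get shortened too, which can change the number of fields or leave a quoted field unclosed so that a CSV reader keeps reading into the next line.

```python
import re

# Same pattern as in the question: any character repeated at least once
PATTERN = re.compile(r'(.)\1+')

def collapse(text):
    """Shorten every run of the same character to exactly two."""
    return PATTERN.sub(r'\1\1', text)

print(collapse('Helloooooo!'))         # Helloo!
print(collapse('a,,,b'))               # a,,b  (one empty field disappears)
print(collapse('"He said ""hi"""'))    # "He said ""hi""  (closing quote is lost)
```

If the last case occurs in your data, it could be one explanation for tweets being merged into a single row, but that is a guess until you check the file.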

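Another thing to try is reading the file with the csv module and running the regex on each column separately. This will be much slower, but it can help you debug.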
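
A minimal sketch of that idea, not a tested fix for your file: parse each row with `csv.reader`, apply the regex to every field, and let `csv.writer` re-quote the result. The names `squeeze` and `squeeze_csv` and the in-memory sample are hypothetical, and quoting every field with `csv.QUOTE_ALL` is an assumption about the output format you want.

```python
import csv
import io
import re

# Same pattern as in the question, applied per field instead of per raw line
REPEATS = re.compile(r'(.)\1+')

def squeeze(text):
    """Shorten every run of the same character to exactly two."""
    return REPEATS.sub(r'\1\1', text)

def squeeze_csv(infile, outfile):
    """Copy CSV rows from infile to outfile, squeezing each field.

    Returns the number of rows written so it can be compared with the input.
    """
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    rows = 0
    for row in reader:
        writer.writerow([squeeze(field) for field in row])
        rows += 1
    return rows

# Made-up sample with an escaped quote inside the first field
sample = '"Helloooooo!!! ""wow"""\n"second tweeeet"\n'
out = io.StringIO()
count = squeeze_csv(io.StringIO(sample, newline=''), out)
print(count)
print(out.getvalue())
```

Since the regex only ever sees the parsed field contents, the quotes and commas that structure the file are written back by the csv module and cannot be collapsed. For real files, open both the input and the output with `newline=''` and `encoding='utf-8'`, as the csv documentation recommends.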

Python: Remove repeated letters from tweets
