Question

I am scraping Twitter's trending topics and currently have a list like this:

            Trending_Topics
             #facebookdown              
             Lena Dunham  
     #SaveThePlanetIn4Words   
     #NationalPunctuationDay     
             Lane Kiffin

I now want to insert a '+' sign in front of every word in the string.

However, my current code

 df3['Keywords'] = df3.Trending_Topics.str.replace(r'(\b\S)', r'+\1')

places the '+' after the '#' in hashtag strings:

 Trending_Topics
 #+facebookdown
 #+SavethePlanetIn4Words
 etc...

Ideally, my output would look like this:

                Trending_Topics
             +#facebookdown              
             +Lena +Dunham  
     +#SaveThePlanetIn4Words   
     +#NationalPunctuationDay     
             +Lane +Kiffin

Is there a simple regex solution?

Answer 1

You need to use a negative lookbehind assertion.

re.sub(r'(?<!\S)(\S)', r'+\1', st)

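(?<!\S) asserts that the match is not preceded by a non-whitespace character, so it succeeds only at the start of the string or right after whitespace.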


Example:

>>> import re
>>> s = '''             #facebookdown              
             Lena Dunham  
     #SaveThePlanetIn4Words   
     #NationalPunctuationDay     
             Lane Kiffin   '''
>>> print(re.sub(r'(?<!\S)(\S)', r'+\1', s))
             +#facebookdown              
             +Lena +Dunham  
     +#SaveThePlanetIn4Words   
     +#NationalPunctuationDay     
             +Lane +Kiffin
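
To see why the pattern from the question misplaces the sign: `\b` only matches between a word character and a non-word character, and `#` is a non-word character, so the first boundary in `#facebookdown` falls between `#` and `f`. The following is a minimal sketch comparing the two patterns; the helper names and the sample list are illustrative only and are not part of the original code.

```python
import re

# Sample topics taken from the question (surrounding whitespace stripped).
topics = ["#facebookdown", "Lena Dunham", "#SaveThePlanetIn4Words"]

def plus_word_boundary(s):
    # Pattern from the question: \b sits between '#' and the first letter,
    # so the '+' lands after the hash sign.
    return re.sub(r'(\b\S)', r'+\1', s)

def plus_lookbehind(s):
    # Pattern from this answer: match a non-space character that is not
    # preceded by another non-space character.
    return re.sub(r'(?<!\S)(\S)', r'+\1', s)

for t in topics:
    print(repr(plus_word_boundary(t)), repr(plus_lookbehind(t)))
```

For a pandas column, the same pattern can be passed to `Series.str.replace`; recent pandas versions need `regex=True` for the pattern to be treated as a regular expression.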

Answer 2

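You can use:

import re
# Python's re module rejects (?<=\s|^) because lookbehinds must be fixed-width,
# so the equivalent negative lookbehind (?<!\S) is used instead.
p = re.compile(r'(?<!\S)(?=\S)')

result = p.sub("+", text)  # text is the input string

Regex breakdown:

(?<!\S)    # assert that the previous position is not a non-space character (whitespace or string start)
(?=\S)     # assert that the next position is a non-space character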
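
A caveat when running this approach in Python: the standard `re` module requires lookbehinds to have a fixed width, and an alternation of `\s` (one character) with `^` (zero characters) does not qualify. The sketch below checks that behaviour and then uses a purely zero-width pattern that inserts the sign without capturing anything. The names `add_plus` and `variable_width_ok` are hypothetical, introduced here for illustration.

```python
import re

# Check whether this interpreter accepts a variable-width lookbehind.
# CPython's re is expected to raise re.error here.
try:
    re.compile(r'(?<=\s|^)(?=\S)', re.MULTILINE)
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Fixed-width equivalent: a position not preceded by a non-space character
# and followed by a non-space character. Nothing is consumed, so the
# replacement is a pure insertion.
zero_width = re.compile(r'(?<!\S)(?=\S)')

def add_plus(text):
    """Insert '+' at the start of every whitespace-delimited token."""
    return zero_width.sub('+', text)

print(variable_width_ok)
print(add_plus("  #facebookdown \n Lena Dunham"))
```

Because a newline counts as whitespace, `(?<!\S)` also covers line starts, so the `re.MULTILINE` flag is not needed with this variant.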

Regex to insert a character at the beginning of each word
