Question

I'm fairly new to Python. Suppose I have the following strings -

tweet1= 'Check this out!! #ThrowbackTuesday I finally found this!!'
tweet2= "Man the summer is hot... #RisingSun #SummerIsHere Can't take it.."

Now I'm trying to remove all the hashtags (#) from the tweets, so that -

tweet1= 'Check this out!!  I finally found this!!'
tweet2= "Man the summer is hot...  Can't take it.."

My code is -

tweet1= 'Check this out!! #ThrowbackTuesday I finally found this!!'
i,j=0,0
s=tweet1
while i < len(tweet1):
    if tweet1[i]=='#':
        j=i
        while tweet1[j] != ' ':
            ++j
        while i<len(tweet1) and j<len(tweet1):
            ++j
            s[i]=tweet1[j]
            ++i
    ++i
print(s)

This code produces no output and no error, which makes me think my logic is wrong. Is there a simpler solution using regular expressions?

Answer 1

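You can do this with split and startswith.

Here split turns your tweet string into a list of words separated by whitespace. While building the new list in the list comprehension, use startswith to leave out any word that begins with #. Then ' '.join puts the remaining words back together, separated by single spaces.

The code can be written as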

tweet = 'Check this out!! #ThrowbackTuesday I finally found this!!'
print(' '.join([w for w in tweet.split() if not w.startswith('#')]))

Output:

Check this out!! I finally found this!!
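
As a small sketch of the same idea applied to the second tweet from the question, the one-liner can be wrapped in a function. The name remove_hashtag_words is only an illustrative choice, not something from the answer above.

```python
def remove_hashtag_words(tweet):
    # split() with no arguments splits on any run of whitespace
    words = tweet.split()
    # keep only the words that do not begin with '#'
    kept = [w for w in words if not w.startswith('#')]
    # rejoin with exactly one space between words
    return ' '.join(kept)

tweet2 = "Man the summer is hot... #RisingSun #SummerIsHere Can't take it.."
print(remove_hashtag_words(tweet2))
# Man the summer is hot... Can't take it..
```

Because split() discards the original whitespace and join inserts single spaces, the result has one space where the hashtags used to be, rather than the two spaces shown in the question's expected output.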

Answer 2

Here is a regular expression solution:

import re

re.sub(r'#\w+ ?', '', tweet1)

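The regular expression means: remove a hash sign followed by one or more word characters (letters, digits, or underscores), optionally followed by a space (so you don't end up with two spaces in a row).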
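
For reference, a minimal runnable version of this on both tweets from the question might look like the following. The names HASHTAG, cleaned1 and cleaned2 are just illustrative.

```python
import re

# '#' then one or more word characters, then an optional single space
HASHTAG = re.compile(r'#\w+ ?')

tweet1 = 'Check this out!! #ThrowbackTuesday I finally found this!!'
tweet2 = "Man the summer is hot... #RisingSun #SummerIsHere Can't take it.."

cleaned1 = HASHTAG.sub('', tweet1)
cleaned2 = HASHTAG.sub('', tweet2)

print(cleaned1)  # Check this out!! I finally found this!!
print(cleaned2)  # Man the summer is hot... Can't take it..
```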

You can find plenty of material about regular expressions in Python on Google, and they are not that hard.

Also, to allow other special characters such as $ and @, replace \w with [\w$@]. The $@ part can be swapped for whatever characters you like, since everything inside the brackets is allowed.
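
To illustrate the difference the character class makes, here is a short sketch using a made-up sample string (the text 'Save big #Deal$ today' is an assumption for demonstration, not from the question):

```python
import re

text = 'Save big #Deal$ today'

# With plain \w the match stops before '$', so the '$' is left behind
plain = re.sub(r'#\w+ ?', '', text)

# With [\w$@] the '$' counts as part of the hashtag and is removed too
extended = re.sub(r'#[\w$@]+ ?', '', text)

print(plain)     # Save big $ today
print(extended)  # Save big today
```

Inside square brackets, $ is a literal character rather than an end-of-string anchor, so it needs no escaping there.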

Answer 3

Python has no ++ operator, so ++j simply applies the unary + operator to j twice, which of course does nothing. You should use j += 1 instead.
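
A quick sketch of this, followed by one possible index-based loop in the spirit of the question's code. The function name strip_hashtags is hypothetical, and it builds a list of characters because Python strings are immutable, so an assignment like s[i] = ... would raise a TypeError.

```python
j = 0
++j              # parsed as +(+j); the value is computed and thrown away
unchanged = j    # still 0
j += 1           # this is the actual increment; j is now 1

def strip_hashtags(tweet):
    result = []  # collect kept characters; strings cannot be modified in place
    i = 0
    while i < len(tweet):
        if tweet[i] == '#':
            # skip the hashtag up to (not including) the next space or the end
            while i < len(tweet) and tweet[i] != ' ':
                i += 1
        else:
            result.append(tweet[i])
            i += 1
    return ''.join(result)

tweet1 = 'Check this out!! #ThrowbackTuesday I finally found this!!'
print(strip_hashtags(tweet1))
# Check this out!!  I finally found this!!
```

Since i never changes in the original code, its outer while loop never ends, which is why nothing is printed and no error appears. This version keeps the spaces on both sides of the removed hashtag, so the output has two spaces there, as in the question's expected result.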
