Question

I need some advice... I have a collection of tweets:

Mon Apr 06 22:19:45 PDT @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. :( You shoulda got David Carr of Third Day to do it. ;D
Mon Apr 06 22:19:49 PDT is upset that he can't update his Facebook by texting it... and might cry as a result :( School today also. Blah!
Mon Apr 06 22:19:53 PDT @Kenichan I dived many times for the ball. Managed to save 50% :( The rest go out of bounds
Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :(

How do I remove the leading "Mon Apr 06 22:19:57 PDT" part? With a regular expression?

Answer 1

If this is a single string, just split each line on the first PDT:

for line in tweets.splitlines():
    # Keep only the text after the first ' PDT ' marker.
    print(line.split(' PDT ', 1)[1])

This splits each line at the first occurrence of ' PDT ' (with the surrounding spaces) and prints the second half of the result.
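One caveat: indexing the split result with [1] raises an IndexError on any line that does not contain the marker. A minimal sketch of a more forgiving variant, assuming Python 3; the helper name strip_prefix and the sample data are made up for illustration:

```python
def strip_prefix(line, marker=" PDT "):
    # partition() returns (head, sep, tail); sep is empty when the marker is absent.
    head, sep, tail = line.partition(marker)
    # Fall back to the unchanged line if there was nothing to strip.
    return tail if sep else line

# Hypothetical sample: one timestamped tweet and one line with no timestamp.
tweets = (
    "Mon Apr 06 22:19:57 PDT my whole body feels itchy and like its on fire :(\n"
    "a line without a timestamp"
)
cleaned = [strip_prefix(line) for line in tweets.splitlines()]
```

Lines without the marker pass through untouched instead of stopping the loop.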

But perhaps you could instead stop the code that produces these strings from adding the date in the first place?

Answer 2

for line in lines:
    # The timestamp prefix "Mon Apr 06 22:19:45 PDT " is 24 characters long.
    print(line[24:])

If the date/time format is always the same, it could be this simple.

Answer 3

If they are strings and are all stored in the same format, just split them:

tweet = "Mon Apr 06 22:19:57 PDT SomeGuy Im not white enough to be excited for a new version of Windows"

# Split on whitespace at most five times; the last piece is the tweet text.
tweet = tweet.split(None, 5)[-1]

The resulting tweet:

"SomeGuy Im not white enough to be excited for a new version of Windows"

Answer 4

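Splitting each tweet into a list of words and dropping the first five (weekday, month, day, time, time zone) seems more likely to keep working if the time zone changes.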

clean_tweets = []

for tweet in tweets:
    words = tweet.split()
    del words[0:5]
    clean_tweet = " ".join(words)
    clean_tweets.append(clean_tweet)

By default, split() splits on whitespace, so you don't need to specify a separator.
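Note that joining the words back with " ".join() collapses any runs of whitespace inside the tweet. If that matters, a bounded split keeps the remainder intact. A rough sketch, assuming Python 3 and a five-field timestamp; drop_leading_fields and the sample string are hypothetical:

```python
def drop_leading_fields(tweet, n_fields=5):
    # Split at most n_fields times so the remainder stays one untouched string.
    parts = tweet.split(None, n_fields)
    # Lines with nothing after the timestamp yield an empty tweet.
    return parts[n_fields] if len(parts) > n_fields else ""

# Hypothetical sample with a non-PDT zone and a double space inside the text.
sample = "Mon Apr 06 22:19:53 EDT @Kenichan I dived  many times"
result = drop_leading_fields(sample)
```

Because nothing here depends on the literal PDT, the same call works for any time zone abbreviation that is a single token.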

Answer 5

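I'm assuming you can't rely on PDT, since you can't assume the time zone will always be PDT. The most easily recognizable part of the string seems to be the time: [0-9]+:[0-9]+:[0-9]+.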

/^.*[0-9]+:[0-9]+:[0-9]+\s+[A-Z]{3}\s*(.*)$/

This captures everything after the time and a three-letter, all-caps time zone.
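The pattern above is written in slash-delimited form; in Python it might look like the sketch below, using the standard re module. One deliberate change, not part of the original answer: the leading .* is made lazy (.*?) so the match stops at the first time-like token rather than the last one in the line. The names TIMESTAMP and remove_timestamp are made up for illustration.

```python
import re

# Lazy prefix, then HH:MM:SS, whitespace, a three-capital zone, then the tweet.
TIMESTAMP = re.compile(r"^.*?[0-9]+:[0-9]+:[0-9]+\s+[A-Z]{3}\s*(.*)$")

def remove_timestamp(line):
    match = TIMESTAMP.match(line)
    # Leave the line unchanged when no timestamp is found.
    return match.group(1) if match else line

cleaned = remove_timestamp(
    "Mon Apr 06 22:19:49 PDT is upset that he can't update his Facebook"
)
```

With the greedy original, a tweet that itself mentions something like "10:30:00 UTC" would lose everything before that mention; the lazy prefix avoids this under the assumption that the timestamp always comes first.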

How to remove the date from tweets?
