Question

How do I open my txt file and remove some special characters from certain tweets in it?

My text looks like this:

@xirwinshemmo thanks for the follow :)
hii... if u want to make a new friend just add me on facebook! :) xx https:\/\/t.co\/RCYFVrmdDG
@ycmaP enjoy tmrro. saw them earlier this wk here in tokyo :)

I have to get rid of everything that starts with @ and every web address (http). How can I do that?

So far I have tried this:

import re

a = []
with open('englishtweets1.txt','r') as inf:
    a = inf.readlines()
for line in a:
    line = re.sub(r['@'], line)

Answer 1

Use it like this:

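import re
with open('englishtweets1.txt') as f:
    data = f.read()
# Replace a leading "@" on each line (re.MULTILINE lets ^ match at every line start).
new_str = re.sub(r'^@', ' ', data, flags=re.MULTILINE)
# Drop lines that begin with an http(s) URL.
new_str = re.sub(r'^https?:\/\/.*[\r\n]*', '', new_str, flags=re.MULTILINE)
#open('removed.txt', 'w').write(new_str) (if needed)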

Update: this one has just been tested:

new_str = re.sub(r'https.(.*?) ', '', new_str, flags=re.MULTILINE)

Answer 2

All at once

If your file is not very large, you can do everything in one pass:

import re
with open('englishtweets1.txt') as f:
    # Remove a leading @mention, or any token starting with "http", on every line.
    contents = re.sub(r'^@\w+\s|\bhttp[^\s]+', '', f.read(), flags=re.MULTILINE)
print(contents)

Result:

thanks for the follow :)
hii... if u want to make a new friend just add me on facebook! :) xx
enjoy tmrro. saw them earlier this wk here in tokyo :)

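Note that the http stripping is very simple and will remove anything that starts with http. To fix this, you can refine the regular expression so that it only matches valid http URLs.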
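As a rough sketch of that refinement, the pattern below only strips tokens that begin with `http://` or `https://`, so a word such as `httpd` is left alone. The helper name `strip_mentions_and_urls` is made up for illustration, and allowing a backslash before each slash is an assumption based on the escaped form (`https:\/\/`) in the sample tweets. It is still far from a full URL validator.

```python
import re

# Hypothetical helper, not part of the original answer.
# Assumption: URLs may carry escaped slashes ("https:\/\/"), as in the sample tweets.
PATTERN = re.compile(
    r'^@\w+\s'                 # a leading @mention plus the whitespace after it
    r'|\bhttps?:\\?/\\?/\S+',  # http:// or https:// (slashes optionally escaped) up to whitespace
    flags=re.MULTILINE,
)

def strip_mentions_and_urls(text):
    """Remove leading @mentions and http(s) URLs from text."""
    return PATTERN.sub('', text)

sample = (
    "@xirwinshemmo thanks for the follow :)\n"
    "hii... add me on facebook! :) xx https:\\/\\/t.co\\/RCYFVrmdDG\n"
    "httpd is a web server, see http://example.com/docs\n"
)
print(strip_mentions_and_urls(sample))
```

Mentions in the middle of a line are deliberately left untouched here, matching the `^@` anchor used in the answer.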

Line by line

If your file is very large, you may not want to hold all of it in memory. You can iterate over the lines of the file instead:

import re
with open('englishtweets1.txt') as f:
    for line in f:
        # Each line keeps its own newline, so suppress the one print() would add.
        print(re.sub(r'^@\w+\s|\bhttp[^\s]+', '', line), end='')
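If the cleaned text should end up in a file rather than on the screen, the same loop can write each line as soon as it is processed. This is a minimal sketch: the helper name `clean_file` is hypothetical, and the output name `removed.txt` is borrowed from the commented-out line in Answer 1.

```python
import re

# Same pattern as in the answer: a leading @mention, or any token starting with "http".
PATTERN = re.compile(r'^@\w+\s|\bhttp[^\s]+')

def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, stripping @mentions and http tokens."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:  # only one line is held in memory at a time
            dst.write(PATTERN.sub('', line))

# Example call (file names are assumptions):
# clean_file('englishtweets1.txt', 'removed.txt')
```

Writing to a separate output file keeps the original tweets intact in case the pattern removes more than intended.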
