Question

Problem: remove hyperlinks, numbers, and symbols such as ^&*$ from Twitter text. The tweet file is in CSV format, as shown below:

s.No.   username   tweetText

1.      @abc  This is a test #abc example.com
2.      @bcd  This is another test #bcd example.com

Being new to Python, I searched around and strung together the following code, thanks to the code given here:

import re
fileName="path-to-file//tweetfile.csv"
fileout=open("Output.txt","w")
with open(fileName,'r') as myfile:
    data=myfile.read().lower() # read the file and convert all text to lowercase
    clean_data=' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",data).split()) # regular expression to strip the html out of the text
fileout.write(clean_data+'\n') # write the cleaned data to a file
fileout.close()
myfile.close()
print("All done")

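It strips the data, but the format of the output file is not what I want. The output text file is all on a single line, like

s.no username tweetText 1 abc This is a cleaned tweet 2 bcd This is another cleaned tweet 3 efg This is yet another cleaned tweet

How can I fix this code so that it gives me output like the following: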

s.No. username  tweetText

1  abc  This is a test

2  bcd  This is another test

3  efg  This is yet another test

I think something needs to be added to the regex code, but I don't know what it could be. Any pointers or suggestions would help.

Answer 1

You can read a line, clean it, and write it out, all in one loop. You can also use the csv module to help build the result file.

import csv
import re

exp = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def cleaner(row):
    return [re.sub(exp, " ", item.lower()) for item in row]

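# Python 3: open CSV files in text mode with newline='' ('wb' only works on Python 2)
with open('input.csv', 'r', newline='') as i, open('output.csv', 'w', newline='') as o: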
    reader = csv.reader(i, delimiter=',')  # Comma is the default
    writer = csv.writer(o, delimiter=',')

    # Take the first row from the input file (the header)
    # and write it to the output file

    writer.writerow(next(reader))

    for row in reader:
        writer.writerow(cleaner(row))

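The csv module knows how to add the separators between items correctly, as long as you pass it a collection of items.

So the cleaner function takes the content of each item (column) in a row of the input file, applies the substitution to the lowercase version of the item, and returns a list.

The rest of the code simply opens the files and configures the csv module with the desired delimiters for the input and output files (in the example code both files use a comma as the delimiter, but you can change the output delimiter).

Next, the first line of the input file is read and written to the output file. No transformation is applied to this line (which is why it is not in the loop).

Reading a line from the input file automatically moves the file pointer to the next line, so we simply loop over the input rows (in reader), apply the cleaner function to each row (which returns a list), and then write that list back to the output file with writer.writerow().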
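To see what the approach above does to a single row, here is a small sketch that runs the same pattern on in-memory data instead of real files. The sample row, the `http://` scheme on the URL, the `clean_csv` helper name, and the extra whitespace-collapsing step are assumptions added for illustration; they are not part of the original answer.

```python
import csv
import io
import re

# Same pattern as in the answer: @mentions, non-alphanumeric characters, URLs with a scheme.
EXP = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def cleaner(row):
    # Substitute matches with a space, then collapse runs of whitespace
    # (the collapsing step is an addition, not in the original answer).
    return [" ".join(re.sub(EXP, " ", item.lower()).split()) for item in row]

def clean_csv(source, target):
    # Hypothetical helper: source and target are any file-like objects.
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    writer.writerow(next(reader))  # header is copied unchanged
    for row in reader:
        writer.writerow(cleaner(row))

# Assumed sample data, comma-separated to match the csv defaults.
sample = "s.No.,username,tweetText\n1.,@abc,This is a test #abc http://example.com\n"
out = io.StringIO()
clean_csv(io.StringIO(sample), out)
print(out.getvalue())
```

With this sample the cleaned row comes out as `1,,this is a test abc`: the pattern removes a whole @mention, so a username column that holds only `@abc` ends up empty, and the text of the hashtag survives without its `#`. If that is not what you want, the pattern needs adjusting.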

Answer 2

Instead of applying the re.sub() and .lower() expressions to the whole file at once, try iterating over each line of the CSV file, like this:

for line in myfile:
    line = line.lower()
    line = re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",line)
    fileout.write(line+'\n')

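Also, when you use the with open(...) as myfile statement, there is no need to close the file at the end of your program; this is done automatically when the with block ends.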
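Put together as a complete script, the per-line approach might look like the sketch below. The function names `clean_line` and `clean_file` and the whitespace-collapsing step are assumptions added here; the pattern is the one from the question.

```python
import re

# Pattern taken from the question.
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def clean_line(line):
    # Lowercase the line, blank out matches, and collapse repeated whitespace.
    # The trailing newline is also matched by the pattern, so it is removed here.
    return " ".join(re.sub(PATTERN, " ", line.lower()).split())

def clean_file(in_path, out_path):
    # Both files are closed automatically when the with block ends.
    with open(in_path, "r") as infile, open(out_path, "w") as outfile:
        for line in infile:
            outfile.write(clean_line(line) + "\n")  # one output line per input line
```

Calling `clean_file("tweetfile.csv", "Output.txt")` (file names assumed) writes one cleaned line per input line, which is what keeps the output from collapsing into a single line.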

Answer 3

Try this regex:

clean_data=' '.join(re.sub(r"[@\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)"," ",data).split()) # regular expression to strip the html out of the text

Explanation:

[@\^&\*\$] matches the characters you want to replace
#\S+ matches hash tags
\S+[a-z0-9]\.(com|net|org) matches domain names

If \S+[a-z0-9]\.(com|net|org) does not recognize your URLs, you will have to complete the list of potential TLDs.
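As a quick check of how this pattern behaves, the sketch below applies it to one line shaped like the question's sample data. The names `TWEET_NOISE` and `strip_noise` are made up for this illustration.

```python
import re

# The regex from this answer, compiled once as a raw string.
TWEET_NOISE = re.compile(r"[@\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)")

def strip_noise(text):
    # Replace each match with a space, then collapse repeated whitespace.
    return " ".join(TWEET_NOISE.sub(" ", text).split())

print(strip_noise("@abc This is a test #abc example.com"))
```

This prints `abc This is a test`: the `@` is dropped but the username is kept, the whole hashtag is removed, and the bare domain is removed. Note that this pattern does not touch digits or periods, so a leading `1.` stays as it is, and the domain part only matches lowercase names.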

