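Edit: I have a text file in which each line contains a Persian sentence, a tab, and an English word. I removed the stop words and punctuation and put the result in a list (witoutStops). Now I need to check, for each line of the witoutStops list, whether each word in "s" occurs in that line, writing "1" if it does and "0" if it does not. For example, if the list has 10 lines, the output file should have 10 rows of 1s and 0s and 6 columns (5 for the words in the "s" list and 1 for the English word). The problem is that it returns 30 rows. How can I fix this?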
from hazm import *
from collections import Counter
import collections
import math

punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛«'
file1 = "stopwords.txt"
file2 = "test files/golTest.txt"
witoutStops = []
corpuslines = []

def RemStopWords(file1, file2):
    with open(file1, encoding = "utf-8") as stopfile:
        normalizer = Normalizer()
        stopwords = stopfile.read()
        stopwords = normalizer.normalize(stopwords)
    with open(file2, encoding = "utf-8") as trainfile:
        for line in trainfile:
            tmp = line.strip().split("\t")
            tmp[0] = normalizer.normalize(tmp[0])
            for i in punctuation: # delete punctuations
                if i in tmp[0]:
                    tmp[0] = tmp[0].replace(i, "")
            corpuslines.append(tmp)
    for row in corpuslines:
        line = ""
        tokens = row[0].split() # delete stop words
        for token in tokens:
            if token not in stopwords:
                line += token + " "
        line = line.strip() + "\t" + row[1] + "\n"
        witoutStops.append (line)
    #print (witoutStops)
    #print (corpuslines)

s = ['آبی', 'منابع', 'سبز', 'رنگ', 'زرد']

def vector():
    RemStopWords(file1, file2)
    for line in witoutStops:
        with open ("Train.arff", "a", encoding = "utf-8") as f:
            line = line.split("\t")
            words = line[0].split()
            for i in s:
                if any([i == word for word in words]):
                    f.write('1,')
                else:
                    f.write('0,')
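Sample file (and the witoutStops list):
The output should have 5 columns (the words in the "s" list) plus one column for the English word, and 10 rows.
Note: this is part of a larger program, which has other functions that extract the words of the "s" list (actually the 1000 most frequent words in the file). I am using these 5 words here as an example.
Answer 0 (score: 0)
Given a tab-separated file: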
$ cat test.txt
زرد رنگ سبز منابع آبی Yellow-green color of water resources
زرد رنگ سبز منابع آبی Yellow-green color of water resources
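This is the idiomatic way to read the columns of the file:
# Pass the encoding explicitly so the Persian text is decoded the same way on every platform
with open('test.txt', 'r', encoding='utf-8') as fin:
    for line in fin:
        # Each line is "<persian sentence>\t<english text>"
        persian, english = line.strip().split('\t')
        # Do something
        print (persian)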
You can also read the tab-separated file as a dataframe, e.g. with pandas:
>>> import pandas as pd
>>> pd.read_csv('test.txt', delimiter='\t', header=None)
0 1
0 زرد رنگ سبز منابع آبی Yellow-green color of water resources
1 زرد رنگ سبز منابع آبی Yellow-green color of water resources
>>> df = pd.read_csv('test.txt', delimiter='\t', header=None)
>>> df.rename(columns={0:'persian', 1:'english'})
persian english
0 زرد رنگ سبز منابع آبی Yellow-green color of water resources
1 زرد رنگ سبز منابع آبی Yellow-green color of water resources
>>> df = df.rename(columns={0:'persian', 1:'english'})
>>> df['persian']
0 زرد رنگ سبز منابع آبی
1 زرد رنگ سبز منابع آبی
Name: persian, dtype: object
>>> print (df['persian'][0])
زرد رنگ سبز منابع آبی
>>> print (df['english'][0])
Yellow-green color of water resources
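To write the columns to new files:
# One output file per column: p.txt for Persian, e.txt for English
with open('test.txt', 'r', encoding='utf-8') as fin, \
        open('p.txt', 'w', encoding='utf-8') as pfout, \
        open('e.txt', 'w', encoding='utf-8') as efout:
    for line in fin:
        persian, english = line.strip().split('\t')
        pfout.write(persian+'\n')
        efout.write(english+'\n')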
If you want to remove stop words before writing to the files:
stopwords = ['of', 'in', 'the']
with open('test.txt', 'r', encoding='utf-8') as fin, \
        open('p.txt', 'w', encoding='utf-8') as pfout, \
        open('e.txt', 'w', encoding='utf-8') as efout:
    for line in fin:
        persian, english = line.strip().split('\t')
        # Keep only the words that are not in the stop word list
        english_no_stop = [w for w in english.split() if w not in stopwords]
        # Concatenate the list into a string
        english_no_stop = ' '.join(english_no_stop)
        pfout.write(persian+'\n')
        efout.write(english_no_stop +'\n')
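The same line-by-line pattern could be applied to the 0/1 rows from the question. The following is only a sketch under some assumptions: `to_feature_row` and `write_vectors` are hypothetical helper names, not part of the original code, and each input line is assumed to have the form "sentence, tab, label" as in the witoutStops list. It builds each row in memory and writes it with exactly one newline, and it opens the output file once in 'w' mode. The question's code opens Train.arff in 'a' (append) mode, so every rerun adds to the existing file, which may be one reason for seeing more rows than expected.

```python
# Vocabulary from the question (in the real program, the 1000 most frequent words).
s = ['آبی', 'منابع', 'سبز', 'رنگ', 'زرد']

def to_feature_row(line, vocabulary):
    """Turn 'sentence<TAB>label' into a comma-separated row: one 0/1 per vocabulary word, then the label."""
    sentence, label = line.rstrip('\n').split('\t')
    words = set(sentence.split())
    flags = ['1' if term in words else '0' for term in vocabulary]
    return ','.join(flags + [label])

def write_vectors(lines, vocabulary, path):
    """Write one row per input line (hypothetical helper, not from the original post)."""
    # 'w' truncates the file, so rerunning the script does not accumulate old rows.
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(to_feature_row(line, vocabulary) + '\n')
```

With this shape, calling `write_vectors(witoutStops, s, "Train.arff")` after `RemStopWords(file1, file2)` should produce one row per entry of the list, each with `len(s) + 1` columns.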