Question

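Edit: I have a text file in which each line contains a Persian sentence, a tab, and an English word. I removed the stopwords and punctuation and put the results in a list (witoutStops). Now I have to check, for each line of the witoutStops list, whether each word in "s" occurs in it, writing "1" if it does and "0" if it does not. For example, if the list has 10 lines, the output file should have 10 lines of 1s and 0s and 6 columns (5 for the words in the "s" list and 1 for the English word). The problem is that it returns 30 lines. How can I fix it?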

from hazm import*
from collections import Counter
import collections
import math

punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛«'

file1 = "stopwords.txt"
file2 = "test files/golTest.txt"


witoutStops = []
corpuslines = []

def RemStopWords(file1, file2):  
    with open(file1, encoding = "utf-8") as stopfile:
        normalizer = Normalizer()
        stopwords = stopfile.read()
        stopwords = normalizer.normalize(stopwords)
        with open(file2, encoding = "utf-8") as trainfile:
            for line in trainfile:
                tmp = line.strip().split("\t")
                tmp[0] = normalizer.normalize(tmp[0])
                for i in punctuation:  # delete punctuations
                    if i in tmp[0]:
                        tmp[0] = tmp[0].replace(i, "")
                corpuslines.append(tmp)
                for row in corpuslines:
                    line = ""
                    tokens = row[0].split()# delete stop words
                    for token in tokens: 
                        if token not in stopwords:
                            line += token + " "
                line = line.strip() + "\t" + row[1] + "\n"
                witoutStops.append (line)
#print (witoutStops)
#print (corpuslines)

s = ['آبی', 'منابع', 'سبز', 'رنگ', 'زرد']

def vector():
    RemStopWords(file1, file2)
    for line in witoutStops:
        with open("Train.arff", "a", encoding="utf-8") as f:
            line = line.split("\t")
            words = line[0].split()
            for i in s:
                if any([i == word for word in words]):
                    f.write('1,')
                else:
                    f.write('0,')

Sample of the file (and of the witoutStops list):

The output should have 5 columns (one per word in the "s" list) plus one column for the English word, and 10 rows.

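Hint: this is part of a larger program, so other functions extract the words of the "s" list (which are actually the 1,000 most frequent words in the file). I use these 5 words here as an example.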
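The code that builds the "s" list is not shown in the question. As a rough sketch only, one common way to get the most frequent words of a corpus is `collections.Counter`; the function name `most_common_words` and the sample sentences below are hypothetical, not taken from the original program.

```python
from collections import Counter


def most_common_words(lines, n):
    """Return the n most frequent whitespace-separated tokens in lines.

    Hypothetical helper: the question does not show how "s" is built.
    """
    counts = Counter()
    for line in lines:
        # Count every token of the (already cleaned) sentence
        counts.update(line.split())
    # most_common() yields (word, count) pairs, highest count first
    return [word for word, _ in counts.most_common(n)]


# Made-up sample sentences, for illustration only
sample = ["زرد رنگ سبز", "سبز رنگ", "سبز"]
top_words = most_common_words(sample, 2)
```

With the real data, `n` would be 1000 and `lines` would be the Persian column after stopword removal.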

Answer 1

Given a tab-delimited file:

$ cat test.txt 
زرد رنگ سبز منابع آبی   Yellow-green color of water resources
زرد رنگ سبز منابع آبی   Yellow-green color of water resources

This is an idiomatic way to read the columns of the file:

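# Pass encoding explicitly so the Persian text decodes the same way on every platform
with open('test.txt', 'r', encoding='utf-8') as fin:
    for line in fin:
        persian, english = line.strip().split('\t')
        # Do something
        print(persian)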

You can also read the tab-delimited file as a DataFrame, e.g. with pandas:

>>> import pandas as pd
>>> pd.read_csv('test.txt', delimiter='\t', header=None)
                       0                                      1
0  زرد رنگ سبز منابع آبی  Yellow-green color of water resources
1  زرد رنگ سبز منابع آبی  Yellow-green color of water resources
>>> df = pd.read_csv('test.txt', delimiter='\t', header=None)
>>> df.rename(columns={0:'persian', 1:'english'})
                 persian                                english
0  زرد رنگ سبز منابع آبی  Yellow-green color of water resources
1  زرد رنگ سبز منابع آبی  Yellow-green color of water resources
>>> df = df.rename(columns={0:'persian', 1:'english'})
>>> df['persian']
0    زرد رنگ سبز منابع آبی
1    زرد رنگ سبز منابع آبی
Name: persian, dtype: object
>>> print (df['persian'][0])
زرد رنگ سبز منابع آبی
>>> print (df['english'][0])
Yellow-green color of water resources

To write the columns to new files:

with open('test.txt', 'r', encoding='utf-8') as fin, open('p.txt', 'w', encoding='utf-8') as pfout, open('e.txt', 'w', encoding='utf-8') as efout:
    for line in fin:
        persian, english = line.strip().split('\t')
        pfout.write(persian+'\n')
        efout.write(english+'\n')

If you want to remove stopwords before writing to the files:

stopwords = ['of', 'in', 'the']
with open('test.txt', 'r', encoding='utf-8') as fin, open('p.txt', 'w', encoding='utf-8') as pfout, open('e.txt', 'w', encoding='utf-8') as efout:
    for line in fin:
        persian, english = line.strip().split('\t')
        english_no_stop = [w for w in english.split() if w not in stopwords]
        # Concatenate the list into a string
        english_no_stop = ' '.join(english_no_stop)
        pfout.write(persian+'\n')
        efout.write(english_no_stop +'\n')
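The same line-by-line pattern can be extended to the 0/1 output the question asks for. The following is only a sketch under assumptions: the helper names `feature_row` and `write_features` are hypothetical, each input line is assumed to hold exactly one tab, and the output is comma-separated with the English word as the last column. It writes exactly one row per input line and opens the output in `'w'` mode; a file opened in `'a'` mode keeps growing every time the script is run, which is one possible reason for seeing more rows than expected.

```python
def feature_row(persian, english, vocabulary):
    """Build one comma-separated row: a 0/1 flag per vocabulary word, then the English word."""
    words = set(persian.split())
    flags = ['1' if term in words else '0' for term in vocabulary]
    return ','.join(flags + [english])


def write_features(lines, vocabulary, out_path):
    """Write one feature row per input line (hypothetical helper)."""
    # 'w' truncates the file, so re-running does not accumulate old rows
    with open(out_path, 'w', encoding='utf-8') as fout:
        for line in lines:
            # Assumes "persian<TAB>english" on every line
            persian, english = line.rstrip('\n').split('\t')
            fout.write(feature_row(persian, english, vocabulary) + '\n')


# The five example words from the question
s = ['آبی', 'منابع', 'سبز', 'رنگ', 'زرد']
row = feature_row('زرد رنگ', 'yellow', s)
```

Here `row` is `'0,0,0,1,1,yellow'`: the sentence contains only the last two words of `s`. For ten input lines, `write_features` produces ten rows of six columns.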
