Question

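I want a text file that contains many Arabic words, so I would like to open a website in Python with the urlopen function, save the words in a list, and then export that list to a text file. I'm new to Python, so any help would be appreciated.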

Answer 1

To save a file from the web:

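import urllib2  # Python 2 only; in Python 3 this module is urllib.request

# Fetch the remote file (placeholder URL; substitute your own).
u = urllib2.urlopen('http://www.your-url-here.com/filename.txt')
# Write in binary mode so the downloaded bytes are stored unchanged.
f = open('myfile.txt', 'wb')
f.write(u.read())
f.close()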
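
For readers on Python 3, where urllib2 no longer exists, the same download can be sketched with urllib.request. This is a minimal sketch rather than the answerer's code: the function name save_url is made up for illustration, and the URL in the usage comment is the placeholder from the answer above.

```python
import shutil
import urllib.request


def save_url(url, path):
    """Download the resource at `url` and store its raw bytes at `path`.

    `save_url` is a hypothetical helper name, not part of any library.
    """
    # Binary mode keeps the bytes untouched, so Arabic text survives
    # whatever encoding the server used.
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# Example call (placeholder URL, replace with a real one):
# save_url("http://www.your-url-here.com/filename.txt", "myfile.txt")
```

Copying the response stream with shutil.copyfileobj avoids holding the whole file in memory, which matters only for large downloads.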

Answer 2

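Do the following:

1. Fetch the page that contains the text.
2. Strip the HTML tags and symbols.
3. Extract the words.
4. Filter out the noise.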

For points 2 and 3 you can use NLTK. Here is an example of how to implement it:

import nltk
import urllib2  # Python 2 only; in Python 3 this module is urllib.request

# Replace google with your Arabic site of interest.
u = urllib2.urlopen('http://www.google.com')
# Real words don't contain these symbols; add your own.
unwanted_symbols = '|&;.,-!'
html = u.read()
# clean_html works in NLTK 2.x; in NLTK 3 it raises NotImplementedError
# and points to BeautifulSoup's get_text() instead.
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
filename = 'arabicwords.txt'
f = open(filename, 'w')
for token in tokens:
    # Write the token only if it contains none of the unwanted symbols.
    if not any(symbol in token for symbol in unwanted_symbols):
        f.write(token + '\n')
f.close()
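
The tag-stripping, word-extraction, and filtering steps can also be sketched in Python 3 with only the standard library. This swaps in a different technique from the answer above: html.parser replaces nltk.clean_html, and a regular expression replaces the symbol blacklist. The names TextExtractor, arabic_words, and save_words are hypothetical, and the character range U+0621 to U+0652 is an assumption that covers the basic Arabic letters and vowel marks but not extended letters used by other languages written in Arabic script.

```python
import re
from html.parser import HTMLParser

# Assumption: a "word" is a run of basic Arabic letters and vowel marks.
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def arabic_words(html):
    """Return the Arabic words found in the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Matching only Arabic characters drops punctuation, digits, and Latin
    # text in one pass, so no separate symbol filter is needed.
    return ARABIC_WORD.findall(" ".join(parser.chunks))


def save_words(words, path):
    """Write one word per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(words) + "\n")
```

To use it on a live page, the HTML string could come from urllib.request.urlopen(url).read().decode("utf-8"), assuming the page is served as UTF-8, and the result of arabic_words(html) can be passed straight to save_words.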

Write a word list from a website to a txt file
