Question

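I am trying to get the Wikipedia text together with its punctuation, because taking punctuation into account is very important for my doc2vec model. However, WikiCorpus retrieves only the words. After searching the web I found these pages: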

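1. A page from the issues section of the gensim GitHub repository. Someone asked the same question, and the answer (from Piskvorky) was to subclass WikiCorpus. Luckily, the same page includes code for the suggested 'subclass' solution, provided by Rhazegh. (link)
2. A Stack Overflow page titled "Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus". However, no clear answer was given, and the question was treated in the context of spaCy. (link)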

I decided to use the code provided on page 1. My current code (mywikicorpus.py):

import sys
import os
sys.path.append('C:\\Users\\Ghaliamus\\Anaconda2\\envs\\wiki\\Lib\\site-packages\\gensim\\corpora\\')

from wikicorpus import *

def tokenize(content):
    # override original method in wikicorpus.py
    return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
        if len(token) <= 15 and not token.startswith('_')]

def process_article(args):
    # override original method in wikicorpus.py
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    if lemmatize:
        result = utils.lemmatize(text)
    else:
        result = tokenize(text)
    return result, title, pageid


class MyWikiCorpus(WikiCorpus):
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        WikiCorpus.__init__(self, fname, processes, lemmatize, dictionary, filter_namespaces)

    def get_texts(self):
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for tokens, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(tokens)
                if len(tokens) < ARTICLE_MIN_WORDS or any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                articles += 1
                positions += len(tokens)
                if self.metadata:
                    yield (tokens, (pageid, title))
                else:
                    yield tokens
        pool.terminate()

        logger.info(
            "finished iterating over Wikipedia corpus of %i documents with %i positions"
            " (total %i articles, %i positions before pruning articles shorter than %i words)",
            articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length

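Then I used another piece of code, by Pan Yang (link). It instantiates a WikiCorpus object and retrieves the text. The only change in my version is that it instantiates MyWikiCorpus instead of WikiCorpus. The code (process_wiki.py):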

from __future__ import print_function
import logging
import os.path
import six
import sys
import mywikicorpus as myModule



if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki-20180601-pages-articles.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = myModule.MyWikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        if six.PY3:
            output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

I ran process_wiki.py from the command line. I got the text of the corpus, and the last line printed in the command prompt was:

(2018-06-05 09:18:16,480: INFO: Finished Saved 4526191 articles)

When I read the file in Python and checked the first article, it had no punctuation. For example:

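I have two related questions that I hope you can help me with:

1. Is there anything wrong in the pipeline described above?
2. Regardless of this pipeline, if I opened the gensim wikicorpus code (wikicorpus.py) and wanted to edit it, which line should I add, remove, or update (and with what, if possible) to get the same results but with punctuation?

Many thanks for taking the time to read this long post.

Best wishes,

Ghaliamus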

Answer 1

The problem lies in the tokenize function you defined:

def tokenize(content):
    return [token.encode('utf8')
            for token in utils.tokenize(content, lower=True, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]

The function utils.tokenize(content, lower=True, errors='ignore') simply splits the article into tokens. However, the implementation of this function in .../site-packages/gensim/utils.py ignores punctuation.

For example, when you call utils.tokenize("I love eating banana, apple"), it yields the tokens ["I", "love", "eating", "banana", "apple"]. The comma is gone.
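
To see why the punctuation disappears, here is a minimal sketch that imitates the behaviour using only the standard re module. The pattern is assumed to mirror gensim's PAT_ALPHABETIC, and the helper name alphabetic_tokens is made up for illustration; it is not part of gensim.

```python
import re

# Assumed to mirror gensim.utils.PAT_ALPHABETIC:
# runs of word characters, none of which is a digit.
PAT_ALPHABETIC = re.compile(r'(((?![\d])\w)+)', re.UNICODE)


def alphabetic_tokens(text):
    """Hypothetical stand-in for utils.tokenize: keep only alphabetic runs."""
    return [match.group() for match in PAT_ALPHABETIC.finditer(text)]


tokens = alphabetic_tokens("I love eating banana, apple")
print(tokens)
```

Because the pattern only matches word characters, anything else (commas, periods, question marks) is never part of a match and is silently skipped.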

In any case, you can keep the punctuation by defining your own tokenize function as follows:

def tokenize(content):
    #override original method in wikicorpus.py
    return [token.encode('utf8') for token in content.split() 
           if len(token) <= 15 and not token.startswith('_')]
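
Note that splitting on whitespace keeps each punctuation mark attached to its neighbouring word. If the model should see punctuation as separate tokens instead, a regex-based variant is one option. The sketch below compares the two; tokenize_with_punctuation and its max_len parameter are hypothetical names, and the code assumes Python 3 strings rather than the encoded bytes used above.

```python
import re

text = "I love eating banana, apple."

# Whitespace splitting: punctuation stays glued to the word before it.
split_tokens = text.split()


def tokenize_with_punctuation(content, max_len=15):
    """Hypothetical variant: words and punctuation marks become separate tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", content, flags=re.UNICODE)
    return [t for t in tokens if len(t) <= max_len and not t.startswith('_')]


separate_tokens = tokenize_with_punctuation(text)
print(split_tokens)
print(separate_tokens)
```

Which form is preferable depends on the downstream model: glued tokens such as "banana," enlarge the vocabulary, while separate tokens keep the word forms identical to the unpunctuated ones.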

Answer 2

The following method, found in gensim/utils.py,

def save_as_line_sentence(corpus, filename):
    with smart_open(filename, mode='wb', encoding='utf8') as fout:
        for sentence in corpus:
            line = any2unicode(' '.join(sentence) + '\n')
            fout.write(line)

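can be used to write the corpus to a text file. You can override it, or take it as an example and write your own version (maybe you want to break the line at each punctuation mark), like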

def save_sentence_each_line(corpus, filename):
    with utils.smart_open(filename, mode='wb', encoding='utf8') as fout:
        for sentence in corpus:
            line = utils.any2unicode(' '.join(sentence) + '\n')
            line = line.replace('. ', '\n').replace('!', '\n').replace('?', '\n') # <- !!
            ...

You can call it like this:

save_sentence_each_line(wiki.get_texts(), out_f)
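
As a rough, self-contained illustration of the same idea, the sketch below writes one sentence per line using only the standard library. It is a different technique from the replace-based version above: it splits after '.', '!' or '?' with a regular expression, so the punctuation mark is kept at the end of each line. The function name save_one_sentence_per_line is hypothetical, and the code assumes that the tokens are unicode strings.

```python
import io
import re

# Split at whitespace that follows a sentence-ending mark; the mark itself is kept.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def save_one_sentence_per_line(corpus, filename):
    """Hypothetical helper: write every sentence of every document on its own line."""
    with io.open(filename, mode="w", encoding="utf8") as fout:
        for tokens in corpus:
            text = u" ".join(tokens)
            for sentence in SENTENCE_BREAK.split(text):
                sentence = sentence.strip()
                if sentence:
                    fout.write(sentence + u"\n")
```

This only helps if the tokens still carry their punctuation, which is why the tokenizer has to be changed as well.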

But you also need to override PAT_ALPHABETIC from utils, because that is where the punctuation gets removed:

PAT_ALPHABETIC = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)

Then you may also need to override utils.tokenize and utils.simple_tokenize, in case you want to make further changes to the code.
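
To check what such a pattern keeps before touching gensim itself, it can be tried in isolation. This is a minimal sketch with the standard re module; PAT_WITH_PUNCT and tokens_with_sentence_marks are illustrative names, and the character class is an assumption about which marks should survive.

```python
import re

# Word characters (no digits) plus '.', '!' and '?' are allowed inside a token.
PAT_WITH_PUNCT = re.compile(r'(((?![\d])[\w.!?])+)', re.UNICODE)


def tokens_with_sentence_marks(text):
    """Hypothetical helper: tokenize while keeping '.', '!' and '?'."""
    return [match.group() for match in PAT_WITH_PUNCT.finditer(text)]


print(tokens_with_sentence_marks("Is it 3 apples? Yes. Great!"))
print(tokens_with_sentence_marks("banana, apple."))
```

Only the marks listed in the character class are kept. The comma in the second example is still dropped, so any other punctuation that matters has to be added to the class explicitly.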

How to get the Wikipedia corpus text with punctuation by using gensim wikicorpus?
