Question

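I am developing a program that will give me the positional inverted index of a set of XML files. First, I need to change the type of the document numbers from string to integer so that I can use them later.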

Part of my code is as follows:

def index(document_directory, dictionary_file, postings_file):
    # preprocess docID list

        docID_list = [int(docID_string) for docID_string in os.listdir(document_directory)]
        docID_list.sort()
        stemmer = PorterStemmer()
        stopwords = nltk.corpus.stopwords.words('english')
        # stopwords = set(stopwords.words('english'))
        docs_indexed = 0    # counter for the number of docs indexed
        dictionary = {}     # key: term, value: docIDs containing term (includes repeats)
            # for each document in corpus
        for docID in docID_list:
                if (LIMIT and docs_indexed == LIMIT): break
.
.
.
.
.
            # open files for writing   
        dict_file = codecs.open(dictionary_file, 'w', encoding='utf-8')
        post_file = open(postings_file, 'wb')
.
.
.
.
            # close files
        dict_file.close()
        post_file.close()    
.
.
.
.

"""
prints the proper command usage
"""
def print_usage():
    print ("usage: " + sys.argv[0] + " -i directory-of-documents -d dictionary-file -p postings-file")

.
.
.
if (RECORD_TIME): start = timeit.default_timer()                              # start time
index(document_directory, dictionary_file, postings_file)   # call the indexer
if (RECORD_TIME): stop = timeit.default_timer()                               # stop time
if (RECORD_TIME): print ('Indexing time:' + str(stop - start))                # print time taken

Now, when I run it with the command:

$ python def_ind.py -i "./index/" -d "output1111.txt" -p "output222.txt"

I get the following error:

Traceback (most recent call last):
  File "def_ind.py", line 161, in <module>
    index(document_directory, dictionary_file, postings_file)   # call the indexer
  File "def_ind.py", line 36, in index
    docID_list = [int(docID_string) for docID_string in os.listdir(document_directory)]
  File "def_ind.py", line 36, in <listcomp>
    docID_list = [int(docID_string) for docID_string in os.listdir(document_directory)]
ValueError: invalid literal for int() with base 10: '.DS_Store'

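I understand that there is a string that cannot be converted to int, but I don't know where it comes from. What should I do here?

I am trying to get output that shows how many times each word occurs in each document and on which lines. For example (document number: line numbers where the word was found):

and: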
    2: 5,7
    5: 5

flower:
    1: 8
    2: 4,6,8
    3: 6
    4: 6
    5: 6

A snapshot of my XML file:

    <DOCNO>1</DOCNO>
    <PROFILE>_AN-BENBQAD8FT</PROFILE>
    <DATE>910514
    </DATE>
    <HEADLINE>
    FT  14 MAY 91 / (CORRECTED) Jubilee of a jet that did what it was designed
    to do
    </HEADLINE>
    <TEXT>
       words, words, words
    </TEXT>
    <PUB>The Financial Times
    </PUB>
    <PAGE>
    London Page 7 Photograph (Omitted).
    </PAGE>
    </DOC>

I am using Python 3.7.

Note: I found many questions with the same error, but none of them fits my case.

Answer 1

The function os.listdir (see the documentation) returns the names of the entries in the given directory.
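
To see this in isolation, here is a small sketch that builds a throwaway directory and lists it. The file names (1, 2, 10 and .DS_Store) and the helper list_entries are made up for illustration; the point is only that os.listdir makes no distinction between your documents and hidden files such as the .DS_Store that macOS Finder drops into folders:

```python
import os
import tempfile

def list_entries(directory):
    """Return every entry name in `directory`, sorted for stable output."""
    # os.listdir returns names in arbitrary order and includes hidden files;
    # only the special entries '.' and '..' are left out.
    return sorted(os.listdir(directory))

# Hypothetical corpus: three documents named by number plus a Finder metadata file.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["1", "2", "10", ".DS_Store"]:
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("placeholder")
    entries = list_entries(demo_dir)

print(entries)
```

The hidden file shows up in the listing next to the numbered documents, which is how it ends up being passed to int().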

As your error states, you are trying to convert those names to integers. That is what causes the error, on this line:

docID_list = [int(docID_string) for docID_string in os.listdir(document_directory)]
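
One possible way around it, assuming your documents really are named with bare integers (1, 2, 3, ...) as the int() call suggests, is to skip every name that is not a plain number before converting. The helper names parse_doc_ids and list_doc_ids below are hypothetical, not part of your code:

```python
import os

def parse_doc_ids(names):
    """Convert decimal file names to ints, ignoring everything else."""
    # str.isdecimal() is True only for non-empty strings made of decimal digits,
    # so '.DS_Store' and similar stray entries are skipped rather than passed to int().
    return sorted(int(name) for name in names if name.isdecimal())

def list_doc_ids(document_directory):
    """Hypothetical replacement for the failing list comprehension."""
    return parse_doc_ids(os.listdir(document_directory))
```

Inside index() you could then write docID_list = list_doc_ids(document_directory). If your files carry an extension such as .xml, this filter would reject them all, so you would need to strip the extension first.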

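The code you pasted is a mess (perhaps it is just the indentation that broke after pasting it into StackOverflow); I can't tell what you are trying to accomplish there. As far as I can see, you never actually use the values in docID_list, you only iterate over it. So why convert the values to int at all?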
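
If the only reason for the conversion is to visit the documents in numeric order (that is a guess on my part), you can keep the names as strings and just sort them by their integer value. A minimal sketch, with ordered_doc_names as a made-up helper name:

```python
def ordered_doc_names(names):
    """Keep numeric file names as strings but order them numerically."""
    numeric_names = [name for name in names if name.isdecimal()]
    # key=int compares 2 < 10; a plain string sort would put '10' before '2'.
    return sorted(numeric_names, key=int)
```

The strings can then be joined with the directory path directly when opening each file, without converting back from int.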

ValueError: invalid literal for int() with base 10: '.DS_Store'