Question

I have the following file, which contains more than 500,000 lines. The lines look like this:

0-0 0-1 1-2 1-3 2-4 3-5
0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7
0-9 1-8 2-14 3-7 5-6 4-7 5-8 6-10 7-11

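In each tuple, the first number is the index of a word on line n of text a, and the second number is the index of a word on the same line n of text b. Note that the same word in text a may be linked to more than one word in text b; for example, in the line at index 0, the word at position 0 in text a is linked to the two words at positions 0 and 1 in text b. I want to extract the information from lines like these so that it is easy to look up which word in text a is linked to which word in text b. My idea is to use a dictionary, as in the code below: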

# suppose that I have opened the file as f
for line in f.readlines():
    # I create a dictionary to save my results
    dict_st = dict()
    # I split the line to get items like '0-0', '0-1', etc.
    items = line.split()
    for item in items:
        # I split each item at the hyphen to get the two numbers, which are still strings
        als = item.split('-')
        # I fill the dictionary
        if int(als[0]) not in dict_st:
            dict_st[int(als[0])] = [int(als[1])]
        else:
            dict_st[int(als[0])].append(int(als[1]))

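After extracting all the word correspondences between the two texts, I print out the words that are aligned with each other. This method is slow, especially since I have to repeat it for more than 500,000 sentences. I would like to know whether there is a faster way to extract this information. Thanks.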

Answer 1

Hi, I'm not sure whether this is what you need.

If you need a dictionary for each line:

for line in f:
    dict_st=dict()
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)

If you need a single dictionary for the whole file:

dict_st={}
for line in f:
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, set()).add(v)

I used a set instead of a list to prevent duplicate values. If you need those duplicates, use a `list`:

dict_st={}
for line in f:
    for item in line.split():
        k, v = map(int, item.split('-'))
        dict_st.setdefault(k, []).append(v)
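
As a variation on the snippets above, here is a minimal sketch that uses collections.defaultdict instead of setdefault and keeps one dictionary per line. The function names parse_alignment_line and parse_alignments are hypothetical, and io.StringIO stands in for the opened file f; whether this is actually faster on your data is something you would need to measure.

```python
from collections import defaultdict
import io


def parse_alignment_line(line):
    """Map each word index in text a to the list of linked indices in text b."""
    links = defaultdict(list)
    for item in line.split():
        src, tgt = map(int, item.split('-'))
        links[src].append(tgt)
    # Convert back to a plain dict so missing keys raise KeyError later on.
    return dict(links)


def parse_alignments(lines):
    """Return one dictionary per input line, in file order."""
    return [parse_alignment_line(line) for line in lines]


# io.StringIO is a stand-in for the real file object (an assumption for the demo).
sample = io.StringIO(
    "0-0 0-1 1-2 1-3 2-4 3-5\n"
    "0-1 0-2 1-3 2-4 3-5 4-6 5-7 6-7\n"
)
per_line = parse_alignments(sample)
```

With this layout, per_line[n][i] gives the indices in text b that are linked to word i on line n of text a.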

N.B. You can iterate over the file directly (for line in f) instead of calling readlines(), so the whole file is not read into memory.
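
To illustrate that note, the sketch below wraps the per-line version in a generator, so only one line and one dictionary are held in memory at a time. The name iter_alignments and the temporary demo file are assumptions made for the example, not part of the original code.

```python
import os
import tempfile


def iter_alignments(path):
    """Lazily yield one {index_in_a: set_of_indices_in_b} dict per line."""
    with open(path) as f:
        for line in f:  # the file object yields one line at a time
            links = {}
            for item in line.split():
                k, v = map(int, item.split('-'))
                links.setdefault(k, set()).add(v)
            yield links


# Write a tiny demo file (hypothetical data taken from the question's format).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("0-0 0-1 1-2 1-3 2-4 3-5\n0-9 1-8 2-14\n")
    demo_path = tmp.name

results = list(iter_alignments(demo_path))
os.remove(demo_path)
```

In real use you would loop over iter_alignments(path) directly rather than building a list, so the 500,000 dictionaries never exist at the same time.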
