Question

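I'm a beginner in Python, and I was wondering if someone could help me with this problem.

I have a large text file with more than 6 million lines, but each line contains only a pair "x,y", where x and y are relatively small numbers.

What I need to do, in Python, is count every occurrence of each "x,y" pair in the file and then write the counts to an Excel document, with each row representing a "y" and each column an "x".

I have a program that works, but the file is so big that it would take more than a year to finish.

So I was wondering if there is a faster way to do this.

Keep in mind that I'm really not that good at programming, as I've only just started.

Thank you very much for any potential answers.

Here is my code so far: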

import xlsxwriter

book = xlsxwriter.Workbook("MyCount.xlsx")

sheet1 = book.add_worksheet('Sheet 1')

sheet1.write(0,0,'y\\x')

for i in range (0,1441):
    sheet1.write(0,i+1,i)

for i in range (1,118):
    sheet1.write(i,0,i)

file1=open("Data.txt","r")

count=0

for x in range (0, 1441):
    for y in range (1, 118):
        count=0
        number=f'{x}'+','+f'{y}'+'\n'
        for line in file1.readlines():
            if line == number:
                count+=1
        sheet1.write(y, x+1, count)
        file1.seek(0)

file1.close()
book.close()

Answer 1

So take a look at this:

counts = {}

for line in open("data.txt", "r"):
    line = line.split(',')

    number_1 = None
    number_2 = None

    for line_element in line:

        try:
            number = int(line_element)
            if number_1 is None:
                number_1 = number
            else:
                number_2 = number
        except Exception:
            pass

    # Compare against None explicitly so that a value of 0 is still counted
    if number_1 is not None and number_2 is not None:
        numbers_couple = '{},{}'.format(number_1, number_2)

        if numbers_couple in counts:
            counts[numbers_couple] += 1
        else:
            counts[numbers_couple] = 1

print(counts)

Contents of my data.txt:

a,b,c,20,30,dad,glaas
fdls,cafd,erer,fdesf,2,4534
fdls,cafd,erer,fdesf,2,11

Result:

{
   '20,30': 1, 
   '2,4534': 1, 
   '2,11': 1
}

You can use this result to write to a new file by splitting each dictionary key to get x and y.
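A minimal sketch of that last step, assuming the `counts` dictionary has the `'x,y'` string keys shown in the result. The helper name `counts_to_rows` and the CSV output are illustrative assumptions, not part of the answer's code; an in-memory buffer stands in for a real output file.

```python
import csv
import io

def counts_to_rows(counts):
    """Turn {'x,y': n} into a header row plus one row per y value."""
    pairs = {}
    for key, n in counts.items():
        x_text, y_text = key.split(",")  # split the key back into x and y
        pairs[(int(x_text), int(y_text))] = n
    xs = sorted({x for x, _ in pairs})
    ys = sorted({y for _, y in pairs})
    rows = [["y\\x"] + xs]  # header: one column per x
    for y in ys:
        # pairs that never occurred get a count of 0
        rows.append([y] + [pairs.get((x, y), 0) for x in xs])
    return rows

counts = {"20,30": 1, "2,4534": 1, "2,11": 1}
rows = counts_to_rows(counts)

buffer = io.StringIO()  # stand-in for open("counts.csv", "w", newline="")
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Each row then corresponds to one y and each column to one x, which is the layout the question asks for.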

So this way I have counted the pairs of numbers in the file. Is this what you wanted? Please let me know.

Answer 2

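Here is an (untested...) improved version of Alexandru's solution (NB: I was already writing this answer when Alexandru posted his own, but since he posted first, please give him the credit if it solves your problem).

The general idea is to make one single pass over the file instead of 170038 (=> 1441 * 118) sequential scans in a row, and to reduce the number of calls to sheet.write() to the number of distinct pairs found instead of rewriting the same cells over and over again.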

Using functions will also help the code run faster, since local variable access is faster than global variable access.
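A small sketch to illustrate the local-versus-global point, assuming CPython. The function names and the loop size `N` are made up for this example, and the actual timings vary by machine and Python version.

```python
import timeit

N = 200_000
total = 0  # module-level (global) accumulator

def sum_with_global():
    """Accumulate into a global name, looked up in the module namespace."""
    global total
    total = 0
    for i in range(N):
        total += i
    return total

def sum_with_local():
    """Accumulate into a local name, which CPython stores in a fast slot."""
    result = 0
    for i in range(N):
        result += i
    return result

def compare(repeats=5):
    """Return (global_seconds, local_seconds) for the two variants."""
    t_global = timeit.timeit(sum_with_global, number=repeats)
    t_local = timeit.timeit(sum_with_local, number=repeats)
    return t_global, t_local
```

Calling `print(compare())` shows both timings side by side; both functions compute the same sum, so only the variable access differs.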

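I'm not sure this will be fast enough to solve your problem, but it should at least be much faster than your current implementation.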

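NB: a {(int,int):int} dict with 6 million entries should easily fit in memory on most modern computers (I just tried it on mine, which is already quite busy), so that shouldn't be an issue (and you are already reading the whole file into memory, which is probably much heavier memory-wise...)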
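A minimal sketch of the single-pass idea described in this answer, not the answerer's original code. It assumes the `Data.txt` file name, the `x,y` line format and the cell layout (row `y`, column `x + 1`) from the question; `count_pairs`, `write_counts` and `main` are hypothetical names. Header cells are left out, and pairs that never occur stay empty instead of being written as 0.

```python
from collections import Counter

def count_pairs(lines):
    """Count each (x, y) pair in one pass over an iterable of lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y = line.split(",")
        counts[(int(x), int(y))] += 1
    return counts

def write_counts(sheet, counts):
    """Write one cell per distinct pair: row y, column x + 1."""
    for (x, y), n in counts.items():
        sheet.write(y, x + 1, n)

def main(source="Data.txt", target="MyCount.xlsx"):
    # Third-party dependency, imported here so the helpers stay stdlib-only.
    import xlsxwriter

    book = xlsxwriter.Workbook(target)
    sheet = book.add_worksheet("Sheet 1")
    with open(source) as f:
        counts = count_pairs(f)  # iterates the file line by line
    write_counts(sheet, counts)
    book.close()
```

Calling `main()` reads the file once and calls `sheet.write()` once per distinct pair, so the work grows with the number of lines rather than with 1441 * 118 full scans.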

Answer 3

I think this would be a more elegant solution for you: read the file into a pandas DataFrame, then group by the pair and count.

import pandas as pd
d = [(1,2,3),(1,2,4),(1,2,1),(1,1,5),(1,4,5),(1,1,8)]

cntdt = pd.DataFrame(d,columns=['x','y','cnt'])
cntdt.head()

s = cntdt.groupby(['y','x']).size()

#to get the dataframe
s.to_frame('count').reset_index()

#to get the dictionary
s.to_dict()

Dictionary output: {(1, 1): 2, (2, 1): 3, (4, 1): 1}

DataFrame output:

   y  x  count
0  1  1      2
1  2  1      3
2  4  1      1
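A sketch of how the same grouping might be applied to the file itself, assuming each line is a headerless `x,y` pair as in the question. The function name `pair_count_grid` is hypothetical, and an in-memory sample stands in for `Data.txt`.

```python
import io
import pandas as pd

def pair_count_grid(source):
    """Read headerless x,y lines and return a grid of counts (rows y, columns x)."""
    df = pd.read_csv(source, header=None, names=["x", "y"])
    counts = df.groupby(["y", "x"]).size()
    # Move x from the index to the columns; pairs that never occur become 0
    return counts.unstack("x", fill_value=0)

sample = io.StringIO("0,1\n0,1\n3,2\n0,2\n")  # stand-in for "Data.txt"
grid = pair_count_grid(sample)
print(grid)
```

The resulting grid could then be saved with `grid.to_excel("MyCount.xlsx")`, which needs an Excel writer package such as openpyxl or xlsxwriter to be installed.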

How to count every occurrence of each number in a large file
