Fastest way to create a DataFrame from hundreds of CSV files

Time: 2018-02-14 16:48:28

Tags: python pandas

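I have 150 CSV files, each with two columns (time and site). I want to read each file, build a frequency dictionary ({'site': [site_number, number of occurrences of the site]}), and create a DataFrame with 11 columns (user_id, site1, site2, ..., site10), where user_id is parsed from the file name (../user0001.csv). Each row of the DataFrame is one session of 10 site visits. My code takes 150 seconds for 150 files (very bad). How can I improve it?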

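from glob import glob
import os
import re

import pandas as pd


def prepare_3(path_to_csv_files, session_length=10):
  # site -> [site_number, number of occurrences]; numbers follow first appearance
  word_freq = {}
  word_count = 0
  rows = []

  columns = ['user_id'] + ['site' + str(i) for i in range(1, session_length + 1)]

  for path in sorted(glob(path_to_csv_files)):
    # Take the digits right before ".csv" in the file name (user0001.csv -> 1).
    # Searching the whole path for '.' would hit a leading "../" first.
    user = int(re.search(r'(\d+)\.csv$', os.path.basename(path)).group(1))
    frame = [user]
    site_count = 0

    with open(path, 'r') as f:
      f.readline()  # skip the header line
      for line in f:
        # Everything after the first comma is the site name
        site = line[line.find(',') + 1:].rstrip()
        site_count += 1

        if site in word_freq:
          word_freq[site][1] += 1
        else:
          word_count += 1
          word_freq[site] = [word_count, 1]

        if site_count > session_length:
          # The current session is full: store it and start a new one
          site_count = 1
          rows.append(frame)
          frame = [user]
        frame.append(word_freq[site][0])

    # Pad the last (possibly short) session with zeros so that every row has
    # the same length and the DataFrame can stay integer-typed
    frame.extend([0] * (len(columns) - len(frame)))
    rows.append(frame)

  df = pd.DataFrame(data=rows, columns=columns, dtype=int)
  return df, word_freq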
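
One direction that might be worth profiling, as a sketch rather than a definitive answer: let pandas parse the files, number all sites in a single pass with pd.factorize, and build the sessions by reshaping NumPy arrays instead of appending to Python lists. The name prepare_fast is hypothetical. The sketch assumes that every file has a header row and exactly two columns, that no site value is missing, that file names end in digits followed by .csv, and that the glob pattern matches at least one file.

```python
from glob import glob
import os
import re

import numpy as np
import pandas as pd


def prepare_fast(path_to_csv_files, session_length=10):
    """Hypothetical variant of prepare_3 built on factorize and reshape."""
    user_ids, site_series = [], []
    for path in sorted(glob(path_to_csv_files)):
        # Assumption: the file name ends with digits followed by ".csv"
        match = re.search(r'(\d+)\.csv$', os.path.basename(path))
        user_ids.append(int(match.group(1)))
        # Read only the second column (the site); the header row is skipped
        sites = pd.read_csv(path, usecols=[1], dtype=str).iloc[:, 0]
        site_series.append(sites)

    lengths = [len(s) for s in site_series]
    all_sites = pd.concat(site_series, ignore_index=True)

    # factorize numbers the sites in order of first appearance (0-based codes)
    codes, uniques = pd.factorize(all_sites)
    site_ids = codes + 1  # 1-based, so that 0 can serve as padding
    counts = np.bincount(codes, minlength=len(uniques))
    site_freq = {site: [i + 1, int(count)]
                 for i, (site, count) in enumerate(zip(uniques, counts))}

    blocks = []
    start = 0
    for user_id, n in zip(user_ids, lengths):
        ids = site_ids[start:start + n]
        start += n
        # Ceiling division; an empty file still yields one all-zero session
        n_sessions = max(1, -(-n // session_length))
        padded = np.zeros(n_sessions * session_length, dtype=np.int64)
        padded[:n] = ids
        sessions = padded.reshape(n_sessions, session_length)
        user_col = np.full((n_sessions, 1), user_id, dtype=np.int64)
        blocks.append(np.hstack([user_col, sessions]))

    columns = ['user_id'] + ['site%d' % i for i in range(1, session_length + 1)]
    df = pd.DataFrame(np.vstack(blocks), columns=columns)
    return df, site_freq
```

Because the files are concatenated in sorted order before factorizing, the site numbers should match those assigned by the loop in prepare_3. Whether this is actually faster depends on file sizes, so timing both versions on the real data is the only reliable check.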

0 answers:

No answers