Question

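I pulled data from a text file and now have a list with a large number of URLs, some of them duplicates, each with a Unix timestamp (tab-separated). I want to produce output containing each unique URL, the number of times it occurs, and the earliest time it occurred. This is what the data looks like: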

url1     1441076681663   
url2     1441076234873   
url2     1441123894050   
url2     1441432348975   
url3     1441659082347   
url1     1441450392840

I would like this to be my output, in a CSV file:

url    count    time
url1    2       1441076681663
url2    3       1441076234873
url3    1       1441659082347

I was thinking of using a dictionary, but I am not sure how to replace the stored time with the earliest one. Maybe some kind of for/if loop?

Answer 1

Make the URL the dictionary key, since keys are always unique. You can maintain a dictionary like

Dict = {url1 : [mintime, count]} #to track minimum and count

or

Dict = {url1 : [time1, time2, time3]} # to track all timestamps
# I would prefer this one if space is not a constraint, since it keeps more information

Code for the second data structure:

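Dict = {}  # maps each url to the list of its timestamps

with open("file.txt", "r") as file:  # read the input file
    for line in file:
        mylist = line.split()  # split on whitespace (covers tabs)
        if len(mylist) < 2:
            continue  # skip blank or malformed lines
        key = mylist[0]
        value = int(mylist[1])  # convert so timestamps compare numerically
        if key in Dict:
            Dict[key].append(value)  # url already exists as a key
        else:
            Dict[key] = [value]

print(Dict)
# For each url, the count is len(Dict[url]) and the earliest time is min(Dict[url])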
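
The first structure mentioned above, {url: [mintime, count]}, has no code in this answer. A minimal sketch of it might look like the following. The name stats is illustrative, and the sample lines from the question are inlined so the snippet runs on its own; a real script would loop over the open input file instead.

```python
# Sample lines from the question, inlined so the sketch is self-contained.
lines = [
    "url1\t1441076681663",
    "url2\t1441076234873",
    "url2\t1441123894050",
    "url2\t1441432348975",
    "url3\t1441659082347",
    "url1\t1441450392840",
]

stats = {}  # url -> [earliest timestamp, count]; "stats" is an illustrative name

for line in lines:
    parts = line.split()
    if len(parts) != 2:
        continue  # skip blank or malformed lines
    url, timestamp = parts[0], int(parts[1])
    if url in stats:
        entry = stats[url]
        entry[0] = min(entry[0], timestamp)  # keep the earliest time seen so far
        entry[1] += 1                        # one more occurrence of this url
    else:
        stats[url] = [timestamp, 1]          # first time this url is seen

for url, (mintime, count) in stats.items():
    print(url, count, mintime)
```

This keeps only two numbers per URL, so memory use does not grow with the number of repeated entries, at the cost of discarding the other timestamps.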

Answer 2

Here is a solution that uses only the Python standard library.

import csv
from collections import defaultdict

d = defaultdict(list)  # url -> list of timestamps
with open('input.txt', 'r') as f:
    for line in f:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        url, timestamp = fields
        d[url].append(int(timestamp))

# newline='' (Python 3) stops the csv module from writing blank rows on Windows
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'count', 'time'])
    for url, timestamps in d.items():
        writer.writerow([url, len(timestamps), min(timestamps)])

Answer 3

This is a case where a Counter object may also be useful: https://docs.python.org/2/library/collections.html

Here is an implementation:

from collections import Counter

# Get list of data
my_list = []
my_list.append(('url1', 1441076681663))
my_list.append(('url2', 1441076234873))
my_list.append(('url2', 1441123894050))
my_list.append(('url2', 1441432348975))
my_list.append(('url3', 1441659082347))
my_list.append(('url1', 1441450392840))

# First get the count
my_counter = Counter([pair[0] for pair in my_list])

# Then find the first instance
my_dict = {}

for pair in my_list:

    key = pair[0]
    val = pair[1]
    if (key not in my_dict) or (my_dict[key] > val):
        my_dict[key] = val

print("URL\tCount\tFirst Instance")
for key in my_dict:
    print("%s\t%s\t%s" % (key, my_counter[key], my_dict[key]))

Answer 4

Here is a solution using pandas.
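
A minimal sketch of how a pandas approach might look. The column names url and time are assumptions (the input has no header row), and the sample data from the question is inlined through io.StringIO so the snippet runs on its own; a real script would pass the input file's path to read_csv.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
raw = (
    "url1\t1441076681663\n"
    "url2\t1441076234873\n"
    "url2\t1441123894050\n"
    "url2\t1441432348975\n"
    "url3\t1441659082347\n"
    "url1\t1441450392840\n"
)

# "url" and "time" are assumed column names; the file itself has no header.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["url", "time"])

# Group rows by url, then compute the row count and the earliest timestamp.
summary = (
    df.groupby("url")["time"]
    .agg(["count", "min"])
    .rename(columns={"min": "time"})
    .reset_index()
)

# Render as CSV text; passing a file path to to_csv would write a file instead.
print(summary.to_csv(index=False))
```

groupby sorts the keys by default, so the rows come out ordered by URL.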
