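I pulled data from a text file and now have a list containing a large number of URLs, some of them duplicates, each with a Unix timestamp (tab-separated). I want to create output that contains each unique URL, the number of times that URL occurs, and the earliest time it occurred. This is what the data looks like: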
url1 1441076681663
url2 1441076234873
url2 1441123894050
url2 1441432348975
url3 1441659082347
url1 1441450392840
I would like this as my output, in a CSV file:
url count time
url1 2 1441076681663
url2 3 1441076234873
url3 1 1441659082347
I was thinking of using a dictionary, but I'm not sure how to keep only the earliest time for each URL. Maybe some kind of for/if loop?
Answer 0: (score: 0)
Make the URL the key of a dictionary, since a key is always unique. You can maintain a dictionary like
Dict = {url1: [mintime, count]}  # to track the minimum and the count
or
Dict = {url1: [time1, time2, time3]}  # to track all timestamps
# I would prefer this one if you don't have space constraints, as you would get more info
Code for the second data structure:
Dict = {}  # empty dictionary
with open("file.txt", "r") as file:  # read the file
    for line in file:
        mylist = line.split()  # split on whitespace (tabs included)
        if len(mylist) >= 2:  # skip blank or malformed lines
            key = mylist[0]
            value = int(mylist[1])  # store as int so min() compares numerically
            if key in Dict:
                Dict[key].append(value)  # URL already exists as a key
            else:
                Dict[key] = [value]
print("No more lines to render")
print(Dict)
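For comparison, the first structure mentioned above, {url: [mintime, count]}, could be filled in a single pass and written out as CSV. This is only a minimal sketch under assumptions: the function names `earliest_and_count` and `write_csv` and the in-memory `sample` list are hypothetical and not part of the original answer, and the input is assumed to be `url<TAB>timestamp` lines.

```python
import csv
import io


def earliest_and_count(lines):
    """Return {url: [min_time, count]} from 'url<TAB>timestamp' lines."""
    stats = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        url, timestamp = fields[0], int(fields[1])
        if url in stats:
            entry = stats[url]
            entry[0] = min(entry[0], timestamp)  # keep the earliest time
            entry[1] += 1                        # one more occurrence
        else:
            stats[url] = [timestamp, 1]
    return stats


def write_csv(stats, out):
    """Write url, count, time rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["url", "count", "time"])
    for url, (min_time, count) in stats.items():
        writer.writerow([url, count, min_time])


# Sample data from the question (hypothetical in-memory stand-in for the file).
sample = [
    "url1\t1441076681663",
    "url2\t1441076234873",
    "url2\t1441123894050",
    "url2\t1441432348975",
    "url3\t1441659082347",
    "url1\t1441450392840",
]

stats = earliest_and_count(sample)
buffer = io.StringIO()
write_csv(stats, buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, `write_csv` can be given a file opened with `open("output.csv", "w", newline="")`. Compared with storing every timestamp, this keeps only two values per URL, at the cost of losing the later timestamps.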
Answer 1: (score: 0)
Here is a solution that uses only the Python standard library.
import csv
from collections import defaultdict

d = defaultdict(list)  # maps each url to the list of its timestamps
with open('input.txt', 'r') as f:
    for line in f:
        url, timestamp = line.split()
        d[url].append(int(timestamp))

# newline='' keeps the csv module from writing blank rows on Windows (Python 3)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'count', 'time'])
    for url, timestamps in d.items():
        # count is the number of timestamps, time is the earliest one
        writer.writerow([url, len(timestamps), min(timestamps)])
Answer 2: (score: -1)
This is a case where a Counter object may also be useful: https://docs.python.org/2/library/collections.html
Here is an implementation:
from collections import Counter

# Get list of data
my_list = []
my_list.append(('url1', 1441076681663))
my_list.append(('url2', 1441076234873))
my_list.append(('url2', 1441123894050))
my_list.append(('url2', 1441432348975))
my_list.append(('url3', 1441659082347))
my_list.append(('url1', 1441450392840))

# First get the count
my_counter = Counter([pair[0] for pair in my_list])

# Then find the first (earliest) instance
my_dict = {}
for pair in my_list:
    key = pair[0]
    val = pair[1]
    if (key not in my_dict) or (my_dict[key] > val):
        my_dict[key] = val

print("URL\tCount\tFirst Instance")
for key in my_dict:
    print("{}\t{}\t{}".format(key, my_counter[key], my_dict[key]))
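The implementation above hard-codes its input. As a rough sketch of how the same `my_list` of (url, timestamp) pairs might be built from tab-separated lines instead, assuming each line looks like `url<TAB>timestamp`: the helper `parse_pairs` and the `sample_lines` list are hypothetical names introduced here for illustration only.

```python
from collections import Counter


def parse_pairs(lines):
    """Turn 'url<TAB>timestamp' lines into a list of (url, int) tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        pairs.append((fields[0], int(fields[1])))
    return pairs


# Hypothetical stand-in for lines read from the input file.
sample_lines = [
    "url1\t1441076681663\n",
    "url2\t1441076234873\n",
    "url2\t1441123894050\n",
    "\n",  # a blank line, which parse_pairs skips
    "url2\t1441432348975\n",
    "url3\t1441659082347\n",
    "url1\t1441450392840\n",
]

my_list = parse_pairs(sample_lines)
my_counter = Counter(url for url, _ in my_list)
print(my_counter)
```

With a real file, the lines could come from `open(...)` directly, since iterating over a file object yields one line at a time.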
Answer 3: (score: -1)
Here is a solution using pandas.