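I wrote this to iterate through a series of Facebook user likes. The scrubbing process has the code first pick a user, then pick one of that user's likes, then pick a character from that like. If too many characters in a like are not English characters (that is, not in the alphanum string), the like is assumed to be gibberish and is removed.
This filtering process continues through all likes and all users. I know nested loops are a no-no, but I don't see a way to do this without a triple nested loop. Any suggestions? Also, if anyone has any other efficiency or convention advice, I would love to hear it.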
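As a sketch of the per-like check described above, the innermost level can be pulled out into its own function, so the user loop only ever sees a list of likes. The nesting is still there conceptually, but each level lives in a separate, testable unit. This is a minimal Python 3 sketch, not the original code: `ALLOWED`, `TOO_LONG`, `PREFIX`, `is_gibberish` and `filter_likes` are hypothetical names, and the thresholds are copied from `cleaner()` below.

```python
# Hypothetical names for illustration; values mirror those used in cleaner().
ALLOWED = frozenset(
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 $%@-'
)
TOO_LONG = 30  # likes longer than this are dropped
PREFIX = 5     # only the first few characters of a like are inspected


def is_gibberish(like):
    """Return True if the like should be dropped by the character filter."""
    if not like or len(like) > TOO_LONG:
        return True
    head = like[:PREFIX]
    allowed_count = sum(1 for ch in head if ch in ALLOWED)
    # At most one character of the inspected prefix may fall outside ALLOWED.
    return allowed_count < len(head) - 1


def filter_likes(likes):
    """Keep only the likes that pass the character filter, in order."""
    return [like for like in likes if not is_gibberish(like)]
```

Using a `frozenset` makes each character lookup a hash lookup instead of a scan of the string, and building a new list avoids calling `list.remove` repeatedly on the list being cleaned.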
# Python 2 code. make_like_list, user_counter and file_write_path are
# defined elsewhere (not shown).
import os


def cleaner(likes_path):
    '''
    estimated run time for 170k users: 3min
    this method takes a given csv format datasheet of noisy facebook likes.
    data is scrubbed row by row (meaning user by user) removing 'likes' that are not useful
    data is parsed into manageable size specified files.
    if more data is continuously added method will just keep adding new files
    if more data is added at a later time choosing a new folder to put it in would
    work best so that the update method can add it to existing counts instead
    of starting over
    '''
    with open(os.path.join(likes_path)) as likes:
        dct = [0]
        file_num = 0
        #initializes naming scheme for self-numbering files
        file_size = 30000
        #sets file size to 30000 userId's
        alphanum = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 $%@-'
        user_count = 0
        too_big = 1000
        too_long = 30
        for rows in likes:
            repeat_check = []
            user_count += 1
            user_likes = make_like_list(rows)
            to_check = user_likes[1:]
            if len(to_check) < too_big:
                #users with more than 1000 likes take up much more resources/time
                #and are of less analytical value
                for like in to_check:
                    if len(like) > too_long or len(like) == 0:
                        #This changes the filter sensitivity. Most useful likes
                        #are under 30 char long
                        user_likes.remove(like)
                    else:
                        letter_check = sum(1 for letter in like[:5] if letter in alphanum)
                        if letter_check < len(like[:5])-1:
                            user_likes.remove(like)
                if len(user_likes) > 1 and len(user_likes[0]) == 32:
                    #userID's are 32 char long, this filters out some mistakes
                    #filters out users with no likes
                    scrubbed_to_check = user_likes[1:]
                    for like in scrubbed_to_check:
                        if like == 'Facebook' or like == 'YouTube':
                            #youtube and facebook are very common likes but
                            #aren't very useful
                            user_likes.remove(like)
                        #removes duplicate likes
                        elif like not in repeat_check:
                            repeat_check.append(like)
                        else:
                            user_likes.remove(like)
                    scrubbed_rows = '"'+'","'.join(user_likes)+'"\n'
                    if user_count%file_size == 1:
                        #This block allows for data to be parsed into
                        #multiple smaller files
                        file_num += 1
                        dct.append(file_num)
                        dct[file_num] = open(file_write_path + str(file_num) +'.csv', 'w')
                        if file_num != 1:
                            dct[file_num-1].close()
                    dct[file_num].writelines(scrubbed_rows)
            if user_counter(user_count, 'Users Scrubbed:', 200000):
                break
    print 'Total Users Scrubbed:', user_count
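The two remaining pieces of the loop above, removing over-common or repeated likes and splitting the output into numbered files, could be sketched as follows. This is a Python 3 sketch under stated assumptions, not the original method: `TOO_COMMON`, `dedupe_likes` and `write_in_chunks` are hypothetical names, the output directory is passed in rather than read from `file_write_path`, and the standard `csv` module is swapped in for the manual quoting. It also keeps the first occurrence of each like, whereas `list.remove` in the loop above drops the earliest copy of a duplicate, and it counts rows actually written rather than rows read.

```python
import csv
import os

# Assumed stop list, taken from the likes skipped in cleaner().
TOO_COMMON = frozenset(['Facebook', 'YouTube'])


def dedupe_likes(likes):
    """Drop stop-listed likes and repeats, keeping first occurrences in order."""
    seen = set()
    kept = []
    for like in likes:
        if like in TOO_COMMON or like in seen:
            continue
        seen.add(like)
        kept.append(like)
    return kept


def write_in_chunks(rows, out_dir, chunk_size=30000):
    """Write rows to 1.csv, 2.csv, ... in out_dir, chunk_size rows per file.

    Returns the number of files written.
    """
    out = None
    writer = None
    file_num = 0
    try:
        for i, row in enumerate(rows):
            if i % chunk_size == 0:
                # Start a new file at the first row of every chunk.
                if out is not None:
                    out.close()
                file_num += 1
                path = os.path.join(out_dir, '%d.csv' % file_num)
                out = open(path, 'w', newline='')
                writer = csv.writer(out, quoting=csv.QUOTE_ALL,
                                    lineterminator='\n')
            writer.writerow(row)
    finally:
        # Make sure the last file is closed even if writing fails.
        if out is not None:
            out.close()
    return file_num
```

A set gives constant-time membership tests, so checking for repeats no longer scans a growing list for every like. Because the chunk boundary is tied to written rows, a new file is opened only when there is a row to put in it.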