I have a data file that looks like this:
["Arts & Entertainment", "Arts & Entertainment / Animation & Comics", "Arts & Entertainment / Books & Literature", "Arts & Entertainment / Celebrity/Gossip", "Arts & Entertainment / Fine Art", "Arts & Entertainment / Humor", "Arts & Entertainment / Movies", "Arts & Entertainment / Movies / Action", "Arts & Entertainment / Movies / Comedy", "Arts & Entertainment / Movies / Documentary", "Arts & Entertainment / Movies / Drama", "Arts & Entertainment / Movies / Horror", "Arts & Entertainment / Music", "Arts & Entertainment / Music / Alternative Music", "Arts & Entertainment / Music / Blues", "Arts & Entertainment / Music / Christian Music", "Arts & Entertainment / Music / Classic Rock", "Arts & Entertainment / Music / Classical Music", "Arts & Entertainment / Music / Country Music", "Arts & Entertainment / Music / Electronic Dance Music", "Arts & Entertainment / Music / Heavy Metal", "Arts & Entertainment / Music / Pop Music", "Arts & Entertainment / Music / Rap", "Arts & Entertainment / Radio Stations", "Arts & Entertainment / Television", "Arts & Entertainment / Television / Game Show", "Arts & Entertainment / Television / Kids", "Arts & Entertainment / Television / News", "Arts & Entertainment / Television / Reality", "Arts & Entertainment / Television / Science", "Arts & Entertainment / Television / Sitcom", "Arts & Entertainment / Television / Soap Opera", "Arts & Entertainment / Television / Talk Show", "Autos", "Autos / 4-Wheel Drive/SUVs", "Autos / Buying/Selling Cars", "Autos / Certified Pre-Owned", "Autos / Convertible", "Autos / Coupe", "Autos / Crossover", "Autos / Diesel", "Autos / Electric Vehicles", "Autos / Hatchback", "Autos / Hybrid", "Autos / Luxury", "Autos / Maintenance", "Autos / Maintenance / Parts", "Autos / Maintenance / Repair", "Autos / MiniVan", "Autos / Motorcycles", "Autos / Off-Road Vehicles", "Autos / Road-Side Assistance", "Autos / Sedan", "Autos / Trucks", "Autos / Trucks / Pickup", "Autos / Vintage Cars", "Autos / Wagon", "Business & 
Industry", "Business & Industry / Advertising", "Business & Industry / Agriculture", "Business & Industry / Biotech/Biomedical", "Business & Industry / Business Software", "Business & Industry / Construction", "Business & Industry / Construction / Composites & Plastics", "Business & Industry / Forestry", "Business & Industry / Government", "Business & Industry / Green Solutions", "Business & Industry / Human Resources", "Business & Industry / Logistics", "Business & Industry / Marketing", "Business & Industry / Metals", "Business & Industry / Non-Profit Organizations", "Business & Industry / Power Industry", "Business & Industry / Public Services", "Business & Industry / Public Services / Emergency Services", "Business & Industry / Public Services / Waste Management", "Business & Industry / Purchasing", "Business & Industry / Retail Industry", "Business & Industry / Small Business", "Business & Industry / Telecom", "Career", "Career / Career Planning", "Career / Job Search", "Career / Job Search / Resume Writing/Advice", "Career / Telecommuting", "Career / U.S. 
Military", "Education", "Education / Business School", "Education / College Education", "Education / College Education / Admissions", "Education / College Education / College Life", "Education / Continuing Education", "Education / Distance Learning", "Education / Financial Aid", "Education / Financial Aid / Scholarships", "Education / Graduate School", "Education / Homeschooling", "Education / Language Learning", "Education / Language Learning / English as a 2nd Language", "Education / Primary Education", "Education / Secondary Education", "Education / Special Education", "Finance & Money", "Finance & Money / Credit/Debt & Loans", "Finance & Money / Day Trading", "Finance & Money / Exchange Traded Funds", "Finance & Money / Financial News", "Finance & Money / Financial Planning", "Finance & Money / Financial Planning / Retirement Planning", "Finance & Money / Financial Planning / Tax Planning", "Finance & Money / Foreign Exchange Trading", "Finance & Money / Hedge Fund", "Finance & Money / Insurance", "Finance & Money / Investing", "Finance & Money / Mutual Funds", "Finance & Money / Options", "Finance & Money / Stocks", "Food & Drink", "Food & Drink / Barbecues & Grilling", "Food & Drink / Beverages", "Food & Drink / Beverages / Cocktails/Beer", "Food & Drink / Beverages / Coffee/Tea", "Food & Drink / Beverages / Wine", "Food & Drink / Cuisine-Specific", "Food & Drink / Cuisine-Specific / American Cusine", "Food & Drink / Cuisine-Specific / Cajun/Creole", "Food & Drink / Cuisine-Specific / Chinese Cuisine", "Food & Drink / Cuisine-Specific / French Cuisine", "Food & Drink / Cuisine-Specific / Italian Food", "Food & Drink / Cuisine-Specific / Japanese Food", "Food & Drink / Cuisine-Specific / Mexican Cuisine", "Food & Drink / Desserts & Baking", "Food & Drink / Health/LowFat Cooking", "Food & Drink / Organic Food", "Food & Drink / Vegetarian", "Health & Fitness", "Health & Fitness / A.D.D.", "Health & Fitness / AIDS/HIV", "Health & Fitness / Allergies", "Health & 
Fitness / Alternative Medicine", "Health & Fitness / Alzheimer\\'s Disease", "Health & Fitness / Arthritis", "Health & Fitness / Asthma", "Health & Fitness / Autism/PDD", "Health & Fitness / Bipolar Disorder", "Health & Fitness / Brain Tumor", "Health & Fitness / Cancer", "Health & Fitness / Cancer / Breast Cancer", "Health & Fitness / Cancer / Lung Cancer", "Health & Fitness / Cancer / Prostate Cancer", "Health & Fitness / Cholesterol", "Health & Fitness / Chronic Fatigue Syndrome", "Health & Fitness / Chronic Obstructive Pulmonary Disease", "Health & Fitness / Chronic Pain", "Health & Fitness / Cold & Flu", "Health & Fitness / Deafness", "Health & Fitness / Dental Care", "Health & Fitness / Depression", "Health & Fitness / Dermatology", "Health & Fitness / Diabetes", "Health & Fitness / Epilepsy", "Health & Fitness / Exercise", "Health & Fitness / GERD/Acid Reflux", "Health & Fitness / Headaches/Migraines", "Health & Fitness / Heart Disease", "Health & Fitness / Heart Disease / Women\\'s Heart Disease", "Health & Fitness / Hepatitis", "Health & Fitness / Herbs for Health", "Health & Fitness / Holistic Healing", "Health & Fitness / Hypertension", "Health & Fitness / IBS/Crohn\\'s Disease", "Health & Fitness / Incest/Abuse Support", "Health & Fitness / Incontinence", "Health & Fitness / Infertility", "Health & Fitness / Men\\'s Health", "Health & Fitness / Nursing", "Health & Fitness / Nutrition", "Health & Fitness / Orthopedics", "Health & Fitness / Orthopedics / Sports Medicine", "Health & Fitness / Panic/Anxiety Disorders", "Health & Fitness / Pediatrics", "Health & Fitness / Pharmaceutical", "Health & Fitness / Physical Therapy", "Health & Fitness / Psychology/Psychiatry", "Health & Fitness / Senior Health", "Health & Fitness / Sexuality", "Health & Fitness / Sleep Disorders", "Health & Fitness / Smoking Cessation", "Health & Fitness / Substance Abuse", "Health & Fitness / Substance Abuse / Alcoholism", "Health & Fitness / Thyroid Disease", "Health & Fitness / 
Weight Loss", "Health & Fitness / Women\\'s Health", "Hobbies & Games", "Hobbies & Games / Arts & Crafts", "Hobbies & Games / Arts & Crafts / Beadwork", "Hobbies & Games / Arts & Crafts / Drawing/Sketching", "Hobbies & Games / Arts & Crafts / Needlework", "Hobbies & Games / Arts & Crafts / Painting", "Hobbies & Games / Arts & Crafts / Photography", "Hobbies & Games / Arts & Crafts / Woodworking", "Hobbies & Games / Astrology", "Hobbies & Games / Birdwatching", "Hobbies & Games / BoardGames/Puzzles", "Hobbies & Games / Candle & Soap Making", "Hobbies & Games / Card Games", "Hobbies & Games / Chess", "Hobbies & Games / Cigars", "Hobbies & Games / Collecting", "Hobbies & Games / Collecting / Antiques", "Hobbies & Games / Collecting / Book Collecting", "Hobbies & Games / Collecting / Miniatures", "Hobbies & Games / Collecting / Stamps & Coins", "Hobbies & Games / Creative Writing", "Hobbies & Games / Getting Published", "Hobbies & Games / Home Recording", "Hobbies & Games / Inventors & Patents", "Hobbies & Games / Learning a Musical Instrument", "Hobbies & Games / Learning a Musical Instrument / Guitar", "Hobbies & Games / Magic & Illusion", "Hobbies & Games / Paranormal Phenomena", "Hobbies & Games / Sci-Fi & Fantasy", "Hobbies & Games / Video Games", "Hobbies & Games / Video Games / Nintendo", "Hobbies & Games / Video Games / PSP", "Hobbies & Games / Video Games / Playstation", "Hobbies & Games / Video Games / RPG", "Hobbies & Games / Video Games / Racing", "Hobbies & Games / Video Games / X-Box", "Home & Garden", "Home & Garden / Appliances", "Home & Garden / Environmental Safety", "Home & Garden / Gardening/Landscaping", "Home & Garden / Home Repair", "Home & Garden / Interior Decorating", "News & Current Affairs", "News & Current Affairs / Law & Politics", "News & Current Affairs / Law & Politics / Immigration", "News & Current Affairs / Law & Politics / Legal Issues", "News & Current Affairs / Law & Politics / U.S. 
Government Resources", "Parenting & Family", "Parenting & Family / Adoption", "Parenting & Family / Babies & Toddlers", "Parenting & Family / Daycare/Pre-School", "Parenting & Family / Parenting Children", "Parenting & Family / Parenting Teens", "Parenting & Family / Pregnancy", "Parenting & Family / Special Needs Kids", "Pets", "Pets / Aquariums", "Pets / Cats", "Pets / Dogs", "Pets / Veterinary Medicine", "Real Estate", "Real Estate / Apartments", "Real Estate / Architecture", "Real Estate / Buying/Selling Homes", "Religion", "Religion / Alternative Religions", "Religion / Atheism/Agnosticism", "Religion / Buddhism", "Religion / Catholicism", "Religion / Christianity", "Religion / Hinduism", "Religion / Islam", "Religion / Judaism", "Religion / Latter-Day Saints", "Religion / Pagan/Wiccan", "Science", "Science / Astronomy", "Science / Biology", "Science / Chemistry", "Science / Geology", "Science / Physics", "Sensitive Content", "Sensitive Content / Gambling", "Sensitive Content / Gambling / Sports Gambling", "Society", "Society / Dating", "Society / Divorce", "Society / Gay Life", "Society / Marriage", "Society / Senior Living", "Society / Weddings", "Sports & Recreation", "Sports & Recreation / Auto Racing", "Sports & Recreation / Auto Racing / NASCAR Racing", "Sports & Recreation / Baseball", "Sports & Recreation / Basketball", "Sports & Recreation / Bicycling", "Sports & Recreation / Bicycling / Mountain Biking", "Sports & Recreation / Bodybuilding", "Sports & Recreation / Boxing", "Sports & Recreation / Canoeing/Kayaking", "Sports & Recreation / Cheerleading", "Sports & Recreation / Climbing", "Sports & Recreation / College Sports", "Sports & Recreation / Cricket", "Sports & Recreation / Figure Skating", "Sports & Recreation / Fishing", "Sports & Recreation / Fishing / Fly Fishing", "Sports & Recreation / Fishing / Freshwater Fishing", "Sports & Recreation / Fishing / Game & Fish", "Sports & Recreation / Fishing / Saltwater Fishing", "Sports & Recreation / 
Football", "Sports & Recreation / Golf", "Sports & Recreation / Horses", "Sports & Recreation / Horses / Horse Racing", "Sports & Recreation / Hunting/Shooting", "Sports & Recreation / Ice Hockey", "Sports & Recreation / Inline Skating", "Sports & Recreation / Martial Arts", "Sports & Recreation / Olympics", "Sports & Recreation / Paintball", "Sports & Recreation / Rodeo", "Sports & Recreation / Rugby", "Sports & Recreation / Running/Walking", "Sports & Recreation / Sailing", "Sports & Recreation / Scuba Diving", "Sports & Recreation / Skateboarding", "Sports & Recreation / Skiing", "Sports & Recreation / Snowboarding", "Sports & Recreation / Soccer", "Sports & Recreation / Surfing/Bodyboarding", "Sports & Recreation / Swimming", "Sports & Recreation / Table Tennis/Ping-Pong", "Sports & Recreation / Tennis", "Sports & Recreation / Volleyball", "Sports & Recreation / Waterski/Wakeboard", "Sports & Recreation / Yachting", "Style & Fashion", "Style & Fashion / Body Art", "Style & Fashion / Cosmetics", "Style & Fashion / Fashion", "Style & Fashion / Jewelry", "Technology & Computing", "Technology & Computing / Cameras & Camcorders", "Technology & Computing / Cell Phones", "Technology & Computing / Computer Certification", "Technology & Computing / Computer Networking", "Technology & Computing / Computer Peripherals", "Technology & Computing / Computer Security", "Technology & Computing / Computer Security / Antivirus Software", "Technology & Computing / Computer Security / Network Security", "Technology & Computing / Databases", "Technology & Computing / Graphics", "Technology & Computing / Graphics / 3-D Graphics", "Technology & Computing / Graphics / Animation", "Technology & Computing / Graphics / Desktop Publishing", "Technology & Computing / Graphics / Desktop Video", "Technology & Computing / Graphics / Web Design/HTML", "Technology & Computing / Home Theater Systems", "Technology & Computing / Operating Systems", "Technology & Computing / Operating Systems / 
Linux", "Technology & Computing / Operating Systems / Mac OS", "Technology & Computing / Operating Systems / Unix", "Technology & Computing / Operating Systems / Windows", "Technology & Computing / Portable Device", "Technology & Computing / Programming", "Technology & Computing / Programming / C/C++", "Technology & Computing / Programming / Java", "Technology & Computing / Programming / JavaScript", "Technology & Computing / Programming / Visual Basic", "Travel", "Travel / Adventure Travel", "Travel / Africa", "Travel / Air Travel", "Travel / Asia", "Travel / Asia / Japan", "Travel / Australia & New Zealand", "Travel / Bed & Breakfasts", "Travel / Budget Travel", "Travel / Business Travel", "Travel / Camping", "Travel / Canada", "Travel / Caribbean", "Travel / Cruises", "Travel / Europe", "Travel / Europe / Eastern Europe", "Travel / Europe / France", "Travel / Europe / Greece", "Travel / Europe / Italy", "Travel / Europe / United Kingdom", "Travel / Honeymoons/Getaways", "Travel / Hotels", "Travel / Mexico & Central America", "Travel / National Parks", "Travel / South America", "Travel / Spas", "Travel / Theme Parks", "Travel / United States", "Travel / United States / California", "Travel / United States / Florida", "Travel / United States / Hawaii", "Travel / United States / Las Vegas, Nevada", "Travel / United States / Manhattan, New York", "Travel / United States / New England", "Travel / United States / Texas", "Travel / Weather"]
I clean up the data file and split it so that it looks like this:
['Arts & Entertainment']
['Arts & Entertainment', 'Animation & Comics']
['Arts & Entertainment', 'Books & Literature']
['Arts & Entertainment', 'Celebrity Gossip']
['Arts & Entertainment', 'Fine Art']
['Arts & Entertainment', 'Humor']
['Arts & Entertainment', 'Movies']
['Arts & Entertainment', 'Movies', 'Action']
['Arts & Entertainment', 'Movies', 'Comedy']
['Arts & Entertainment', 'Movies', 'Documentary']
['Arts & Entertainment', 'Movies', 'Drama']
['Arts & Entertainment', 'Movies', 'Horror']
['Arts & Entertainment', 'Music']
['Arts & Entertainment', 'Music', 'Alternative Music']
['Arts & Entertainment', 'Music', 'Blues']
['Arts & Entertainment', 'Music', 'Christian Music']
['Arts & Entertainment', 'Music', 'Classic Rock']
['Arts & Entertainment', 'Music', 'Classical Music']
['Arts & Entertainment', 'Music', 'Country Music']
['Arts & Entertainment', 'Music', 'Electronic Dance Music']
['Arts & Entertainment', 'Music', 'Heavy Metal']
['Arts & Entertainment', 'Music', 'Pop Music']
['Arts & Entertainment', 'Music', 'Rap']
['Arts & Entertainment', 'Radio Stations']
['Arts & Entertainment', 'Television']
['Arts & Entertainment', 'Television', 'Game Show']
['Arts & Entertainment', 'Television', 'Kids']
['Arts & Entertainment', 'Television', 'News']
['Arts & Entertainment', 'Television', 'Reality']
['Arts & Entertainment', 'Television', 'Science']
['Arts & Entertainment', 'Television', 'Sitcom']
['Arts & Entertainment', 'Television', 'Soap Opera']
['Arts & Entertainment', 'Television', 'Talk Show']...
Now I am trying to convert the list of lists into a dictionary that looks like this:
{
    "Arts & Entertainment": {
        "Animation & Comics": {},
        "Books & Literature": {},
        "Celebrity Gossip": {},
        "Fine Art": {},
        "Humor": {},
        "Movies": {
            "Horror": {},
            "Action": {},
            "Comedy": {}, ...
        }, ...
}
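The problem is that I cannot figure out how to avoid overwriting my subcategories. In the example above, the Movies subkey should contain several categories, but when I run my code it contains only "Horror", because Horror is the last element of the last list in that category. An example of what I get: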
{
    "Arts & Entertainment": {
        "Animation & Comics": {},
        "Books & Literature": {},
        "Celebrity Gossip": {},
        "Fine Art": {},
        "Humor": {},
        "Movies": {
            "Horror": {}  # notice there are no other categories in the movies section
        }, ...
}
The code I tried:
def cleanup_contextweb():
    contextweb_file_path = directory_path + raw_file_names[1]
    tree = {}
    with open(contextweb_file_path, 'r') as contextweb_file:
        cats = contextweb_file.read().replace('Manhattan, New York', 'Manhattan New York').replace('Las Vegas, Nevada', 'Las Vegas Nevada').replace('Celebrity/Gossip', 'Celebrity Gossip').replace('Atheism/Agnosticism', 'Atheism Agnosticism').replace('Pagan/Wiccan', 'Pagan Wiccan').split(',')
        #cats = re.sub(r'"|\[|\]', '', cats)
        cats = [map(str.strip, re.sub(r'"|\[|\]', '', cat).split('/')) for cat in cats]
        cats = sorted(cats)
        for cat in cats:
            if len(cat) == 1:
                tree[cat[0]] = {}
            elif len(cat) == 2:
                tree[cat[0]][cat[1]] = {}
            elif len(cat) == 3:
                tree[cat[0]][cat[1]] = {}
                tree[cat[0]][cat[1]][cat[2]] = {}
            elif len(cat) == 4:
                tree[cat[0]][cat[1]] = {}
                tree[cat[0]][cat[1]][cat[2]] = {}
                tree[cat[0]][cat[1]][cat[2]][cat[3]] = {}
    with open(directory_path + 'cleaned_' + raw_file_names[1], 'w') as contextweb_file_out:
        json.dump(tree, contextweb_file_out, sort_keys=True, indent=4)
    return json.dumps(tree, sort_keys=True, indent=4)
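As you can see, when I build the dictionary I decide how deep to go (how many keys I need) based on the length of the incoming list. Other things I tried, but have since removed, include sorting the list of lists (cats) by sublist length and reversing it, so that all lists containing 4 elements are iterated first. I thought I could build the keys that way, since the keys would then already exist at the lower levels. It did not help much.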
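A minimal sketch that reproduces the overwrite, assuming a shortened, hypothetical input list in place of the full file. It uses the same branch logic as the function above, trimmed to three levels:

```python
# Hypothetical, shortened input standing in for the full category list.
cats = [
    ['Arts & Entertainment'],
    ['Arts & Entertainment', 'Movies'],
    ['Arts & Entertainment', 'Movies', 'Action'],
    ['Arts & Entertainment', 'Movies', 'Comedy'],
    ['Arts & Entertainment', 'Movies', 'Horror'],
]

tree = {}
for cat in cats:
    if len(cat) == 1:
        tree[cat[0]] = {}
    elif len(cat) == 2:
        tree[cat[0]][cat[1]] = {}
    elif len(cat) == 3:
        # This assignment replaces the existing 'Movies' dict with a new
        # empty one, discarding whatever earlier iterations stored in it.
        tree[cat[0]][cat[1]] = {}
        tree[cat[0]][cat[1]][cat[2]] = {}

print(tree)
# {'Arts & Entertainment': {'Movies': {'Horror': {}}}}
```

Each three-element list resets its parent before adding its own leaf, so only the last leaf per parent survives.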
Answer 0 (score: 6)
Actually, a plain for loop also makes a nice solution:
>>> data
[['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 's', 'd'], ['a', 'b', 'c', 'd', 'e']]
>>> tree = {}
>>> for cats in data:
...     curtree = tree
...     for c in cats:
...         curtree = curtree.setdefault(c, {})
...
>>> tree
{'a': {'s': {'d': {}}, 'b': {'c': {'d': {'e': {}}}}}}
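The .setdefault() method ensures that a sub-dictionary is added if and only if the key (category) does not exist yet. It returns the sub-dictionary stored under that key, whether it was already there or has just been inserted, so reassigning curtree moves one level down. curtree starts at the base dictionary tree and uses the categories to walk and build the tree.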
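As a rough sketch, the same loop can be applied to strings in the raw "A / B / C" form from the question. The names raw and build_tree are hypothetical, and the sample is a small stand-in for the real file. The sketch assumes that levels are always separated by " / " (a slash with a space on each side), which appears to hold for the data shown. Under that assumption, labels such as "Celebrity/Gossip" or "Las Vegas, Nevada" need no special-case replacements, although they keep their original spelling instead of the cleaned one.

```python
import json

# Hypothetical sample in the same "A / B / C" form as the raw file.
raw = [
    "Arts & Entertainment",
    "Arts & Entertainment / Celebrity/Gossip",
    "Arts & Entertainment / Movies",
    "Arts & Entertainment / Movies / Action",
    "Arts & Entertainment / Movies / Horror",
    "Travel / United States / Las Vegas, Nevada",
]

def build_tree(paths):
    """Build a nested dict from ' / '-separated category paths."""
    tree = {}
    for path in paths:
        node = tree
        # Split on " / " (with spaces) so "Celebrity/Gossip" stays one label.
        for label in path.split(" / "):
            # Reuse the existing sub-dict if present, else insert a new one.
            node = node.setdefault(label, {})
    return tree

tree = build_tree(raw)
print(json.dumps(tree, sort_keys=True, indent=4))
```

Because setdefault also creates missing parents, the order of the input does not matter, and a path such as "Travel / United States / ..." works even when "Travel" never appears on its own.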
Answer 1 (score: 5)
Here is what a recursive version looks like:
data = [
    ['Arts & Entertainment'],
    ['Arts & Entertainment', 'Animation & Comics'],
    ...,  # full data list elided for readability
    ['Arts & Entertainment', 'Television', 'Talk Show']
]

def classify(in_list):
    sub_dict = {}
    # Distinct first-level labels at this depth
    label_set = set([category[0] for category in in_list])
    for label in label_set:
        # print label
        # Tails of every path that starts with this label and goes deeper
        sub_category = [sub[1:] for sub in in_list if sub[0] == label and len(sub) > 1]
        # print sub_category
        sub_dict[label] = classify(sub_category)
    return sub_dict

print(classify(data))
Output (not formatted for readability):
{'Arts & Entertainment': {'Celebrity Gossip': {}, 'Humor': {}, 'Television': {'Game Show': {}, 'Kids': {}, 'Science': {}, 'Talk Show': {}, 'Sitcom': {}, 'Reality': {}, 'Soap Opera': {}, 'News': {}}, 'Animation & Comics': {}, 'Movies': {'Action': {}, 'Drama': {}, 'Horror': {}, 'Comedy': {}, 'Documentary': {}}, 'Radio Stations': {}, 'Music': {'Alternative Music': {}, 'Christian Music': {}, 'Electronic Dance Music': {}, 'Pop Music': {}, 'Country Music': {}, 'Classical Music': {}, 'Rap': {}, 'Heavy Metal': {}, 'Blues': {}, 'Classic Rock': {}}, 'Fine Art': {}, 'Books & Literature': {}}}
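The recursive version rescans in_list once for every label. A possible variant, sketched here under the hypothetical name classify_grouped and not taken from the answer itself, groups the paths by their first element in a single pass and then recurses on the tails. The sample list is a shortened stand-in for the full data.

```python
def classify_grouped(paths):
    """Group paths by first label in one pass, then recurse on the tails."""
    groups = {}
    for path in paths:
        if path:  # an empty tail means there is no deeper subcategory
            groups.setdefault(path[0], []).append(path[1:])
    return {label: classify_grouped(tails) for label, tails in groups.items()}

# Hypothetical, shortened input standing in for the full category list.
sample = [
    ['Arts & Entertainment'],
    ['Arts & Entertainment', 'Movies'],
    ['Arts & Entertainment', 'Movies', 'Action'],
    ['Arts & Entertainment', 'Movies', 'Horror'],
    ['Arts & Entertainment', 'Music', 'Blues'],
]

result = classify_grouped(sample)
print(result)
```

For this sample the result nests Action and Horror under Movies and Blues under Music, the same tree the setdefault loop would build.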