Question

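I'm working on a Kaggle cooking competition and I want to improve how I count distinct ingredients. I realized it is silly to treat raw strings like 'ground black pepper' and 'black pepper' as different ingredients, so I want to compare 2 lists: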

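the raw/original list, containing arbitrarily long ingredient strings such as 'ground black pepper'
the processed ingredient list, where every ingredient string of >= 3 words is converted into bigrams such as 'ground black' and 'black pepper'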

ingredient | count in raw | count in processed | difference

...

Is there a Counter object that can efficiently (preferably in O(n)) count the occurrences of string items across 2 lists?
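As a rough sketch of the table described above: `collections.Counter` counts a single list in one pass, so one option is to build a Counter per list and look each ingredient up in both. The helper name `compare_counts` and the toy lists are assumptions for illustration, not part of the original code.

```python
from collections import Counter


def compare_counts(raw_items, processed_items):
    """Return (item, raw count, processed count, difference) rows."""
    raw_counts = Counter(raw_items)              # one pass over the raw list
    processed_counts = Counter(processed_items)  # one pass over the processed list
    # Union of keys, so items that appear in only one list are kept.
    all_items = set(raw_counts) | set(processed_counts)
    rows = [
        (item,
         raw_counts[item],          # a missing key counts as 0
         processed_counts[item],
         processed_counts[item] - raw_counts[item])
        for item in all_items
    ]
    # Sorting is only for display: largest absolute difference first, then by name.
    rows.sort(key=lambda row: (-abs(row[3]), row[0]))
    return rows


# Hypothetical toy data standing in for the two ingredient lists.
raw = ["black pepper", "ground black pepper", "salt", "salt"]
processed = ["black pepper", "ground black", "black pepper", "salt", "salt"]

for row in compare_counts(raw, processed):
    print(row)
```

Counting is linear in the combined length of the two lists; the final sort adds O(k log k) over the k distinct items and can be dropped if ordering does not matter.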

Here is my current code:

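import json
import string
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords
from nltk.util import ngrams

with open('data/train.json', 'r') as f:
    train_json = f.read()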

stop = stopwords.words('english') + list(string.punctuation)

trainset = json.loads(train_json)
# read the data in and export the ingredient column into a list
df = pd.read_json('data/train.json')
all_recipes = df.ingredients.tolist()

# extract all ingredients from all cuisines into one flat list
all_ingr_raw = [ingr.lower() for recipe in all_recipes for ingr in recipe]

# compare 2 lists for most common ingredients: raw unprocessed ingredients vs. processed,
# where processed means 1-2 word ingredients kept as-is plus bigrams of >=3-word ingredients

# take all ingredient strings of 3 or more words (i.e. at least 2 spaces)
three_word_ingr = [ingr for ingr in all_ingr_raw if len(ingr.split()) > 2]

# make a list of sublists of bigram tuples for each string from above 
raw_three_word_ngrams = [list(ngrams(phrase.split(),2)) for phrase in three_word_ingr]
# turn tuples into strings and flatten the list
proc_three_word_ngrams = [' '.join(pair) for sublist in raw_three_word_ngrams for pair in sublist]
# join all ingredient strings of 2 words or fewer with the flat list of bigrams from the >=3-word strings
all_ingr_ngrams = [ingr for ingr in all_ingr_raw if len(ingr.split()) <= 2] + proc_three_word_ngrams

# build lists of (ingredient, count) tuples, sorted by descending count
count_ingr_raw = Counter(all_ingr_raw).most_common()
count_ingr_ngrams = Counter(all_ingr_ngrams).most_common()

common = [x for x in count_ingr_raw if x[0] in count_ingr_ngrams]
unique_raw = [x for x in count_ingr_raw if x[0] not in count_ingr_ngrams]
unique_proc = [x for x in count_ingr_ngrams if x[0] not in count_ingr_raw]

# print the first 20 entries of each list
print(common[:20])
print(unique_raw[:20])
print(unique_proc[:20])

[]
[(u'salt', 18049), (u'olive oil', 7972), (u'onions', 7972), (u'water', 7457), (u'garlic', 7380), (u'sugar', 6434), (u'garlic cloves', 6237), (u'butter', 4848), (u'ground black pepper', 4785), (u'all-purpose flour', 4632), (u'pepper', 4438), (u'vegetable oil', 4385), (u'eggs', 3388), (u'soy sauce', 3296), (u'kosher salt', 3113), (u'green onions', 3078), (u'tomatoes', 3058), (u'large eggs', 2948), (u'carrots', 2814), (u'unsalted butter', 2782)]
[(u'salt', 18049), (u'olive oil', 10916), (u'black pepper', 8039), (u'onions', 7972), (u'water', 7457), (u'garlic', 7380), (u'garlic cloves', 7110), (u'sugar', 6434), (u'ground black', 5004), (u'butter', 4848), (u'soy sauce', 4822), (u'vegetable oil', 4731), (u'all-purpose flour', 4632), (u'pepper', 4438), (u'bell pepper', 4190), (u'green onions', 3550), (u'eggs', 3388), (u'chicken broth', 3386), (u'kosher salt', 3179), (u'red pepper', 3169)]
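
The empty first list most likely comes from the membership tests: `most_common()` returns a list of (ingredient, count) tuples, so a bare ingredient string is never `in` that list, and each test also scans the whole list. A minimal sketch, assuming the Counter objects are kept around (the names `raw_counter` and `ngram_counter` and the toy data are hypothetical), tests against the Counter keys instead:

```python
from collections import Counter

# Toy stand-ins for all_ingr_raw and all_ingr_ngrams from the question.
all_ingr_raw = ["salt", "ground black pepper", "black pepper", "salt"]
all_ingr_ngrams = ["salt", "ground black", "black pepper", "black pepper", "salt"]

raw_counter = Counter(all_ingr_raw)
ngram_counter = Counter(all_ingr_ngrams)

# A string is never a member of a list of tuples, but it is a key of the Counter.
assert "salt" not in raw_counter.most_common()
assert "salt" in raw_counter

# Key lookups on a Counter are constant time on average.
common = [(ingr, n) for ingr, n in raw_counter.most_common() if ingr in ngram_counter]
unique_raw = [(ingr, n) for ingr, n in raw_counter.most_common() if ingr not in ngram_counter]
unique_proc = [(ingr, n) for ingr, n in ngram_counter.most_common() if ingr not in raw_counter]

print(common)
print(unique_raw)
print(unique_proc)
```

With this change `common` holds the ingredients present in both lists, with their raw counts, rather than always being empty.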

Counter that counts items' occurrences in multiple lists and returns tuples (item, count in list 1, count in list 2)

0 answers: