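I scraped the Wikipedia page "Cuisine of New York City" with Beautiful Soup. Now I am having trouble extracting the data I need.
My desired output should look like this: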
Place1      Place2           Cuisine
The Bronx   Bedford Park     Mexican, Puerto Rican, Dominican
.
.
.
Manhattan   Upper East Side  German, Czech, Hungarian
Code:
import wikipedia as wp  # assumption: `wp` is the `wikipedia` package
from bs4 import BeautifulSoup

html = wp.page("Cuisine_of_New_York_City").html().encode("UTF-8")
soup = BeautifulSoup(html, 'lxml')
# find() returns only the first matching div
article = soup.find('div', class_="div-col columns column-width")
array = article.text.split('\n')[1:-1]
print(array)
I tried this, but I only get the first of the entries I am looking for.
Answer 0 (score: 1)
You only need to change the method: find returns just the first matching element. Use find_all instead:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://en.wikipedia.org/wiki/Cuisine_of_New_York_City')
soup = BeautifulSoup(page.text, 'html.parser')
# find_all() returns every matching div, not just the first
articles = soup.find_all('div', class_="div-col columns column-width")
for article in articles:
    # Split the block's text into lines and drop the first and last pieces
    array = article.text.split('\n')[1:-1]
    print(array)
Output:
['Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on 204th St.)', 'Belmont – Italian, Albanian (also known as "Arthur Avenue," "Little Italy")', 'City Island – Italian, Seafood', 'Morris Park – Italian, Albanian', 'Norwood – Filipino (formerly Irish, less so today)', 'Riverdale – Jewish', 'South Bronx – Puerto Rican, Dominican', 'Wakefield – Jamaican, West Indian', 'Woodlawn – Irish']
['Astoria – Greek, Italian, Eastern European, Brazilian, Egyptian and other Arabic', 'Bellerose – Indian and Pakistani', 'Flushing – Chinese and Korean', 'Forest Hills; Kew Gardens Hills; Rego Park – Jewish, Russian and Uzbek', 'Howard Beach; Ozone Park – Italian', 'Glendale – German and Polish', 'Jackson Heights – Indian, Pakistani, Bangladeshi, Colombian, Ecuadorian, Peruvian, Korean, Filipino and Mexican', 'Jamaica – Bangladeshi, Caribbean; African-American; African; Creole', 'Little Neck – Arab, Chinese, and Italian', 'Richmond Hill – Indian, Guyanese, West Indian, Pakistani, Bangladeshi', 'The Rockaways - Irish, Jewish', 'Woodhaven – Irish, Dominican, Mexican, Guyanese', 'Woodside; Sunnyside – Filipino, Irish, Mexican, and Romanian']
['Bay Ridge – Irish, Italian, Greek, Turkish, Lebanese, Palestinian, Yemeni and other Arabic', 'Bedford-Stuyvesant – African-American, Jamaican, Trinidadian, Puerto Rican and West Indian', 'Bensonhurst; – Italian, Chinese, Turkish, Russian, Mexican, Uzbek', 'Borough Park – Jewish, Italian, Mexican, Chinese', 'Brighton Beach – Russian, Georgian, Turkish, Pakistani and Ukrainian', 'Bushwick – Puerto Rican, Mexican, Dominican, and Ecuadorian', 'Canarsie – Jamaican, West Indian, African-American', 'Carroll Gardens – Italian', 'Crown Heights – Jamaican, West Indian, and Jewish', 'East New York – African-American, Dominican, and Puerto Rican', 'Flatbush – Jamaican, Haitian, and Creole', 'Greenpoint – Polish and Ukrainian', 'Kensington – Bengali, Pakistani, Mexican, Uzbek, and Polish', 'Midwood – Jewish, Italian, Russian, and Pakistani', 'Park Slope – Italian, Irish, French, and Puerto Rican (formerly)', 'Red Hook – Puerto Rican, African-American, and Italian', 'Sheepshead Bay – Seafood, Russian, and Italian', 'Sunset Park – Puerto Rican, Chinese, Arab, Mexican and Italian', 'Williamsburg – Italian, Jewish, Dominican and Puerto Rican']
['Chinatown – Chinese and Vietnamese', 'East Harlem – Puerto Rican, Mexican, Dominican, Chinese-Cuban and Italian', 'East Village – Japanese, Korean, Indian and Ukrainian', 'Greenwich Village – Italian', 'Harlem – Italian, African-American, Latin American, West Indian, and West African', 'Koreatown – Korean', 'Little Italy – Italian', 'Lower East Side – Puerto Rican, Jewish, Italian, and Latin American', 'Murray Hill – Indian, Pakistani and Bangladeshi', 'Washington Heights – Dominican, Puerto Rican, Italian and Jewish', 'Upper East Side – German, Czech, Hungarian']
['Manhattan clam chowder', 'New York-style cheesecake', 'New York-style pizza', 'New York-style bagel', 'New York-style pastrami', 'Corned beef[4]', 'Baked pretzels', 'New York-style Italian ice', 'Knish', 'Eggs Benedict', 'Chopped Cheese', 'Lobster Newberg', 'Waldorf Salad', 'Doughnut', 'Delmonico steak', 'Black and white cookie', 'Bacon, egg and cheese sandwich on a roll']
['celery soda', 'New York-style pastrami, pastrami on rye', 'brisket[4]', 'corned beef[4]', 'tongue', 'knish[4]', 'New York-style bagels and lox (see also: appetizing)[4]', 'Bagel and cream cheese', 'cream cheese', 'whitefish with and without pike', 'Gefilte fish', 'blintzes[4]', 'potato pancake', 'bialy[4]', 'challah bread', 'matzo', 'egg cream', 'pickled cucumbers (especially dill pickles)', 'kishka', 'potato kugel', 'chopped chicken liver', 'matzo ball soup', 'lokshen soup']
['Bloody Mary', 'Chef salad', 'Chicken à la King[13]', 'Chicken and waffles', 'Chicken Divan', 'Cronut', 'Delmonico steak', 'Egg cream', 'Eggs Benedict', "General Tso's chicken", 'Ice cream cone', 'Lobster Newburg', 'Mallomars[14]', 'Manhattan', 'Manhattan Special – A type of carbonated espresso drink.', 'Pasta primavera', 'Penne alla Vodka', 'Reuben sandwich', 'Steak Diane', 'Spaghetti and meatballs', 'Vichyssoise', 'Waldorf salad']
['arepas', 'calzones', 'Chinese kebabs (chuanr)', 'churros', 'cuchifritos', 'dumplings', 'falafel', 'fried chicken', 'fried noodles', "Gray's Papaya, Papaya King – combined papaya juice/hot dog stands", 'corndogs', 'grilled chestnuts[3]', 'gyros/shawarma', 'Halal chicken/lamb over rice[15]', 'hamburgers', 'honey-roasted peanuts, almonds, cashews, and coconut', 'hot dog stands', 'Italian ice', 'Italian sausage, bratwurst', 'knishes', 'Mister Softee ice cream', 'muffins', 'piragua', 'pizza, especially New York-style pizza', 'soft pretzels[3]', 'souvlaki/shish kebab', 'stromboli', 'tacos', 'take-out soup, as Soup Kitchen International']
['A&P', 'AriZona Beverage Company', "Balducci's", "Bamonte's", 'Benihana', 'Blimpie', 'C-Town Supermarkets', 'Caffe Reggio - the first espresso bar to introduce cappuccino in America', 'Carnegie Deli', 'Carvel (restaurant)', 'Clinton St. Baking Company & Restaurant', 'Dean & DeLuca', "Dr. Brown's – sodas", "Drake's Cakes – cakes, pies, pastries", 'Domino Foods', "Entenmann's – cakes, pies, pastries", 'Fairway Market', 'Ferrara Bakery and Cafe - first Italian caffe to open up in America', 'Food Network – cable TV channel', 'Fraunces Tavern – George Washington said goodbye to his troops here. Some departments of his new federal government were originally located here.', 'Golden Krust Caribbean Bakery & Grill', 'Gray\'s Papaya – hot dog institution where there is always a "recession special"', 'Grotta Azzurra', "Grimaldi's Pizzeria", 'Häagen-Dazs', 'Hebrew National', "Junior's – The World's Most Fabulous Cheesecake", "Katz's Deli", 'Kesté', 'Key Food supermarket', 'L&B Spumoni Gardens', "Lindy's", "Lombardi's – first pizzeria in America", "Nathan's", 'Now and Later candy', 'Papaya King', 'PepsiCo, Inc.', 'Peter Luger Steak House', "Ray's Pizza – a fierce debate over which was the original", 'Russian Tea Room', 'Second Avenue Deli', 'Serendipity 3', 'Sbarro', 'Shake Shack', 'Snapple', "Stella D'oro – biscuits, cookies", "T.G.I. Friday's – originally a NYC bar", "Totonno's - first pizzeria to open up in Brooklyn", 'The Halal Guys', 'Vitamin Water', 'Yoo-hoo – chocolate drink', "Zabar's"]
['New York Food Anywhere', 'Who Cooked That Up?', 'New York Gastronomic & Cultural Food Tours', "Explore Manhattan's Unique Neighborhoods and Foods", 'The Best Of Brooklyn Multicultural Ethnic Neighborhood Food Tasting and Culture Tour', 'Find NYC street food vendors', 'Great Eating In Flushing']
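The difference between the two calls can be seen without touching the network. The sketch below runs both against a small inline HTML string. The markup and the names SAMPLE_HTML, CLASS_NAME and list_items are made up for illustration and only mimic the div class used on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two blocks sharing the class string used in the question.
SAMPLE_HTML = """
<div class="div-col columns column-width"><ul>
<li>Riverdale – Jewish</li>
<li>Woodlawn – Irish</li>
</ul></div>
<div class="div-col columns column-width"><ul>
<li>Flushing – Chinese and Korean</li>
</ul></div>
"""

CLASS_NAME = "div-col columns column-width"


def list_items(div):
    """Return the stripped text of every <li> inside one div."""
    return [li.get_text(strip=True) for li in div.find_all("li")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching element (or None if nothing matches).
first_only = list_items(soup.find("div", class_=CLASS_NAME))

# find_all() returns every matching element, in document order.
all_blocks = [list_items(div) for div in soup.find_all("div", class_=CLASS_NAME)]

print(first_only)
print(all_blocks)
```

Reading the li elements directly, as done here, also avoids the slicing on split('\n'), at the cost of assuming every entry sits in its own list item.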
Edit:
Below is a snippet that also captures Place1 (the borough) and stores the data in a dictionary:
from pprint import pprint

from bs4 import BeautifulSoup
import requests

page = requests.get('https://en.wikipedia.org/wiki/Cuisine_of_New_York_City')
soup = BeautifulSoup(page.text, 'html.parser')
results = {}
articles = soup.find_all('div', class_="div-col columns column-width")
for article in articles:
    # Check that the div belongs to the right section
    if article.find_previous_sibling('h2').find('span').get('id') == 'Enclaves_reflecting_national_cuisines':
        # The nearest preceding h3 holds the borough name
        category = article.find_previous_sibling('h3')
        title_key = category.find('span', {'class': 'mw-headline'}).get_text()
        results[title_key] = article.text.split('\n')[1:-1]
# pprint sorts the keys and wraps long strings
pprint(results)
Output:
{'Brooklyn': ['Bay Ridge – Irish, Italian, Greek, Turkish, Lebanese, '
'Palestinian, Yemeni and other Arabic',
'Bedford-Stuyvesant – African-American, Jamaican, Trinidadian, '
'Puerto Rican and West Indian',
'Bensonhurst; – Italian, Chinese, Turkish, Russian, Mexican, '
'Uzbek',
'Borough Park – Jewish, Italian, Mexican, Chinese',
'Brighton Beach – Russian, Georgian, Turkish, Pakistani and '
'Ukrainian',
'Bushwick – Puerto Rican, Mexican, Dominican, and Ecuadorian',
'Canarsie – Jamaican, West Indian, African-American',
'Carroll Gardens – Italian',
'Crown Heights – Jamaican, West Indian, and Jewish',
'East New York – African-American, Dominican, and Puerto Rican',
'Flatbush – Jamaican, Haitian, and Creole',
'Greenpoint – Polish and Ukrainian',
'Kensington – Bengali, Pakistani, Mexican, Uzbek, and Polish',
'Midwood – Jewish, Italian, Russian, and Pakistani',
'Park Slope – Italian, Irish, French, and Puerto Rican '
'(formerly)',
'Red Hook – Puerto Rican, African-American, and Italian',
'Sheepshead Bay – Seafood, Russian, and Italian',
'Sunset Park – Puerto Rican, Chinese, Arab, Mexican and Italian',
'Williamsburg – Italian, Jewish, Dominican and Puerto Rican'],
'Manhattan': ['Chinatown – Chinese and Vietnamese',
'East Harlem – Puerto Rican, Mexican, Dominican, Chinese-Cuban '
'and Italian',
'East Village – Japanese, Korean, Indian and Ukrainian',
'Greenwich Village – Italian',
'Harlem – Italian, African-American, Latin American, West '
'Indian, and West African',
'Koreatown – Korean',
'Little Italy – Italian',
'Lower East Side – Puerto Rican, Jewish, Italian, and Latin '
'American',
'Murray Hill – Indian, Pakistani and Bangladeshi',
'Washington Heights – Dominican, Puerto Rican, Italian and '
'Jewish',
'Upper East Side – German, Czech, Hungarian'],
'Queens': ['Astoria – Greek, Italian, Eastern European, Brazilian, Egyptian '
'and other Arabic',
'Bellerose – Indian and Pakistani',
'Flushing – Chinese and Korean',
'Forest Hills; Kew Gardens Hills; Rego Park – Jewish, Russian and '
'Uzbek',
'Howard Beach; Ozone Park – Italian',
'Glendale – German and Polish',
'Jackson Heights – Indian, Pakistani, Bangladeshi, Colombian, '
'Ecuadorian, Peruvian, Korean, Filipino and Mexican',
'Jamaica – Bangladeshi, Caribbean; African-American; African; '
'Creole',
'Little Neck – Arab, Chinese, and Italian',
'Richmond Hill – Indian, Guyanese, West Indian, Pakistani, '
'Bangladeshi',
'The Rockaways - Irish, Jewish',
'Woodhaven – Irish, Dominican, Mexican, Guyanese',
'Woodside; Sunnyside – Filipino, Irish, Mexican, and Romanian'],
'The Bronx': ['Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on '
'204th St.)',
'Belmont – Italian, Albanian (also known as "Arthur Avenue," '
'"Little Italy")',
'City Island – Italian, Seafood',
'Morris Park – Italian, Albanian',
'Norwood – Filipino (formerly Irish, less so today)',
'Riverdale – Jewish',
'South Bronx – Puerto Rican, Dominican',
'Wakefield – Jamaican, West Indian',
'Woodlawn – Irish']}
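To get from this dictionary to the Place1 / Place2 / Cuisine rows asked for in the question, each entry can be split at the dash that separates the neighbourhood from its cuisines. This is a minimal sketch and not part of the original answer: the helper name to_rows is hypothetical, the sample dictionary is a small excerpt of the output above, and it assumes the separator is an en dash or a hyphen surrounded by whitespace (both occur in the data).

```python
import re

# Small excerpt of the `results` dictionary printed above.
results = {
    "The Bronx": [
        "Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on 204th St.)",
        "Woodlawn – Irish",
    ],
    "Queens": [
        "The Rockaways - Irish, Jewish",  # plain hyphen instead of an en dash
    ],
    "Brooklyn": [
        "Bedford-Stuyvesant – African-American, Jamaican, Trinidadian, Puerto Rican and West Indian",
    ],
}

# Assumed separator: an en dash or hyphen with whitespace on both sides.
# Hyphens inside names such as "Bedford-Stuyvesant" are left alone.
SEPARATOR = re.compile(r"\s+[–-]\s+")


def to_rows(data):
    """Flatten {borough: [entries]} into (place1, place2, cuisine) tuples."""
    rows = []
    for place1, entries in data.items():
        for entry in entries:
            parts = SEPARATOR.split(entry, maxsplit=1)
            place2 = parts[0].strip()
            # Entries without a separator keep an empty cuisine field.
            cuisine = parts[1].strip() if len(parts) == 2 else ""
            rows.append((place1, place2, cuisine))
    return rows


rows = to_rows(results)
for row in rows:
    print("\t".join(row))
```

The cuisine field is kept as one string because the entries mix commas, "and", semicolons and parenthetical notes, so splitting it further would need rules the source does not spell out.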
Answer 1 (score: 0)
You can find the desired headers, and then the corresponding places and food types:
import requests
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://en.wikipedia.org/wiki/Cuisine_of_New_York_City').text, 'html.parser')
# Section titles: every h3 that contains an mw-headline span
headers = [i.span.text for i in d.find_all('h3') if i.find('span', {'class': 'mw-headline'})]
# Pair the headers with the list blocks by position
final_result = {a: [i.text for i in b.find_all('li')] for a, b in zip(headers, d.find_all('div', {'class': 'div-col columns column-width'}))}
print(final_result)
Output:
{'The Bronx': ['Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on 204th St.)', 'Belmont – Italian, Albanian (also known as "Arthur Avenue," "Little Italy")', 'City Island – Italian, Seafood', 'Morris Park – Italian, Albanian', 'Norwood – Filipino (formerly Irish, less so today)', 'Riverdale – Jewish', 'South Bronx – Puerto Rican, Dominican', 'Wakefield – Jamaican, West Indian', 'Woodlawn – Irish'], 'Queens': ['Astoria – Greek, Italian, Eastern European, Brazilian, Egyptian and other Arabic', 'Bellerose – Indian and Pakistani', 'Flushing – Chinese and Korean', 'Forest Hills; Kew Gardens Hills; Rego Park – Jewish, Russian and Uzbek', 'Howard Beach; Ozone Park – Italian', 'Glendale – German and Polish', 'Jackson Heights – Indian, Pakistani, Bangladeshi, Colombian, Ecuadorian, Peruvian, Korean, Filipino and Mexican', 'Jamaica – Bangladeshi, Caribbean; African-American; African; Creole', 'Little Neck – Arab, Chinese, and Italian', 'Richmond Hill – Indian, Guyanese, West Indian, Pakistani, Bangladeshi', 'The Rockaways - Irish, Jewish', 'Woodhaven – Irish, Dominican, Mexican, Guyanese', 'Woodside; Sunnyside – Filipino, Irish, Mexican, and Romanian'], 'Brooklyn': ['Bay Ridge – Irish, Italian, Greek, Turkish, Lebanese, Palestinian, Yemeni and other Arabic', 'Bedford-Stuyvesant – African-American, Jamaican, Trinidadian, Puerto Rican and West Indian', 'Bensonhurst; – Italian, Chinese, Turkish, Russian, Mexican, Uzbek', 'Borough Park – Jewish, Italian, Mexican, Chinese', 'Brighton Beach – Russian, Georgian, Turkish, Pakistani and Ukrainian', 'Bushwick – Puerto Rican, Mexican, Dominican, and Ecuadorian', 'Canarsie – Jamaican, West Indian, African-American', 'Carroll Gardens – Italian', 'Crown Heights – Jamaican, West Indian, and Jewish', 'East New York – African-American, Dominican, and Puerto Rican', 'Flatbush – Jamaican, Haitian, and Creole', 'Greenpoint – Polish and Ukrainian', 'Kensington – Bengali, Pakistani, Mexican, Uzbek, and Polish', 'Midwood – Jewish, Italian, Russian, and Pakistani', 'Park Slope – Italian, Irish, French, and Puerto Rican (formerly)', 'Red Hook – Puerto Rican, African-American, and Italian', 'Sheepshead Bay – Seafood, Russian, and Italian', 'Sunset Park – Puerto Rican, Chinese, Arab, Mexican and Italian', 'Williamsburg – Italian, Jewish, Dominican and Puerto Rican'], 'Staten Island': ['Chinatown – Chinese and Vietnamese', 'East Harlem – Puerto Rican, Mexican, Dominican, Chinese-Cuban and Italian', 'East Village – Japanese, Korean, Indian and Ukrainian', 'Greenwich Village – Italian', 'Harlem – Italian, African-American, Latin American, West Indian, and West African', 'Koreatown – Korean', 'Little Italy – Italian', 'Lower East Side – Puerto Rican, Jewish, Italian, and Latin American', 'Murray Hill – Indian, Pakistani and Bangladeshi', 'Washington Heights – Dominican, Puerto Rican, Italian and Jewish', 'Upper East Side – German, Czech, Hungarian'], 'Manhattan': ['Manhattan clam chowder', 'New York-style cheesecake', 'New York-style pizza', 'New York-style bagel', 'New York-style pastrami', 'Corned beef[4]', 'Baked pretzels', 'New York-style Italian ice', 'Knish', 'Eggs Benedict', 'Chopped Cheese', 'Lobster Newberg', 'Waldorf Salad', 'Doughnut', 'Delmonico steak', 'Black and white cookie', 'Bacon, egg and cheese sandwich on a roll'], 'Food associated with or popularized in New York City': ['celery soda', 'New York-style pastrami, pastrami on rye', 'brisket[4]', 'corned beef[4]', 'tongue', 'knish[4]', 'New York-style bagels and lox (see also: appetizing)[4]', 'Bagel and cream cheese', 'cream cheese', 'whitefish with and without pike', 'Gefilte fish', 'blintzes[4]', 'potato pancake', 'bialy[4]', 'challah bread', 'matzo', 'egg cream', 'pickled cucumbers (especially dill pickles)', 'kishka', 'potato kugel', 'chopped chicken liver', 'matzo ball soup', 'lokshen soup'], 'Dishes invented or claimed in New York City': ['Bloody Mary', 'Chef salad', 'Chicken à la King[13]', 'Chicken and waffles', 'Chicken Divan', 'Cronut', 'Delmonico steak', 'Egg cream', 'Eggs Benedict', "General Tso's chicken", 'Ice cream cone', 'Lobster Newburg', 'Mallomars[14]', 'Manhattan', 'Manhattan Special – A type of carbonated espresso drink.', 'Pasta primavera', 'Penne alla Vodka', 'Reuben sandwich', 'Steak Diane', 'Spaghetti and meatballs', 'Vichyssoise', 'Waldorf salad']}
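In the output above, 'Staten Island' is paired with Manhattan's neighbourhoods and 'Manhattan' with a list of dishes, which suggests that zip is pairing headings and lists purely by position even though not every h3 is followed by a list. One way to avoid that, sketched here on made-up markup rather than the live page, is to look up the nearest preceding h3 for each list. The HTML and the names SAMPLE_HTML, by_position and by_heading are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading has no list of its own.
SAMPLE_HTML = """
<h3><span class="mw-headline">Brooklyn</span></h3>
<div class="div-col columns column-width"><ul><li>Greenpoint – Polish and Ukrainian</li></ul></div>
<h3><span class="mw-headline">Staten Island</span></h3>
<p>No list here.</p>
<h3><span class="mw-headline">Manhattan</span></h3>
<div class="div-col columns column-width"><ul><li>Koreatown – Korean</li></ul></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find_all("div", class_="div-col columns column-width")

# Pairing by position drifts as soon as one heading has no list.
headers = [h3.span.get_text() for h3 in soup.find_all("h3")]
by_position = {
    name: [li.get_text() for li in div.find_all("li")]
    for name, div in zip(headers, divs)
}

# Pairing each list with its nearest preceding <h3> sibling stays aligned.
by_heading = {}
for div in divs:
    heading = div.find_previous_sibling("h3")
    if heading is None:
        continue  # a list that is not under any h3
    by_heading[heading.get_text(strip=True)] = [
        li.get_text() for li in div.find_all("li")
    ]

print(by_position)
print(by_heading)
```

On this sample, by_position assigns the Manhattan list to 'Staten Island', while by_heading keeps it under 'Manhattan' and leaves out the heading that has no list.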