Python: How to create a nested dictionary from lists with specific values

Time: 2016-10-09 18:36:14

Tags: python python-3.x list-comprehension dict-comprehension

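I apologize in advance for the long post, but I have tried to make it clear and easy to follow.

My question is:

How do I create a nested dictionary from lists, using specified repeated keys?

Here is an example of what I want to do, using data from a fictional news article: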

{'http://www.SomeNewsWebsite.com/Article12345': 
 {'Title': 'Trump Does Another Ridiculous Thing', 
  'Source': 'Some News Website', 
  'Thumbnail': 'SomeNewsWebsite.com/image12345'}} 

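Reading similar posts, I see people doing comparable things, but I have struggled to adapt those ideas to my own work.

That is the end of my question. Below I have posted my code and the sample lists it generates, which are what I am using to build this nested dictionary. It is also available on GitHub.

So far, I can use the following code to fetch the data, strip out the important bits, and then make two lists - one for the URLs, one for the headlines. It then uses zip to combine them into a tidy dictionary.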

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.reuters.com"

source = "Reuters"

thumbnail = "http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png"


def soup():
    """ Fetches HTML from site and turns it into a bs4 object. """
    get_html = requests.get(url)
    html = get_html.text
    make_soup = BeautifulSoup(html, 'html.parser')
    return make_soup


# Tell bs4 where to find the important information (headlines, URLs)
important_data = (soup().select(".story-content > .story-title > a"))


# Turn that important data into a string so it may be parsed using RegEx
stringed_data = ' || '.join(str(v) for v in important_data)


def get_headline():
    """ Uses Regular Expressions to find headlines. Returns a list. """
    headline = re.findall(r'(?<=">)(.*?)(?=</a>)', stringed_data)
    return headline


def get_link():
    """ Uses Regular Expressions to find links. Returns a list. """
    link = re.findall(r'(?<=<a href=")(.*?)(?=")', stringed_data)
    return link

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = dict(zip(get_headline(), full_urls))
    return full_urls

get_link()
get_headline()
soup()
build_dict()

When run, this code creates two lists and then the dictionary. Sample data looks like this:

List of titles: (29 items long)
['Trump strikes defiant tone ahead of debate', 'Matthew swamps North Carolina, still dangerous as it heads out to sea', "Tesla's Musk says will not have to raise funds in fourth-quarter", 'Suspect arrested in fatal shooting of two California police officers', 'Russia says U.S. actions threaten its national security', 'Western-backed coalition under pressure over Yemen raid', "Fed's Fischer says job gains solid, expects growth to pick up", "Thai king's condition unstable after hemodialysis treatment: palace", 'Pope names new group of cardinals, adding to potential successors', 'Palestinian kills two people in Jerusalem, then shot dead: police', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'", 'Earnings season begins as White House race heats up', 'Russia expects OPEC to ask non members to consider joining output curb', 'Banks ponder the meaning of life as Deutsche agonizes', 'IMF says still engaged with Greece, no decision yet on bailout role', 'Pound slump exacerbates Brexit impact for German exporters: DIHK', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources', 'Ukraine military postpones withdrawal from town, cites rebel shelling', 'German police make new raid in hunt for refugee planning bomb attack', "South African President Zuma's rape accuser dies: family", 'Xi says China must speed up plans for domestic network technology', 'UberEats to expand to Berlin in 2017: Tagesspiegel', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services', 'Pressure on Trump likely to be intense at second debate with Clinton', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.", 'Evangelical leaders stick with Trump, focus on defeating Clinton', 'Citi sells its Argentinian consumer business to Banco Santander', "Itaú to pay $220 million for Citigroup's Brazil assets", 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico']


List of URLs: (29 items long)
['/article/us-usa-election-idUSKCN1290JZ', '/article/us-storm-matthew-idUSKCN129063', '/article/us-tesla-equity-solarcity-idUSKCN1290QW', '/article/us-california-police-shooting-idUSKCN1280YH', '/article/us-russia-usa-idUSKCN1290DP', '/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', '/article/us-usa-fed-fischer-idUSKCN1290JB', '/article/us-thailand-king-idUSKCN1290R8', '/article/us-pope-cardinals-idUSKCN1290C9', '/article/us-israel-palestinians-violence-idUSKCN129070', '/article/us-society-entertainment-film-idUSKCN127229', '/article/us-usa-stocks-weekahead-idUSKCN1272HS', '/article/us-oil-opec-russia-idUSKCN1290KD', '/article/us-imf-g20-banks-idUSKCN1290DX', '/article/us-imf-g20-greece-idUSKCN1290R6', '/article/us-britain-eu-germany-idUSKCN1290TZ', '/article/us-oil-opec-istanbul-idUSKCN1290N2', '/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', '/article/us-germany-bomb-idUSKCN1290D2', '/article/us-safrica-zuma-idUSKCN1290SX', '/article/us-china-internet-security-idUSKCN1290LA', '/article/us-uber-germany-eats-idUSKCN1290OB', '/article/us-china-regulations-ride-hailing-idUSKCN1280EL', '/article/us-usa-election-debate-idUSKCN1290AS', '/article/us-usa-election-clinton-idUSKCN1280Z9', '/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', '/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', '/article/us-citibank-brasil-m-a-itau-unibco-hldg-idUSKCN1280HM', '/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU']

Dictionary of titles and URLs: (29 items long)
{'Banks ponder the meaning of life as Deutsche agonizes': 'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX', 'German police make new raid in hunt for refugee planning bomb attack': 'http://www.reuters.com/article/us-germany-bomb-idUSKCN1290D2', 'Suspect arrested in fatal shooting of two California police officers': 'http://www.reuters.com/article/us-california-police-shooting-idUSKCN1280YH', 'Evangelical leaders stick with Trump, focus on defeating Clinton': 'http://www.reuters.com/article/us-usa-election-trump-evangelicals-idUSKCN1280WE', 'Xi says China must speed up plans for domestic network technology': 'http://www.reuters.com/article/us-china-internet-security-idUSKCN1290LA', "Australia's Rinehart and China's Shanghai CRED agree on deal for Kidman cattle empire": 'http://www.reuters.com/article/us-australia-china-landsale-dakang-p-f-idUSKCN12908O', 'LafargeHolcim agrees sale of Chilean business Cemento Polpaico': 'http://www.reuters.com/article/us-lafargeholcim-divestment-chile-idUSKCN1280BU', 'Citi sells Argentinian consumer unit a day after Brazil sale': 'http://www.reuters.com/article/us-citi-argentina-m-a-banco-santander-ri-idUSKCN1290SD', 'Beijing, Shanghai propose curbs on who can drive for ride-hailing services': 'http://www.reuters.com/article/us-china-regulations-ride-hailing-idUSKCN1280EL', 'Pope names new group of cardinals, adding to potential successors': 'http://www.reuters.com/article/us-pope-cardinals-idUSKCN1290C9', "Commentary: House of Lies — the uncanny allure of 'Girl on the Train'": 'http://www.reuters.com/article/us-society-entertainment-film-idUSKCN127229', 'Iranian, Iraqi oil ministers will not attend Istanbul talks: sources': 'http://www.reuters.com/article/us-oil-opec-istanbul-idUSKCN1290N2', "South African President Zuma's rape accuser dies: family": 'http://www.reuters.com/article/us-safrica-zuma-idUSKCN1290SX', 'Palestinian kills two people in Jerusalem, then shot dead: police': 
'http://www.reuters.com/article/us-israel-palestinians-violence-idUSKCN129070', 'Matthew swamps North Carolina, still dangerous as it heads out to sea': 'http://www.reuters.com/article/us-storm-matthew-idUSKCN129063', 'Western-backed coalition under pressure over Yemen raid': 'http://www.reuters.com/article/us-yemen-security-coalition-pressure-idUSKCN1290JM', 'Trump strikes defiant tone ahead of debate': 'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ', 'Russia says U.S. actions threaten its national security': 'http://www.reuters.com/article/us-russia-usa-idUSKCN1290DP', 'Pressure on Trump likely to be intense at second debate with Clinton': 'http://www.reuters.com/article/us-usa-election-debate-idUSKCN1290AS', "Sanders supporters seethe over Clinton's leaked remarks to Wall St.": 'http://www.reuters.com/article/us-usa-election-clinton-idUSKCN1280Z9', "Tesla's Musk says will not have to raise funds in fourth-quarter": 'http://www.reuters.com/article/us-tesla-equity-solarcity-idUSKCN1290QW', "Fed's Fischer says job gains solid, expects growth to pick up": 'http://www.reuters.com/article/us-usa-fed-fischer-idUSKCN1290JB', 'Ukraine military postpones withdrawal from town, cites rebel shelling': 'http://www.reuters.com/article/us-ukraine-crisis-withdrawal-idUSKCN1290UL', "Thai king's condition unstable after hemodialysis treatment: palace": 'http://www.reuters.com/article/us-thailand-king-idUSKCN1290R8', 'Earnings season begins as White House race heats up': 'http://www.reuters.com/article/us-usa-stocks-weekahead-idUSKCN1272HS', 'IMF says still engaged with Greece, no decision yet on bailout role': 'http://www.reuters.com/article/us-imf-g20-greece-idUSKCN1290R6', 'Pound slump exacerbates Brexit impact for German exporters: DIHK': 'http://www.reuters.com/article/us-britain-eu-germany-idUSKCN1290TZ', 'Russia expects OPEC to ask non members to consider joining output curb': 'http://www.reuters.com/article/us-oil-opec-russia-idUSKCN1290KD', 'UberEats to expand to Berlin in 2017: Tagesspiegel': 'http://www.reuters.com/article/us-uber-germany-eats-idUSKCN1290OB'}

To be clear, I want to use this data to create a dictionary for each headline and URL pair, like this:

{'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX': 
 {'Title': 'Banks ponder the meaning of life as Deutsche agonizes',
  'Source': 'Reuters', 
  'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'}}

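Thank you very much for taking the time to read this, and thanks in advance for your help.

2 answers:

Answer 0: (score: 1)

Consider a dictionary comprehension: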

newsdict = {v: {'Title': k, 
                'Source': 'Reuters', 
                'Thumbnail': 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'} 
           for k, v in reuters_dictionary.items()}
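
As a minimal, self-contained sketch of the comprehension above: it assumes `reuters_dictionary` maps each headline to its full URL, as in the question's sample output. The two entries are copied from that sample, and the constant names `SOURCE` and `THUMBNAIL` are illustrative, not part of the original code.

```python
# Two headline -> URL pairs copied from the question's sample output.
reuters_dictionary = {
    'Banks ponder the meaning of life as Deutsche agonizes':
        'http://www.reuters.com/article/us-imf-g20-banks-idUSKCN1290DX',
    'Trump strikes defiant tone ahead of debate':
        'http://www.reuters.com/article/us-usa-election-idUSKCN1290JZ',
}

# Illustrative constants; the question stores these in globals named
# source and thumbnail.
SOURCE = 'Reuters'
THUMBNAIL = 'http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png'

# Swap the roles: the URL becomes the outer key, the headline moves inside.
newsdict = {
    link: {'Title': headline, 'Source': SOURCE, 'Thumbnail': THUMBNAIL}
    for headline, link in reuters_dictionary.items()
}

for link, details in newsdict.items():
    print(link, '->', details['Title'])
```

Every inner dictionary uses the same three key names, while the outer keys (the URLs) stay unique.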

Answer 1: (score: 0)

This should give you the result you want:

def build_dict():
    """ Combine everything into a tidy dictionary. """
    full_urls = [i if i.startswith('http') else url + i for i in get_link()]
    reuters_dictionary = {}
    # The loop variable is named full_url so that it does not shadow the
    # global url that the list comprehension above relies on.
    for headline, full_url in zip(get_headline(), full_urls):
        reuters_dictionary[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail
        }
    return reuters_dictionary  # the question's code returned full_urls here

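However, there are no repeated keys here. Why do you feel you need repeated keys?

Also, you should refactor to get rid of those global variables.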
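
One possible shape for that refactor, as a sketch rather than the definitive version: the function below takes everything it needs as parameters instead of reading globals. The parameter names are assumptions, and `urllib.parse.urljoin` from the standard library is swapped in for the `startswith('http')` check used in the question.

```python
from urllib.parse import urljoin


def build_dict(headlines, links, base_url, source, thumbnail):
    """Return {full_url: {'Title', 'Source', 'Thumbnail'}} for each pair."""
    result = {}
    for headline, link in zip(headlines, links):
        # urljoin leaves absolute URLs untouched and resolves relative ones
        # against base_url.
        full_url = urljoin(base_url, link)
        result[full_url] = {
            'Title': headline,
            'Source': source,
            'Thumbnail': thumbnail,
        }
    return result


# Sample values copied from the question.
articles = build_dict(
    headlines=['Trump strikes defiant tone ahead of debate'],
    links=['/article/us-usa-election-idUSKCN1290JZ'],
    base_url='http://www.reuters.com',
    source='Reuters',
    thumbnail='http://logok.org/wp-content/uploads/2014/04/Reuters-logo.png',
)
print(articles)
```

Because nothing is read from module scope, the same function could be reused for another news site by passing different arguments.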

Finally, if you are already using BeautifulSoup, why fall back to regular expressions afterwards? I think using BeautifulSoup throughout would be more robust.
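
A rough sketch of what that could look like, assuming the page markup matches the question's CSS selector. The HTML fragment below is hypothetical (only the hrefs and headlines come from the question's sample data), and it stands in for the live request so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped to match the question's CSS selector.
SAMPLE_HTML = """
<div class="story-content">
  <h3 class="story-title">
    <a href="/article/us-usa-election-idUSKCN1290JZ">Trump strikes defiant tone ahead of debate</a>
  </h3>
</div>
<div class="story-content">
  <h3 class="story-title">
    <a href="/article/us-storm-matthew-idUSKCN129063">Matthew swamps North Carolina, still dangerous as it heads out to sea</a>
  </h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
anchors = soup.select('.story-content > .story-title > a')

# Read the text and the href attribute directly from each <a> tag,
# with no intermediate string and no regular expressions.
headlines = [a.get_text(strip=True) for a in anchors]
links = [a['href'] for a in anchors]

print(headlines)
print(links)
```

Since both lists come from the same tag objects, each headline stays paired with its own link even if the markup inside the tags changes.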