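Removing duplicate entries from a JSON file - BeautifulSoup

Asked: 2018-05-03 17:19:21

Tags: python json beautifulsoup

I am running a script that scrapes a website for textbook information, and the script itself runs fine. However, when it writes to the JSON file, I get duplicate results. I am trying to figure out how to remove the duplicates from the JSON file. Here is my code: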

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

urls = ['https://open.bccampus.ca/find-open-textbooks/', 
'https://open.bccampus.ca/find-open-textbooks/?start=10']

data = []
#opening up connection and grabbing page
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each textbook
    containers = page_soup.findAll("h4")

    for container in containers:
       item = {}
       item['type'] = "Textbook"
       item['title'] = container.parent.a.text
       item['author'] = container.nextSibling.findNextSibling(text=True)
       item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + container.parent.a["href"]
       item['source'] = "BC Campus"
       data.append(item) # add the item to the list

with open("./json/bc.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Here is a sample of the JSON output:

{
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Exploring Movie Construction and Production",
"author": " John Reich, SUNY Genesee Community College",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}, {
"type": "Textbook",
"title": "Project Management",
"author": " Adrienne Watt",
"link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
"source": "BC Campus"
}

3 answers:

Answer 0 (score: 1)

Figured it out. Here is the solution for anyone else who runs into this problem:

textbook_list = []
for item in data:
    if item not in textbook_list:
        textbook_list.append(item)

with open("./json/bc.json", "w") as writeJSON:
    json.dump(textbook_list, writeJSON, ensure_ascii=False)
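
The `item not in textbook_list` check compares each dictionary against every entry kept so far, which is fine for a few dozen records. For longer lists, one possible variant tracks already-seen values in a set. This is only a sketch: it assumes the `link` field identifies a book uniquely, the helper name `dedupe_by_key` is hypothetical, and the records below are made up for illustration.

```python
def dedupe_by_key(records, key="link"):
    """Keep the first record seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Made-up records standing in for the scraped `data` list.
data = [
    {"title": "Book A", "link": "https://example.com/?uuid=a"},
    {"title": "Book A", "link": "https://example.com/?uuid=a"},
    {"title": "Book B", "link": "https://example.com/?uuid=b"},
    {"title": "Book B", "link": "https://example.com/?uuid=b"},
]

textbook_list = dedupe_by_key(data)
print([book["title"] for book in textbook_list])  # ['Book A', 'Book B']
```

Unlike the whole-dictionary comparison, this treats two records as duplicates whenever their links match, even if other fields differ.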

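Answer 1 (score: 0)

You don't need to remove any duplicates at all.

All you need to do is update your code.

Please read on. I have provided a detailed explanation of this problem. Also, don't forget to check the gist I wrote while debugging the code: https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c

» Where is the problem?

I understand you are asking this because you are getting duplicate dictionaries.

This happens because you select h4 elements as the containers, and on the specified pages, https://open.bccampus.ca/find-open-textbooks/ and https://open.bccampus.ca/find-open-textbooks/?start=10, the details of each book contain 2 h4 elements.

That is why, instead of getting 20 containers in total (10 per page), you get twice as many: 40 in total (20 per page), where each item is an h4 element.

Each of these 40 items is a distinct element, but the problem lies in selecting the parent. Both h4 elements of a book share the same parent, so container.parent.a returns the same element and therefore the same text.

Let's clarify the problem using the following dummy markup.

Note: the gist at https://gist.github.com/hygull/44cfdc1d4e703b70eb14f16fec14bf2c contains the Python code I created to debug and solve this problem. It may give you some ideas.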

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>
<li> <!-- 2nd book -->
    <h4>
        <a> Text 3 </a>
    </h4>
    <h4>
        <a> Text 4 </a>
    </h4>
</li>
...
...
<li> <!-- 20th book -->
    <h4>
        <a> Text 39 </a>
    </h4>
    <h4>
        <a> Text 40 </a>
    </h4>
</li>

»» containers = page_soup.find_all("h4") gives the following list of h4 elements.

[
    <h4>
        <a> Text 1 </a>
    </h4>,
    <h4>
        <a> Text 2 </a>
    </h4>,
    <h4>
        <a> Text 3 </a>
    </h4>,
    <h4>
        <a> Text 4 </a>
    </h4>,
    ...
    ...
    ...
    <h4>
        <a> Text 39 </a>
    </h4>,
    <h4>
        <a> Text 40 </a>
    </h4>
]

»» In your code, on the first iteration of the inner for loop, the container variable refers to the element below.

<h4>
    <a> Text 1 </a>
</h4>

»» On the second iteration, the container variable refers to the element below.

<h4>
    <a> Text 2 </a>
</h4>

»» In both of the above (1st and 2nd) iterations of the inner for loop, container.parent gives the following element.

<li> <!-- 1st book -->
    <h4>
        <a> Text 1 </a>
    </h4>
    <h4>
        <a> Text 2 </a>
    </h4>
</li>

»» container.parent.a gives the following element.

<a> Text 1 </a>

»» Finally, container.parent.a.text gives the following text as the title in both of the first two dictionaries.

Text 1

That is why we get duplicate dictionaries: the dynamic title & author values come out the same as well.

Let's work through this problem step by step.

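» Web page details:

  1. We have links to 2 web pages.
  2. Each web page has the details of 10 textbooks.
  3. Each book's details contain 2 h4 elements.
  4. In total, 2x10x2 = 40 h4 elements.

» Our goal:

  1. Our goal is to get an array/list of only 20 dictionaries rather than 40.
  2. So we need to iterate over the containers list in steps of 2 items, i.e. just skip 1 item on each iteration.

» Modified working code: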

        from urllib.request import urlopen
        from bs4 import BeautifulSoup as soup
        import json
        
        urls = [
          'https://open.bccampus.ca/find-open-textbooks/', 
          'https://open.bccampus.ca/find-open-textbooks/?start=10'
        ]
        
        data = []
        
        #opening up connection and grabbing page
        for url in urls:
            uClient = urlopen(url)
            page_html = uClient.read()
            uClient.close()
        
            #html parsing
            page_soup = soup(page_html, "html.parser")
        
            #grabs info for each textbook
            containers = page_soup.find_all("h4")
        
            for index in range(0, len(containers), 2):
                item = {}
                item['type'] = "Textbook"
                item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
                item['source'] = "BC Campus"
                item['title'] = containers[index].parent.a.text
                item['authors'] = containers[index].nextSibling.findNextSibling(text=True)
        
                data.append(item) # add the item to the list (inside the inner loop, once per book)
        
        with open("./json/bc-modified-final.json", "w") as writeJSON:
            json.dump(data, writeJSON, ensure_ascii=False)
        

» Output:

        [
            {
                "type": "Textbook",
                "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
                "authors": " Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
                "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
                "source": "BC Campus"
            },
            {
                "type": "Textbook",
                "title": "Exploring Movie Construction and Production",
                "authors": " John Reich, SUNY Genesee Community College",
                "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
                "source": "BC Campus"
            },
            {
                "type": "Textbook",
                "title": "Project Management",
                "authors": " Adrienne Watt",
                "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8678fbae-6724-454c-a796-3c6667d826be&contributor=&keyword=&subject=",
                "source": "BC Campus"
            },
            ...
            ...
            ...
            {
                "type": "Textbook",
                "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
                "authors": " Michelle Bonczek Evory. Western Michigan University",
                "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
                "source": "BC Campus"
            }
        ]
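
The step-of-2 iteration used in the modified code can be tried out without any network access. The following is a small sketch in which plain strings stand in for the h4 elements (the values are made up); it assumes, as the answer does, that every book contributes exactly 2 consecutive h4 elements. If a book had a different number, the pairing would drift.

```python
# Plain strings standing in for the h4 elements of three books
# (made-up values; each book contributes a title h4 and a categories h4).
containers = [
    "title-1", "categories-1",
    "title-2", "categories-2",
    "title-3", "categories-3",
]

# Index-based stride, as in the modified code.
by_index = [containers[i] for i in range(0, len(containers), 2)]

# Equivalent slice with a step of 2.
by_slice = containers[::2]

# Pair each title element with the categories element that follows it.
pairs = list(zip(containers[0::2], containers[1::2]))

print(by_index)   # ['title-1', 'title-2', 'title-3']
print(pairs[0])   # ('title-1', 'categories-1')
```

The slice form avoids manual index arithmetic, while the `zip` form is convenient when both elements of a book are needed, as in the enhanced version.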
        

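Finally, I modified your code further and added more details (description, date & categories) to the dictionary objects.

Python version: 3.6

Dependency: pip install beautifulsoup4

» Modified working code (enhanced version):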

        from urllib.request import urlopen
        from bs4 import BeautifulSoup as soup
        import json
        
        urls = [
            'https://open.bccampus.ca/find-open-textbooks/', 
            'https://open.bccampus.ca/find-open-textbooks/?start=10'
        ]
        
        data = []
        
        #opening up connection and grabbing page
        for url in urls:
            uClient = urlopen(url)
            page_html = uClient.read()
            uClient.close()
        
            #html parsing
            page_soup = soup(page_html, "html.parser")
        
            #grabs info for each textbook
            containers = page_soup.find_all("h4")
        
            for index in range(0, len(containers), 2):
                item = {}
        
                # Store the book's information as given on the web page (all 5 are dynamic)
                item['title'] = containers[index].parent.a.text
                item["catagories"] = [a_tag.text for a_tag in containers[index + 1].find_all('a')]
                item['authors'] = containers[index].nextSibling.findNextSibling(text=True).strip()
                item['date'] = containers[index].parent.find_all("strong")[1].findNextSibling(text=True).strip()
                item["description"] = containers[index].parent.p.text.strip()
        
                # Store extra information (1st is dynamic, last 2 are static)
                item['link'] = "https://open.bccampus.ca/find-open-textbooks/" + containers[index].parent.a["href"]
                item['source'] = "BC Campus"
                item['type'] = "Textbook"
        
                data.append(item) # add the item to the list
        
        with open("./json/bc-modified-final-my-own-version.json", "w") as writeJSON:
            json.dump(data, writeJSON, ensure_ascii=False)
        

» Output (enhanced version):

        [
            {
                "title": "Vital Sign Measurement Across the Lifespan - 1st Canadian edition",
                "catagories": [
                    "Ancillary Resources"
                ],
                "authors": "Jennifer L. Lapum, Margaret Verkuyl, Wendy Garcia, Oona St-Amant, Andy Tan, Ryerson University",
                "date": "May 3, 2018",
                "description": "Description: The purpose of this textbook is to help learners develop best practices in vital sign measurement. Using a multi-media approach, it will provide opportunities to read about, observe, practice, and test vital sign measurement.",
                "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=feacda80-4fc1-40a5-b713-d6be6a73abe4&contributor=&keyword=&subject=",
                "source": "BC Campus",
                "type": "Textbook"
            },
            {
                "title": "Exploring Movie Construction and Production",
                "catagories": [
                    "Adopted"
                ],
                "authors": "John Reich, SUNY Genesee Community College",
                "date": "May 2, 2018",
                "description": "Description: Exploring Movie Construction and Production contains eight chapters of the major areas of film construction and production. The discussion covers theme, genre, narrative structure, character portrayal, story, plot, directing style, cinematography, and editing. Important terminology is defined and types of analysis are discussed and demonstrated. An extended example of how a movie description reflects the setting, narrative structure, or directing style is used throughout the book to illustrate ...[more]",
                "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=19892992-ae43-48c4-a832-59faa1d7108b&contributor=&keyword=&subject=",
                "source": "BC Campus",
                "type": "Textbook"
            },
            ...
            ...
            ...
            {
                "title": "Naming the Unnamable: An Approach to Poetry for New Generations",
                "catagories": [],
                "authors": "Michelle Bonczek Evory. Western Michigan University",
                "date": "Apr 27, 2018",
                "description": "Description: Informed by a writing philosophy that values both spontaneity and discipline, Michelle Bonczek Evory’s Naming the Unnameable: An Approach to Poetry for New Generations  offers practical advice and strategies for developing a writing process that is centered on play and supported by an understanding of America’s rich literary traditions. With consideration to the psychology of invention, Bonczek Evory provides students with exercises aimed to make writing in its early stages a form of play that ...[more]",
                "link": "https://open.bccampus.ca/find-open-textbooks/?uuid=8880b4d1-7f62-42fc-a912-3015f216f195&contributor=&keyword=&subject=",
                "source": "BC Campus",
                "type": "Textbook"
            }
        ]
        

That's it. Thanks.

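Answer 2 (score: 0)

It is better to use a set data structure instead of a list. A set does not preserve order, but unlike a list it does not store duplicates.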

Change your code so that the list initialization (data = [] in the question) becomes

data = set()

and

data.append(item)

becomes

data.add(item)
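
One caveat when trying this: Python dictionaries are not hashable, so calling `data.add(item)` with a dict raises a `TypeError`, and `json.dump` cannot serialize a set directly. A possible workaround, sketched below, keeps a set of hashable "fingerprints" next to an ordinary list. The helper name `unique_dicts` is hypothetical, and the sketch assumes every value in a record is itself hashable (for example, a string).

```python
import json


def unique_dicts(records):
    """Drop exact duplicate dictionaries, keeping the first of each in order."""
    seen = set()
    unique = []
    for record in records:
        # A tuple of sorted (key, value) pairs is hashable when the values are.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# Adding a dict to a set directly fails because dicts are unhashable.
try:
    set().add({"title": "Project Management"})
    add_failed = False
except TypeError:
    add_failed = True

records = [
    {"type": "Textbook", "title": "Project Management", "author": " Adrienne Watt"},
    {"type": "Textbook", "title": "Project Management", "author": " Adrienne Watt"},
    {"type": "Textbook", "title": "Exploring Movie Construction and Production",
     "author": " John Reich, SUNY Genesee Community College"},
]

result = unique_dicts(records)
text = json.dumps(result, ensure_ascii=False)  # a list serializes fine
print(len(result))  # 2
```

Because the result is still a list, the original order is preserved and it can be passed to `json.dump` unchanged.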
