How do I remove duplicates from JSON and rearrange the output?

Time: 2018-06-04 20:36:02

Tags: python json

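I would be glad if someone could help me! I am trying to build a web scraper for cashpoint.dk that fetches the football odds for a given URL. As part of this I am extracting the parsed data to JSON (I am also considering an sqlite3 database), but the output of my JSON extraction is really bugging me!

How can I format my JSON output so that it looks like this?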

{
"bettext": "Hvem vinder kampen?"
     "team1": "Rusland"
     "team2": "Saudi Arabien"

     "tip": "1"
        "odds": "138"
     "tip": "3"
        "odds": "460"
     "tip": "2"
        "odds": "926"
}

This is the raw information that it expresses:

- Russia vs. Saudi Arabia,
- Who will win?,
- 1 (Russia) at odds 1,38,
- 3 (Draw) at odds 4,60, 
- 2 (Saudi Arabia) at odds 9,26

{
          "bettext": "Hvem vinder kampen?",
          "odds": "138",
          "team1": "Rusland",
          "team2": "Saudi Arabien",
          "tip": "1"
}
{
          "bettext": "Hvem vinder kampen?",
          "odds": "138",
          "team1": "Rusland",
          "team2": "Saudi Arabien",
          "tip": "1"
}
{
          "bettext": "Hvem vinder kampen?",
          "odds": "460",
          "team1": "Rusland",
          "team2": "Saudi Arabien",
          "tip": "3"
}
{
          "bettext": "Hvem vinder kampen?",
          "odds": "926",
          "team1": "Rusland",
          "team2": "Saudi Arabien",
          "tip": "2"
}

My other problem is that I get exact duplicate objects in the dict. The code below is what I use to run it.

import demjson
import json
import re
from bs4 import BeautifulSoup
import requests

url = "https://www.cashpoint.dk/en/?r=bets/xtra&group=461392&game=312004790"
print(url)

r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

class Scraper():

    def __init__(self):

        self.tables = soup.select('table.sportbet_extra_list_table')

        for table in self.tables:
            self.fields = table.select('.sportbet_extra_rate_content')
            for field in self.fields:
                self.js_obj = re.search('{.+}', field['onclick']).group()
                self.bet = demjson.decode(self.js_obj)
                # print(self.bet)
                # print((self.bet['team1'], self.bet['team2'], self.bet['bettext'], self.bet['tiptext'], self.bet['tip']))

                prettyjson = {
                    'tip':      str(self.bet['tip']),
                    'team1':    str(self.bet['team1']),
                    'team2':    str(self.bet['team2']),
                    'bettext':  str(self.bet['bettext']),
                    'odds':     str(self.bet['odd']),

                }

                dumpit = json.dumps(prettyjson, ensure_ascii=True, sort_keys=True, indent=10, separators=(',', ': '))
                print(dumpit)


                with open('result.json', 'a') as outfile:
                    for sprettyjson in self.bet:
                        json.dump(prettyjson, outfile, ensure_ascii=True, sort_keys=True, indent=10, separators=(',', ': '))
                        outfile.write('\n')

1 Answer:

Answer 0 (score: 0)

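See my comment, which should help clarify your requirements.

My understanding is that you are trying to reduce multiple JSON objects to a single object structure, so that unnecessary data is not repeated.

The first thing to keep in mind is that a JSON object can only have one instance of a tag (key) at each level of scope.

This will not work: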

{
  "tag":"value",
  "tag":"value"
}
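To illustrate why the repeated form is a problem in practice: Python's standard `json` module does not reject repeated keys, it quietly keeps only the last value. The sketch below shows that behaviour; `reject_duplicates` is a hypothetical helper name, and using `object_pairs_hook` this way is just one possible way to surface the issue.

```python
import json

text = '{"tip": "1", "odds": "138", "tip": "3", "odds": "460"}'

# json.loads accepts repeated keys, but only the last value of each survives.
parsed = json.loads(text)
print(parsed)  # {'tip': '3', 'odds': '460'}


def reject_duplicates(pairs):
    """Hypothetical helper: raise instead of silently dropping repeated keys."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate keys: %r" % keys)
    return dict(pairs)


try:
    json.loads(text, object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print("rejected:", exc)
```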

This is fine:

{
  "tag":"value",
  "subtag":{ 
             "tag":"value"
           }
}

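In your case, the "subtag" should be an array of tips objects, which lets you repeat the odds and tip tags as often as needed.

Try rewriting your code to produce the following: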

{
 "bettext": "Hvem vinder kampen?",
 "team1": "Rusland",
 "team2": "Saudi Arabien",
 "tips":[{"tip": "1",
          "odds": "138"},
         {"tip": "3",
          "odds": "460"},
         {"tip": "2",
          "odds": "926"}]
}
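One possible way to get from the flat per-tip dictionaries to this structure is to group them by match and question, skipping tips that have already been seen. This is only a minimal sketch, not the asker's actual scraper: `flat_bets` stands in for the `prettyjson` dictionaries built inside the loop, and `group_bets` is a hypothetical helper name.

```python
import json

# Stand-in for the per-tip dictionaries the scraper builds (one exact duplicate).
MATCH = {"bettext": "Hvem vinder kampen?", "team1": "Rusland", "team2": "Saudi Arabien"}
flat_bets = [
    dict(MATCH, tip="1", odds="138"),
    dict(MATCH, tip="1", odds="138"),
    dict(MATCH, tip="3", odds="460"),
    dict(MATCH, tip="2", odds="926"),
]


def group_bets(bets):
    """Hypothetical helper: merge flat bets into one object per match and question."""
    grouped = {}  # regular dicts keep insertion order on Python 3.7+
    for bet in bets:
        key = (bet["team1"], bet["team2"], bet["bettext"])
        entry = grouped.setdefault(key, {
            "bettext": bet["bettext"],
            "team1": bet["team1"],
            "team2": bet["team2"],
            "tips": [],
        })
        tip = {"tip": bet["tip"], "odds": bet["odds"]}
        if tip not in entry["tips"]:  # drop exact duplicates
            entry["tips"].append(tip)
    return list(grouped.values())


result = group_bets(flat_bets)
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Collecting everything first and writing `result` once with `json.dump` (file opened in `'w'` mode) would also avoid appending several separate objects to `result.json`, which does not form a single valid JSON document. Note too that the inner `for sprettyjson in self.bet` loop in the question iterates over the keys of `self.bet` and writes the same dictionary on every pass, which is a likely source of repeats in the file.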