How to remove useless elements from a dataset

Time: 2019-03-07 10:52:01

Tags: python json preprocessor

I have a dataset that looks like this:


{0: {"address": 0, "ctag": "TOP", "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}), "feats": "", "head": "", "lemma": "", "rel": "", "tag": "TOP", "word": ""},
 1: {"address": 1, "ctag": "Ne", "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}), "feats": "_", "head": 6, "lemma": "اشرف", "rel": "SBJ", "tag": "Ne", "word": "اشرف"},

I want to remove the "deps": ... entry from every element of this dataset. The code I tried does not work, because the value of "deps" is different in each element of the dictionary.

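3 answers:

Answer 0: (score: 1)

The right way would be to fix the code that produces the text file. The defaultdict(<class "list">, {"ROOT": [6, 51]}) fragment suggests that it uses a plain repr where a smarter format is needed.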
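As a rough illustration of what a smarter format could look like, here is a minimal sketch. It assumes the producer holds the records in an in-memory dictionary shaped like the one in the question; the `nodes` variable and its contents are hypothetical. The standard `json` module serializes a `defaultdict` as an ordinary object and converts integer keys to strings, so the output is valid JSON instead of a repr.

```python
import json
from collections import defaultdict

# Hypothetical in-memory records, shaped like the data in the question.
nodes = {
    0: {"address": 0, "tag": "TOP", "deps": defaultdict(list, {"ROOT": [6, 51]})},
    1: {"address": 1, "tag": "Ne", "deps": defaultdict(list, {"NPOSTMOD": [2]})},
}

# json.dumps treats a defaultdict as a plain dict and turns int keys into strings.
text = json.dumps(nodes, ensure_ascii=False, indent=2)

# The result can be read back with json.loads, with no repr debris to clean up.
restored = json.loads(text)
print(restored["0"]["deps"])
```

With a file written this way, dropping "deps" later is a one-line operation on the loaded dictionary.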

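If you cannot fix the real problem, what follows is only a poor man's workaround.

Getting rid of "deps": ... is easy: it is enough to read the file one line at a time and discard every line that starts with "deps" (ignoring leading whitespace). But that is not enough, because the file contains numeric keys, while JSON insists that keys be strings. So the numeric keys must be identified and wrapped in quotes.

This might allow the file to be loaded: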

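import re
import json  # simplejson (import simplejson as json) also works

with open("../data/cleaned.txt", 'r') as fp:
    # skip the "deps" lines and wrap every bare number in double quotes
    k = ''.join(re.sub(r'(?<!\w)(\d+)', r'"\1"', line)
                for line in fp if not line.strip().startswith('"deps"'))

# remove a possible trailing comma
# (re.sub's fourth positional argument is count, so no flag is passed here)
k = re.sub(r',\s*$', '', k)

# uncomment if the file does not contain the last }
# k += '}'

js = json.loads(k)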
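One side effect of the pattern `(?<!\w)(\d+)` is that it quotes every bare number, so a value such as "head": 6 is loaded as the string "6". The sketch below is a variant that quotes a number only when a colon follows it, i.e. when it is used as a key. The `SAMPLE` text, the `NUMERIC_KEY` pattern and the `load_without_deps` helper are all hypothetical names, and the sketch assumes one key per line and no string values that contain digits followed by a colon.

```python
import json
import re

# Hypothetical sample in the same shape as the question's file, one key per line.
SAMPLE = '''{0: {"address": 0,
     "ctag": "TOP",
     "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
     "word": ""},
 1: {"address": 1,
     "ctag": "Ne",
     "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
     "head": 6,
     "word": "اشرف"}}
'''

# Match a bare integer only when it is followed by a colon, i.e. used as a key.
NUMERIC_KEY = re.compile(r'(?<![\w"])(\d+)(?=\s*:)')


def load_without_deps(lines):
    """Drop the "deps" lines, quote numeric keys, and parse the rest as JSON."""
    kept = (line for line in lines if not line.strip().startswith('"deps"'))
    text = ''.join(NUMERIC_KEY.sub(r'"\1"', line) for line in kept)
    text = re.sub(r',\s*\Z', '', text)  # drop a trailing comma, if any
    return json.loads(text)


records = load_without_deps(SAMPLE.splitlines(keepends=True))
print(records["1"]["head"])
```

The keys still come back as strings ("0", "1"), since JSON has no integer keys, but numeric values keep their type.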

Answer 1: (score: 0)

Try this, assuming the file can be parsed as JSON:

import json
with open("../data/cleaned.txt", 'r') as fp:
    data = json.load(fp)
    for key, value in data.items():
        value.pop("deps", None)

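Now you have the data without deps. In case you want to dump the records to a new file:

# json.dump expects an open file object, not a file name
with open("output.json", 'w') as out:
    json.dump(data, out)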
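Because the records contain Persian text, it may be worth controlling how non-ASCII characters are written. The following is a small sketch under assumptions: the `data` dictionary and the temporary output path are hypothetical stand-ins. It removes "deps" with `pop`, then writes UTF-8 output with `ensure_ascii=False` so that words such as "اشرف" stay readable in the file.

```python
import json
import os
import tempfile

# Hypothetical cleaned records; in practice this would be the loaded data.
data = {"1": {"address": 1, "lemma": "اشرف", "deps": {"NPOSTMOD": [2]}}}
for value in data.values():
    value.pop("deps", None)  # no error if a record has no "deps" key

# json.dump writes to a file object, so the output file is opened first.
out_path = os.path.join(tempfile.mkdtemp(), "output.json")
with open(out_path, "w", encoding="utf-8") as out:
    # ensure_ascii=False keeps non-ASCII text as-is instead of \uXXXX escapes.
    json.dump(data, out, ensure_ascii=False, indent=2)

# Read the file back to inspect what was written.
with open(out_path, encoding="utf-8") as check:
    written = check.read()
print(written)
```

Without `ensure_ascii=False` the file is still valid JSON; the text is simply stored as escape sequences.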

Answer 2: (score: 0)

How about this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

data = {0: {"address": 0,
            "ctag": "TOP",
            "deps": 'something',
            "feats": "",
            "head": "",
            "lemma": "",
            "rel": "",
            "tag": "TOP",
            "word": ""},
        1: {"address": 1,
            "ctag": "Ne",
            "deps": 'something',
            "feats": "_",
            "head": 6,
            "lemma": "اشرف",
            "rel": "SBJ",
            "tag": "Ne",
            "word": "اشرف"}}

for value in data.values():
    if 'deps' in value:
        del value['deps']
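If the original dictionary should stay untouched, a non-mutating variant is also possible. This is only a sketch: the shortened `data` below is a hypothetical stand-in for the full records, and `cleaned` is an illustrative name.

```python
# Hypothetical records with the same layout as the full data set.
data = {
    0: {"address": 0, "deps": "something", "tag": "TOP"},
    1: {"address": 1, "deps": "something", "tag": "Ne"},
}

# Build new inner dicts that skip "deps"; the original `data` is not modified.
cleaned = {
    key: {field: val for field, val in record.items() if field != "deps"}
    for key, record in data.items()
}
print(cleaned)
```

The trade-off is memory: the comprehension copies every record, whereas `del` edits the records in place.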