Question

I have a DataFrame called information, which was obtained like this:

information = pd.DataFrame.from_dict(docs.json()["hits"]["hits"])

information contains objects of type news. For each news item, I only want _source:

    _id                                       _index           _score    _source                                             _type
0   c0b0773f94fc91938709edccf1ec4e3039e7576b  luxurynsight_v2  6.023481  {'importer': 'APItay', 'releasedAt': 147621242...  news
1   9ce6d7e015dc28497ff8ccd4915cf4104188107d  luxurynsight_v2  6.015883  {'importer': 'APItay', 'releasedAt': 152717820...  news
...

Within each _source, I only want name and createdAt.

For example, here is one of the news items:

_index  _type   _id _score  _source
_headers    luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [{'header': 'date', 'value': 'Fri, 23 Feb 2018...
_opengraph  luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [{'header': 'og_locale', 'value': 'en_US'}, {'...
_sums   luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [{'sum': 'decfedbfae938da88e93e75c7ebb4dc9', '...
_tags   luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [{'visible': True, 'name': 'Gucci', 'count': 3...
_users  luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [{'permission': 'public', 'id': 0}]
archive luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    True
authors luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    []
catalogs    luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [Luxurynsight]
cleanUrl    luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    http://www.cpp-luxury.com/gucci-debuts-art-ins...
contentType luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    text/html
createdAt   luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    1508510973592
domain  luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    www.cpp-luxury.com
excerpt luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    Gucci debuts art installation at its Ginza sto...
foundOn luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [excerpt, name]
iframe  luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    True
importer    luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    APItay
language    luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    en-US
name    luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    Gucci debuts art installation at its Ginza sto...
plainCategories luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [AutomaticBrands, Market, AutomaticPeople, Tag]
plainTags   luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    [Gucci, Market_Japan, Alessandro Michele, Tag_...
previewImage    luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    http://www.cpp-luxury.com/wp-content/uploads/2...
publishedAt luxurynsight_v2 news    c0b0773f94fc91938709edccf1ec4e3039e7576b    6.023481    1476212420000

The expected result is:

createAt    names
2007-01-01  What Sticks from '06. Somalia Orders Islamist...
2007-01-02  Heart Health: Vitamin Does Not Prevent Death ...
2007-01-03  Google Answer to Filling Jobs Is an Algorithm...

My attempt

>>> information._source
0    {'importer': 'APItay', 'releasedAt': 147621242...
1    {'importer': 'APItay', 'releasedAt': 152717820...
2    {'importer': 'APItay', 'releasedAt': 152418240...

The problem is that this gives a column of dictionaries. How can I convert it into a DataFrame? Or is there another way to do this?

I also tried...

import ast
information._source = information._source.apply(lambda x: ast.literal_eval(x))

# Store in a new column
df['name'] = information._source.apply(lambda x: x['name'])

# Store in a new column
df['createAt'] = information._source.apply(lambda x: x['createAt'])

But it gives me a ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-194-968302937df5> in <module>
      1 import ast
----> 2 information._source = information._source.apply(lambda x: ast.literal_eval(x))
      3 
      4 # Store in a new column
      5 df['name'] = information._source.apply(lambda x: x['name'])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3192             else:
   3193                 values = self.astype(object).values
-> 3194                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3195 
   3196         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-194-968302937df5> in <lambda>(x)
      1 import ast
----> 2 information._source = information._source.apply(lambda x: ast.literal_eval(x))
      3 
      4 # Store in a new column
      5 df['name'] = information._source.apply(lambda x: x['name'])

C:\ProgramData\Anaconda3\lib\ast.py in literal_eval(node_or_string)
     83                     return left - right
     84         raise ValueError('malformed node or string: ' + repr(node))
---> 85     return _convert(node_or_string)
     86 
     87 

C:\ProgramData\Anaconda3\lib\ast.py in _convert(node)
     82                 else:
     83                     return left - right
---> 84         raise ValueError('malformed node or string: ' + repr(node))
     85     return _convert(node_or_string)
     86 

ValueError: malformed node or string: {'importer': 'APItay', 'releasedAt': 1476212420000, '_tags': [{'visible': True, 'name': 'Gucci', 'count': 39, 'id': 'Gucci', 'category': ['AutomaticBrands']}, {'visible': False, 'name': 'MLI1', 'count': 39, 'id': 'staffTagging_MLI1', 'category': ['staffTagging']}, {'visible': True, 'name': 'Japan', 'count': 19, 'id': 'Market_Japan', 'category': ['Market']}, {'visible': False, 'name': 'KBN', 'count': 4, 'id': 'staffTagging_KBN', 'category': ['staffTagging']}, {'visible': False, 'name': 'JLE',

Data

import json
import requests

def create_doc(uri, doc_data=None):
    """Send a search query to the given URI and return the response."""
    query = json.dumps(doc_data or {})
    response = requests.post(uri, data=query)
    return response

doc_data = {
  "size": 10,
  "query": {
    "bool": {
      "must" : [
       {"term":{"text":"gucci"}}
     ]
    }
  }
 }

docs = create_doc("https://elastic:rKzWu2WbXI@db.luxurynsight.com/luxurynsight_v2/news/_search",doc_data)

Answer 1

Verified answer to the question:

import pandas as pd

# Reading the JSON file
df = pd.read_json('file.json')

# Converting the element wise _source feature datatype to dictionary
df._source = df._source.apply(lambda x: dict(x))

# Creating name column
df['name'] = df._source.apply(lambda x: x['name'])

# Creating createdAt column
df['createdAt'] = df._source.apply(lambda x: x['createdAt'])
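
As a side note, the ValueError in the question most likely comes from the fact that the _source values are already Python dicts, and ast.literal_eval only accepts strings (or AST nodes). Below is a minimal sketch of an alternative that expands the dicts into columns in one step. The hits list is a hypothetical stand-in for docs.json()["hits"]["hits"]: the first createdAt value is taken from the question, everything else in it is made up. Treating createdAt as epoch milliseconds is an assumption based on that sample value.

```python
import ast

import pandas as pd

# Hypothetical stand-in for docs.json()["hits"]["hits"].
# Only the first createdAt comes from the question; the rest is invented.
hits = [
    {"_type": "news",
     "_source": {"name": "Gucci debuts art installation",
                 "createdAt": 1508510973592}},
    {"_type": "news",
     "_source": {"name": "Hypothetical second headline",
                 "createdAt": 1500000000000}},
]
information = pd.DataFrame(hits)

# Each _source value is already a dict, so literal_eval has nothing to parse.
try:
    ast.literal_eval(information["_source"].iloc[0])
    literal_eval_failed = False
except ValueError:
    literal_eval_failed = True

# Expand the column of dicts into a DataFrame, one column per key.
source = pd.DataFrame(information["_source"].tolist(), index=information.index)

# Keep only the two wanted fields.
result = source[["createdAt", "name"]].copy()

# Assumption: createdAt is epoch milliseconds; convert it to a calendar date.
result["createdAt"] = pd.to_datetime(result["createdAt"], unit="ms").dt.date

print(result)
```

If some documents lack one of the keys, pd.DataFrame fills the gap with NaN instead of raising a KeyError, which the per-row x['name'] lookup would do.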
