Question

I downloaded a sample dataset from here; it is a series of JSON objects. According to the site, each JSON object looks like this:

{
  "id": "4cd223df721b722b1c40689caa52932a41fcc223",
  "title": "Knowledge-rich, computer-assisted composition of Chinese couplets",
  "paperAbstract": "Recent research effort in poem composition has focused on the use of automatic language generation...",
  "entities": [
    "Conformance testing",
    "Natural language generation",
    "Natural language processing",
    "Parallel computing",
    "Stochastic grammar",
    "Web application"
  ],
  "s2Url": "https://semanticscholar.org/paper/4cd223df721b722b1c40689caa52932a41fcc223",
  "s2PdfUrl": "",
  "pdfUrls": [
    "https://doi.org/10.1093/llc/fqu052"
  ],
  "authors": [
    {
      "name": "John Lee",
      "ids": [
        "3362353"
      ]
    },
    "..."
  ],
  "inCitations": [
    "c789e333fdbb963883a0b5c96c648bf36b8cd242"
  ],
  "outCitations": [
    "abe213ed63c426a089bdf4329597137751dbb3a0",
    "..."
  ],
  "year": 2016,
  "venue": "DSH",
  "journalName": "DSH",
  "journalVolume": "31",
  "journalPages": "152-163",
  "sources": [
    "DBLP"
  ],
  "doi": "10.1093/llc/fqu052",
  "doiUrl": "https://doi.org/10.1093/llc/fqu052",
  "pmid": ""
}

Ultimately, I only need the paperAbstract field. I am loading the file into a pandas DataFrame like this:

import pandas as pd

filename = "sample-S2-records"
df = pd.read_json(filename, lines=True)  # one JSON object per line
df.head()

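This shows the doi and doiUrl columns as entirely empty.

If I select only the abstract column and check the head, I see that 2 of the 5 rows are empty: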

abstract = df['paperAbstract']
abstract.head()

0                                                     
1    The search for new administrators in complex s...
2    The human N-formyl peptide receptor (FPR) is a...
3    Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...
4                                                     
Name: paperAbstract, dtype: object

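It looks like the way I am creating the DataFrame is not correct. I am fairly confident that the data is not missing any columns.

What am I missing? Any suggestions?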

Answer 1

I looked into your data sample and I think you are getting the correct result. If we parse the JSON manually:

import json

filename = "sample-S2-records"
# The file holds one JSON object per line, so parse it line by line.
with open(filename, 'r') as f:
    d = [json.loads(x) for x in f]

and then inspect the list of dictionaries, this is what we see:

>>> d[0]['paperAbstract']
''

So the paperAbstract field of the first record really is empty in the source file.
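
If the goal is just to work with the records that do have an abstract, something along these lines may help. This is only a rough sketch: the three records below are made-up stand-ins for the real file (the ids, abstract text and doi are hypothetical), and the names empty_mask, n_empty and abstracts are mine. The point it illustrates is that an empty string is a real value, so isna() will not report it and you have to test for "" explicitly.

```python
import io

import pandas as pd

# Hypothetical stand-in for the sample file: one JSON object per line.
# These records are invented for illustration, not taken from the dataset.
sample = io.StringIO(
    '{"id": "a", "paperAbstract": "", "doi": ""}\n'
    '{"id": "b", "paperAbstract": "example abstract text", "doi": ""}\n'
    '{"id": "c", "paperAbstract": "", "doi": "10.1000/example"}\n'
)

df = pd.read_json(sample, lines=True)

# Empty strings are not missing values, so isna() does not flag them.
n_missing = int(df["paperAbstract"].isna().sum())

# Test for empty (or whitespace-only) abstracts explicitly instead.
empty_mask = df["paperAbstract"].str.strip().eq("")
n_empty = int(empty_mask.sum())

# Keep only the rows that actually carry an abstract.
abstracts = df.loc[~empty_mask, "paperAbstract"].reset_index(drop=True)

print(n_missing, n_empty, len(abstracts))
```

With the real file you would replace the StringIO object with the filename and keep the rest unchanged.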

P.S.: I think this question should be marked as resolved; I doubt it will help others.
