I downloaded a sample dataset from here; it is a series of JSON objects, one per line. According to the site, each JSON object looks like this:
{
"id": "4cd223df721b722b1c40689caa52932a41fcc223",
"title": "Knowledge-rich, computer-assisted composition of Chinese couplets",
"paperAbstract": "Recent research effort in poem composition has focused on the use of automatic language generation...",
"entities": [
"Conformance testing",
"Natural language generation",
"Natural language processing",
"Parallel computing",
"Stochastic grammar",
"Web application"
],
"s2Url": "https://semanticscholar.org/paper/4cd223df721b722b1c40689caa52932a41fcc223",
"s2PdfUrl": "",
"pdfUrls": [
"https://doi.org/10.1093/llc/fqu052"
],
"authors": [
{
"name": "John Lee",
"ids": [
"3362353"
]
},
"..."
],
"inCitations": [
"c789e333fdbb963883a0b5c96c648bf36b8cd242"
],
"outCitations": [
"abe213ed63c426a089bdf4329597137751dbb3a0",
"..."
],
"year": 2016,
"venue": "DSH",
"journalName": "DSH",
"journalVolume": "31",
"journalPages": "152-163",
"sources": [
"DBLP"
],
"doi": "10.1093/llc/fqu052",
"doiUrl": "https://doi.org/10.1093/llc/fqu052",
"pmid": ""
}
Ultimately, I only need the paperAbstract field. I am loading the file into a pandas DataFrame like this:
import pandas as pd

filename = "sample-S2-records"
df = pd.read_json(filename, lines=True)
df.head()
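This shows the doi and doiUrl columns as entirely empty.
If I select only the abstract column and check its head, I see that 2 of the 5 rows are empty: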
abstract = df['paperAbstract']
abstract.head()
0
1 The search for new administrators in complex s...
2 The human N-formyl peptide receptor (FPR) is a...
3 Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...
4
Name: paperAbstract, dtype: object
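It looks like the way I am creating the DataFrame is not the right one. I am fairly confident the dataset would not be missing entire columns.
What am I missing? Any suggestions?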
Answer 0 (score: 1)
I looked into your data sample, and I think you are getting the correct result. If we parse the JSON manually:
import json
filename = "sample-S2-records"
with open(filename, 'r') as f:
    # One JSON object per line, so parse each line separately
    d = [json.loads(x) for x in f]
and then inspect the resulting list of dictionaries, this is what we see:
>>> d[0]['paperAbstract']
''
So the paperAbstract field of the first record really is empty in the source file; pandas is not dropping anything.
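To see how widespread the empty values are, the same manual parsing can be extended to count, per field, how many records carry no data. The following is only a minimal sketch: the helper name count_empty_fields and the three in-memory records are made up for illustration and stand in for the real sample-S2-records file, and treating empty strings and empty lists as "no data" is an assumption about what counts as missing.

```python
import io
import json
from collections import Counter

# Hypothetical stand-in for the real file: three made-up records in the
# same one-JSON-object-per-line layout.
sample = io.StringIO(
    '{"id": "a", "paperAbstract": "", "doi": ""}\n'
    '{"id": "b", "paperAbstract": "Some abstract text", "doi": ""}\n'
    '{"id": "c", "paperAbstract": "", "doi": "10.0/example"}\n'
)

def count_empty_fields(lines):
    """Return (number of records, Counter of field -> records where it is empty)."""
    empty = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        for key, value in record.items():
            # Assumption: empty strings and empty lists both mean "no data".
            if value == "" or value == []:
                empty[key] += 1
    return total, empty

total, empty = count_empty_fields(sample)
for key, n in sorted(empty.items()):
    print(f"{key}: {n} of {total} records empty")
```

With the real file, passing the open file handle instead of the in-memory sample should work the same way. On the pandas side, a comparison such as df['paperAbstract'] != '' can be used as a boolean mask to keep only the rows that actually have an abstract.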
P.S.: I think this question is worth resolving; I suspect the answer will help others.