I have a file containing several hundred tweets, none of them separated, all formatted like this:
{"text": "Just posted a photo @ Navarre Conference Center", "created_at": "Sun Nov 13 01:52:03 +0000 2016", "coordinates": [-86.8586, 30.40299]}
I'm trying to split them up so that I can assign each part to a variable:
the text
the timestamp
the location coordinates
I was able to split the tweets apart using .split('{}'), but I don't really know how to break the rest into the three pieces I want.
My basic idea didn't work:
file = open('tweets_with_time.json', 'r')
line = file.readline()
for line in file:
    line = line.split(',')
    message = (line[0])
    timestamp = (line[1])
    position = (line[2])
    # just to test if it's working
    print(position)
Thanks!
Answer 0 (score: 0)
This looks like well-formed JSON data. Try the following:
import json
from pprint import pprint
file_ptr = open('tweets_with_time.json', 'r')
data = json.load(file_ptr)
pprint(data)
That should parse your data into a nice Python structure (a list of dictionaries). You can access elements by name, like:
# Return the first 'coordinates' data point as a list of floats
data[0]["coordinates"]
# Return the 5th 'text' data point as a string
data[4]["text"]
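One caveat: json.load expects the whole file to be a single JSON document, so if the file really is one JSON object per line (as the question describes), it will raise a JSONDecodeError, and each line has to be parsed separately with json.loads. A minimal sketch of that approach; the first sample line is copied from the question, the second is made up for illustration:

```python
import json

# Hypothetical stand-in for the file: two tweets in the
# one-object-per-line format the question describes.
raw_lines = [
    '{"text": "Just posted a photo @ Navarre Conference Center", '
    '"created_at": "Sun Nov 13 01:52:03 +0000 2016", '
    '"coordinates": [-86.8586, 30.40299]}',
    '{"text": "Hello, world", '
    '"created_at": "Sun Nov 13 01:52:04 +0000 2016", '
    '"coordinates": [0.0, 0.0]}',
]

# Parse each line on its own; the result is the same list of
# dicts that json.load would give for a proper JSON list.
tweets = [json.loads(line) for line in raw_lines]

print(tweets[0]["coordinates"])  # -> [-86.8586, 30.40299]
```

With a real file you would build the list with `[json.loads(line) for line in open(fname)]` instead of the hard-coded strings.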
Answer 1 (score: 0)
I just downloaded your file, and it's not as bad as you said: each tweet is on a separate line. It would be nicer if the file were a JSON list, but we can still parse it line by line fairly easily. Here's an example that extracts the first 10 tweets.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
Unfortunately, I can't show the output of this script: Stack Exchange won't let me put those shortened URLs into a post.
Here's a modified version that truncates each message at the URL.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Remove the URL from the message
        idx = message.find('https://')
        if idx != -1:
            message = message[:idx]
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
Output:
1
Message: Just posted a photo @ Navarre Conference Center
Timestamp: Sun Nov 13 01:52:03 +0000 2016
Position: [-86.8586, 30.40299]
2
Message: I don't usually drink #coffee, but I do love a good #Vietnamese drip coffee with condense milk…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-123.04437109, 49.26211779]
3
Message: #bestcurry☝✈️✝#johanvanaarde #kauai #rugby #surfing…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-159.4958861, 22.20321232]
4
Message: #thatonePerezwedding @ Scenic Springs
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-98.68685568, 29.62182898]
5
Message: Miami trends now: Heat, Wade, VeteransDay, OneLetterOffBands and TheyMightBeACatfishIf.
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-80.19240081, 25.78111669]
6
Message: Thank you family for supporting my efforts. I love you all!…
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-117.83012, 33.65558157]
7
Message: If you're looking for work in #HONOLULU, HI, check out this #job:
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-157.7973653, 21.2868901]
8
Message: Drinking a L'Brett d'Apricot by @CrookedStave @ FOBAB —
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-87.6455, 41.8671]
9
Message: Can you recommend anyone for this #job? Barista (US) -
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-121.9766823, 38.350109]
10
Message: He makes me happy @ Frank and Bank
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-75.69360487, 45.41268776]