I have a file containing several hundred tweets, none of them separated, all formatted like this:
{"text": "Just posted a photo @ Navarre Conference Center", "created_at": "Sun Nov 13 01:52:03 +0000 2016", "coordinates": [-86.8586, 30.40299]}
I'm trying to split them up so that I can assign each part to a variable:
the text
the timestamp
the location coordinates
I was able to split the tweets apart using .split('{}'), but I don't really know how to break the rest into the three pieces I want.
My basic idea didn't work:
file = open('tweets_with_time.json', 'r')
line = file.readline()
for line in file:
    line = line.split(',')
    message = (line[0])
    timestamp = (line[1])
    position = (line[2])
    # just to test if it's working
    print(position)
Thanks!
Answer 0 (score: 0)
This looks like well-formed JSON data. Try the following:
import json
from pprint import pprint
file_ptr = open('tweets_with_time.json', 'r')
data = json.load(file_ptr)
pprint(data)
That should parse your data into a nice Python structure (a list of dictionaries). You can access elements by name, like:
# Return the first 'coordinates' data point as a list of floats
data[0]["coordinates"]
# Return the 5th 'text' data point as a string
data[4]["text"]
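One caveat: json.load expects the whole file to be a single JSON document, so if the file really is one JSON object per line (as the question describes), it will raise a JSONDecodeError, and each line has to be parsed separately with json.loads. A minimal sketch of that approach; the first sample line is copied from the question, the second is made up for illustration:

```python
import json

# Hypothetical stand-in for the file: two tweets in the
# one-object-per-line format the question describes.
raw_lines = [
    '{"text": "Just posted a photo @ Navarre Conference Center", '
    '"created_at": "Sun Nov 13 01:52:03 +0000 2016", '
    '"coordinates": [-86.8586, 30.40299]}',
    '{"text": "Hello, world", '
    '"created_at": "Sun Nov 13 01:52:04 +0000 2016", '
    '"coordinates": [0.0, 0.0]}',
]

# Parse each line on its own; the result is the same list of
# dicts that json.load would give for a proper JSON list.
tweets = [json.loads(line) for line in raw_lines]

print(tweets[0]["coordinates"])  # -> [-86.8586, 30.40299]
```

With a real file you would build the list with `[json.loads(line) for line in open(fname)]` instead of the hard-coded strings.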
Answer 1 (score: 0)
I just downloaded your file, and it's not as bad as you said: each tweet is on a separate line. It would be nicer if the file were a JSON list, but we can still parse it line by line fairly easily. Here's an example that extracts the first 10 tweets.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
Unfortunately, I can't show the output of this script: Stack Exchange won't let me put those shortened URLs into a post.
Here's a modified version that truncates each message at the URL.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Remove the URL from the message
        idx = message.find('https://')
        if idx != -1:
            message = message[:idx]
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
Output:
1
Message: Just posted a photo @ Navarre Conference Center
Timestamp: Sun Nov 13 01:52:03 +0000 2016
Position: [-86.8586, 30.40299]
2
Message: I don't usually drink #coffee, but I do love a good #Vietnamese drip coffee with condense milk…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-123.04437109, 49.26211779]
3
Message: #bestcurry☝✈️✝#johanvanaarde #kauai #rugby #surfing…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-159.4958861, 22.20321232]
4
Message: #thatonePerezwedding @ Scenic Springs
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-98.68685568, 29.62182898]
5
Message: Miami trends now: Heat, Wade, VeteransDay, OneLetterOffBands and TheyMightBeACatfishIf.
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-80.19240081, 25.78111669]
6
Message: Thank you family for supporting my efforts. I love you all!…
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-117.83012, 33.65558157]
7
Message: If you're looking for work in #HONOLULU, HI, check out this #job:
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-157.7973653, 21.2868901]
8
Message: Drinking a L'Brett d'Apricot by @CrookedStave @ FOBAB —
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-87.6455, 41.8671]
9
Message: Can you recommend anyone for this #job? Barista (US) -
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-121.9766823, 38.350109]
10
Message: He makes me happy @ Frank and Bank
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-75.69360487, 45.41268776]