Splitting a list of Twitter data

Time: 2017-11-17 18:39:00

Tags: python list twitter split

I have a file containing hundreds of tweets that are not separated from each other, all formatted like this:

{"text": "Just posted a photo @ Navarre Conference Center", "created_at": "Sun Nov 13 01:52:03 +0000 2016", "coordinates": [-86.8586, 30.40299]}

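I'm trying to split them so that I can assign each part to a variable:

  1. The text

  2. The timestamp

  3. The location coordinates

I was able to split the tweets apart using .split('{}'), but I don't really know how to split the rest into the three things I want.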

My basic idea, which doesn't work:

    file = open('tweets_with_time.json', 'r')
    line = file.readline()

    for line in file:
        line = line.split(',')

        message = (line[0])
        timestamp = (line[1])
        position = (line[2])

        # just to test if it's working
        print(position)

Thanks!

2 answers:

Answer 0: (score: 0)

It looks like well-formed JSON data. Try the following:

import json
from pprint import pprint

# Open the file in a with-block so it is closed automatically
with open('tweets_with_time.json', 'r') as file_ptr:
    data = json.load(file_ptr)
pprint(data)

It should parse your data into nice Python dictionaries. You can then access elements by index and by name, like:

# Return the first 'coordinates' data point as a list of floats
data[0]["coordinates"]

# Return the 5th 'text' data point as a string
data[4]["text"]
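
If `json.load` raises a `JSONDecodeError` complaining about extra data, the file probably holds one JSON object per line rather than a single JSON array. The following is a minimal sketch of a loader that copes with both layouts. The helper name `load_tweets` and the two inline sample strings are hypothetical, made up for illustration only.

```python
import json


def load_tweets(text):
    """Return a list of tweet dicts from a JSON array or from JSON lines."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: assume one JSON object per non-empty line
        return [json.loads(line) for line in text.split("\n") if line.strip()]
    # A lone object parses as a dict; wrap it so callers always get a list
    return data if isinstance(data, list) else [data]


# Hypothetical sample inputs in the two layouts
as_array = '[{"text": "a", "coordinates": [1.0, 2.0]}, {"text": "b", "coordinates": [3.0, 4.0]}]'
as_lines = '{"text": "a", "coordinates": [1.0, 2.0]}\n{"text": "b", "coordinates": [3.0, 4.0]}\n'

tweets_from_array = load_tweets(as_array)
tweets_from_lines = load_tweets(as_lines)
print(tweets_from_array[0]["coordinates"])
print(tweets_from_lines[1]["text"])
```

To use it on the real file, read the whole file into a string (for example with `f.read()` inside a `with open(...)` block) and pass that string to the helper. For very large files, parsing line by line without reading everything first would use less memory.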

Answer 1: (score: 0)

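I just downloaded your file, and it's not as bad as you said. Each tweet is on a separate line. It would be nicer if the file were a JSON list, but we can still parse it line by line fairly easily. Here's an example that extracts the first 10 tweets.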

import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)

        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']

        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()

        #Only print the first 10 tweets
        if i == 10:
            break

Unfortunately, I can't show the output of this script: Stack Exchange won't allow me to put those shortened URLs into a post.

Here's a modified version that cuts each message off at the URL.

import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line to a Python dict
        data = json.loads(line)

        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']

        # Remove the URL from the message
        idx = message.find('https://')
        if idx != -1:
            message = message[:idx]

        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()

        #Only print the first 10 tweets
        if i == 10:
            break

Output

1
Message: Just posted a photo @ Navarre Conference Center 
Timestamp: Sun Nov 13 01:52:03 +0000 2016
Position: [-86.8586, 30.40299]

2
Message: I don't usually drink #coffee, but I do love a good #Vietnamese drip coffee with condense milk… 
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-123.04437109, 49.26211779]

3
Message: #bestcurry☝✈️✝#johanvanaarde #kauai #rugby #surfing… 
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-159.4958861, 22.20321232]

4
Message: #thatonePerezwedding  @ Scenic Springs 
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-98.68685568, 29.62182898]

5
Message: Miami trends now: Heat, Wade, VeteransDay, OneLetterOffBands and TheyMightBeACatfishIf. 
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-80.19240081, 25.78111669]

6
Message: Thank you family for supporting my efforts. I love you all!… 
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-117.83012, 33.65558157]

7
Message: If you're looking for work in #HONOLULU, HI, check out this #job: 
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-157.7973653, 21.2868901]

8
Message: Drinking a L'Brett d'Apricot by @CrookedStave @ FOBAB — 
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-87.6455, 41.8671]

9
Message: Can you recommend anyone for this #job? Barista (US) - 
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-121.9766823, 38.350109]

10
Message: He makes me happy @ Frank and Bank 
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-75.69360487, 45.41268776]
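
The timestamp and position above are still a plain string and a plain list. If you want typed values, here is a rough sketch that converts them. It assumes every `created_at` value follows the layout shown in the sample, that the month and weekday names are in English (`%a` and `%b` depend on the locale), and that `coordinates` is ordered longitude first, which matches the sample values but is an assumption here. The function name `parse_tweet` is hypothetical.

```python
import json
from datetime import datetime

# Layout of the sample timestamps, e.g. "Sun Nov 13 01:52:03 +0000 2016"
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet(line):
    """Turn one JSON line into (text, datetime, longitude, latitude)."""
    data = json.loads(line)
    created = datetime.strptime(data["created_at"], TIME_FORMAT)
    # Assumption: the pair is [longitude, latitude]
    longitude, latitude = data["coordinates"]
    return data["text"], created, longitude, latitude


# The first tweet from the question, used as sample input
sample = (
    '{"text": "Just posted a photo @ Navarre Conference Center", '
    '"created_at": "Sun Nov 13 01:52:03 +0000 2016", '
    '"coordinates": [-86.8586, 30.40299]}'
)

message, created, longitude, latitude = parse_tweet(sample)
print(message)
print(created.isoformat())
print(latitude, longitude)
```

Having a timezone-aware `datetime` makes it straightforward to sort tweets by time or compute the gap between two of them.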