Question

My goal is to (1) import Twitter JSON, (2) extract the data of interest, and (3) create a pandas data frame for the variables of interest. Here is my code:

import json
import pandas as pd

tweets = []
for line in open('00.json'):
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue

# Tweets often have missing data, therefore use -if- when extracting "keys"

tweet = tweets[0]

ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df=pd.DataFrame({'Ids':pd.Index(ids),
               'Text':pd.Index(text),
               'Lang':pd.Index(lang),
               'Geo':pd.Index(geo),
               'Place':pd.Index(place)})

# Create a data frame satisfying conditions:
df2 = df[(df['Lang']==('en')) & (df['Geo'].dropna())]

So far, everything seems to work fine.

Now, the extracted value for Geo looks like the following example:

df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}

I have been trying to get rid of everything except the coordinates inside the square brackets.

Please advise on the right approach to obtain the coordinate values.

Answer 1

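The following lines in your question suggest that this is a matter of understanding the underlying data type of the returned object.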

df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}

You are getting back a Python dictionary here, not a string! If you only want the coordinate values, just use the 'coordinates' key to return them, e.g.

df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]

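The object returned in this case is a Python list containing the two coordinate values. If you want only one of the values, you can index into the list, e.g.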

df2.loc[1921,'Geo']['coordinates'][0]
39.11890951

This workflow is much easier than converting the dictionary to a string, parsing the string, and recapturing the coordinate values.
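As a minimal sketch of that idea, the snippet below works directly on the dictionary, with no string handling at all. The value of `geo` is copied from the output in the question; the names `first` and `second` are only illustrative, since the question does not say which coordinate is latitude and which is longitude.

```python
# Sample value copied from the question's output
geo = {'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}

print(type(geo).__name__)   # dict

# Look up the list by key, then unpack its two elements
coords = geo['coordinates']
first, second = coords
print(first)
print(second)
```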

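So, suppose you want to create a new column called "geo_coord0" that contains the coordinate in the first position of each entry (as shown above). You could use something like the following: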

df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]

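This uses a Python list comprehension to iterate over every entry in the df2['Geo'] column and, for each entry, uses the same syntax as above to return the first coordinate value. It then assigns those values to a new column in df2.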
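The list comprehension assumes every entry in the column is a dictionary; a missing value such as None would raise a TypeError. The sketch below shows one way to guard against that. The sample rows are made up for illustration, the `geo_coord1` column is an addition not in the original answer, and `notna()` plus `.copy()` are one possible choice for the filter rather than the asker's exact code.

```python
import pandas as pd

# Hypothetical sample rows standing in for the question's df
df = pd.DataFrame({
    'Lang': ['en', 'en', 'fr'],
    'Geo': [
        {'coordinates': [39.11890951, -84.48903638], 'type': 'Point'},
        None,
        {'coordinates': [48.8566, 2.3522], 'type': 'Point'},
    ],
})

# Keep English tweets that have a Geo value; .copy() gives an
# independent frame, so adding columns does not modify a slice of df
df2 = df[(df['Lang'] == 'en') & df['Geo'].notna()].copy()

# One new column per coordinate position
df2['geo_coord0'] = [g['coordinates'][0] for g in df2['Geo']]
df2['geo_coord1'] = [g['coordinates'][1] for g in df2['Geo']]
print(df2[['geo_coord0', 'geo_coord1']])
```

Taking a copy of the filtered frame also avoids the SettingWithCopyWarning that pandas can emit when a column is assigned on a slice of another data frame.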

For more details on the data structures above, see the Python documentation on data structures.
