Question

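This is my first text-mining project and my first time using pandas. I am trying to collect all the strings under the "text" key in downloaded live tweets (JSON format) so that I can tokenize all the tweets and count the high-frequency words. Here is a sample tweet in JSON format: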

{
    "contributors": null, 
    "truncated": false, 
    "text": "Hey Don : TheCougCoach :) Want to get iPh0ne 6 for FREE? Kindly check my bi0. Thx https://t.co/c38b8vqq2O", 
    "is_quote_status": true, 
    "in_reply_to_status_id": null, 
    "id": 659549062023262209, 
    "favorite_count": 0, 
     ...... skip
     },
    "quoted_status_id": 659548944251228160, 
        "retweeted": false, 
        "coordinates": null, 
        "timestamp_ms": "1446083724872", 
        "quoted_status": {
            "contributors": null, 
            "truncated": false, 
            "text": "I understand He is a criminal but Donald has all the right to be in the discussion. https://t.co/qv3oScGA1U", 
            "is_quote_status": true, 
            "in_reply_to_status_id": null,

Here is my code (Python 2.7 + pandas 0.17.0):

import json  
import pandas as pd
tweets_data_path = 'tweet.txt'
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

tweets = pd.DataFrame()

tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)

print tweets['text']

print tweets['text'].astype(str)  # Try to convert the pandas Series to strings so I can tokenize the tweets (the strings under "text" in the JSON) with regular expressions

Here is the output:

0     Hey Don : TheCougCoach :) Want to get iPh0ne 6...
1     I understand He is a criminal but Donald has a...
Name: text, dtype: object

UnicodeEncodeError: 'ascii' codec can't encode characters in position 125-126: ordinal not in range(128)

Two questions:

(1) tweets = pd.DataFrame()

tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)

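Here pandas, together with map/lambda, provides a simple way to get the data under "text" in the tweet JSON file. However, "map" only allows lists of matching length, which leaves the output incomplete (ending in ...). Is there a better way to code this?

(2)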

UnicodeEncodeError: 'ascii' codec can't encode characters in position 125-126: ordinal not in range(128)

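It seems the input "tweet.txt" is Unicode, and that is why we hit this error? If so, should we encode "tweet.txt" when reading it? The actual input file is very large (several GB or more), so is there a more efficient way to solve this? Thanks.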

Answer 1

Don't load the JSON file line by line. The json module supports loading the whole file at once:

with open(tweets_data_path) as fp:
    tweets_data = json.load(fp)

Now step through tweets_data as you would normally step through lists and dicts.
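A minimal sketch of stepping through the parsed data, written for Python 3. It assumes the file holds a JSON array of tweet objects; the sample string and the helper name collect_texts are made up for illustration and are not taken from the asker's file.

```python
import json

# Hypothetical sample: a JSON array of two tweet-like objects.
sample = '''[
  {"text": "Hey Don", "quoted_status": {"text": "I understand"}},
  {"id": 1}
]'''

tweets_data = json.loads(sample)  # a list of dicts


def collect_texts(tweets):
    """Return the "text" value of every tweet dict that has one."""
    return [tweet["text"] for tweet in tweets if "text" in tweet]


texts = collect_texts(tweets_data)

# Nested objects are ordinary dicts too, so they are indexed the same way.
nested = tweets_data[0]["quoted_status"]["text"]

print(texts)
print(nested)
```

The "text" membership check skips records without that key, which otherwise would raise KeyError inside a map/lambda.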

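The point is that JSON does not require a newline after each key-value entry; the text file just happens to be formatted that way, and you should not rely on it.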
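To illustrate why relying on line breaks is fragile, here is a small Python 3 sketch. The pretty-printed string and the helper parse_lines are hypothetical; the helper mimics a line-by-line loop that silently skips lines it cannot parse.

```python
import json

# Hypothetical pretty-printed object: one JSON value spread over several lines.
pretty = '{\n    "truncated": false,\n    "text": "Hey Don"\n}'


def parse_lines(document):
    """Parse each line separately, skipping lines that are not valid JSON."""
    parsed = []
    for line in document.splitlines():
        try:
            parsed.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
    return parsed


line_results = parse_lines(pretty)  # no single line is a complete JSON value
whole = json.loads(pretty)          # the document as a whole parses fine

# The same object without any newlines parses to the same dict.
compact = json.loads('{"truncated": false, "text": "Hey Don"}')

print(line_results)
print(whole == compact)
```

Because the failures are swallowed, the line-by-line version yields nothing here without reporting an error.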

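As for the Unicode problem, I suggest switching to Python 3 to avoid these issues. The json module documentation for Python 2 says the following:

If the contents of fp are encoded with an ASCII-based encoding other than UTF-8 (e.g. latin-1), then an appropriate encoding name must be specified. Encodings that are not ASCII-based (such as UCS-2) are not allowed, and should be wrapped with codecs.getreader(encoding)(fp), or simply decoded to a unicode object and passed to loads().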
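The two options named in the documentation can be sketched as follows, in Python 3. The byte string stands in for a file opened in binary mode and is assumed to be UTF-8; the sample text is invented. The last part reproduces the kind of failure in the question: encoding non-ASCII text with the 'ascii' codec, which is likely what astype(str) does implicitly under Python 2.

```python
import codecs
import io
import json

# Hypothetical raw bytes, standing in for a file opened in binary mode.
raw = '{"text": "caf\u00e9 \u2615"}'.encode("utf-8")

# Option 1: wrap the byte stream in a decoding reader and hand it to json.load.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
from_reader = json.load(reader)

# Option 2: decode to text first, then pass the text to json.loads.
from_decoded = json.loads(raw.decode("utf-8"))

text = from_reader["text"]

# Encoding non-ASCII text as ASCII fails; UTF-8 can represent it.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = text.encode("utf-8")

print(from_reader == from_decoded)
print(ascii_ok)
```

Keeping the values as text and encoding explicitly to UTF-8 only at output time avoids the implicit ASCII step.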
