Question

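I have a file of almost 1 GB that stores almost 0.2 million tweets, and the huge file size evidently brings some errors with it. The error shown is AttributeError: 'int' object has no attribute 'items'. It happens when I try to run this code.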

import json
from pandas.io.json import json_normalize

raw_data_path = input("Enter the path for raw data file: ")
tweet_data_path = raw_data_path

# Parse the file line by line, skipping lines that are not valid JSON
tweet_data = []
tweets_file = open(tweet_data_path, "r", encoding="utf-8")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweet_data.append(tweet)
    except:
        continue

tweet_data2 = [tweet for tweet in tweet_data if isinstance(tweet, dict)]

tweets = json_normalize(tweet_data2)[["text", "lang", "place.country",
                                      "created_at", "coordinates",
                                      "user.location", "id"]]

Is there a solution that skips the lines where such errors occur and continues with the remaining lines?

Answer 1

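The problem here is not the lines in the data but tweet_data itself. If you inspect your tweet_data, you will find an extra element of type 'int' in it (assuming your tweet_data is a list of dicts, since json_normalize only accepts "a dict or a list of dicts").

You may need to check your tweet data and remove any values that are not dictionaries.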
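As a small illustration of how an int can end up in tweet_data: a line that contains only a number is itself valid JSON, so json.loads returns it without raising and the try/except never skips it. The sample lines below are made up for the demonstration and are not taken from the asker's file.

```python
import json

# Hypothetical input lines; each one is valid JSON, but only the first is an object.
lines = ['{"id": 1}', '1', '"text"', '[1, 2]', 'null']

# json.loads succeeds on every line and returns a different Python type each time.
parsed_types = [type(json.loads(line)).__name__ for line in lines]
print(parsed_types)
```

Only the 'dict' entries are usable as rows for json_normalize, which is why a type check is needed in addition to the try/except.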

I was able to reproduce this with the following example from the json_normalize documentation:

Working example:

from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
              'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                     {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
              'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
       ]
json_normalize(data)

Output:

(the DataFrame is displayed)

Reproducing the error:

from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
              'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                     {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
              'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
       1  # an integer added to the list
       ]
result = json_normalize(data)

Error:

AttributeError: 'int' object has no attribute 'items'

How to prune tweet_data (not needed if you follow the update below):

Before normalizing, do the following:

tweet_data = [tweet for tweet in tweet_data if isinstance(tweet, dict)]

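Update (for the for loop):

for line in tweets_file:
    try:
        tweet = json.loads(line)
        if isinstance(tweet, dict):  # keep only JSON objects
            tweet_data.append(tweet)
    except ValueError:  # malformed line; json.JSONDecodeError subclasses ValueError
        continue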
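The same loop can be wrapped in a small helper that also counts how many lines were dropped, which may help when checking a 1 GB file. This is only a sketch: load_tweets and the in-memory sample are hypothetical names and data, not part of the original code.

```python
import io
import json

def load_tweets(lines):
    """Return (tweets, skipped): dict records and the count of dropped lines."""
    tweets, skipped = [], 0
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # blank or malformed line
            continue
        if isinstance(tweet, dict):
            tweets.append(tweet)
        else:
            skipped += 1  # valid JSON, but not an object (e.g. a bare number)
    return tweets, skipped

# Stand-in for the real file: two tweets, a blank line, a bare int, and garbage.
sample = io.StringIO('{"id": 1, "text": "hi"}\n\n1\nnot json\n{"id": 2, "text": "yo"}\n')
tweets, skipped = load_tweets(sample)
print(len(tweets), skipped)
```

With a real file, the open file object can be passed in place of the StringIO, since both yield one line at a time.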

Answer 2

The final form of the code is as follows:

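import json

tweet_data_path = raw_data_path  # path entered by the user, as in the question

tweet_data = []
with open(tweet_data_path, "r", encoding="utf-8") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            if isinstance(tweet, dict):  # keep only JSON objects
                tweet_data.append(tweet)
        except ValueError:  # malformed line; json.JSONDecodeError subclasses ValueError
            continue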

This eliminates any possibility of the attribute error that could block the import into a pandas DataFrame.
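Once only dicts remain, the column selection from the question can still fail with a KeyError if none of the tweets happens to contain one of the requested fields. One possible way around that, sketched here with made-up records, is to use reindex instead of plain indexing so that missing columns are filled with NaN. The sketch assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize.

```python
import pandas as pd

# Hypothetical records standing in for parsed tweets; note the stray int.
records = [
    {"id": 1, "text": "hello", "lang": "en", "user": {"location": "Oslo"}},
    {"id": 2, "text": "hei", "lang": "no", "place": {"country": "Norway"}},
    7,
]

wanted = ["text", "lang", "place.country", "created_at",
          "coordinates", "user.location", "id"]

# Drop non-dict entries, flatten nested fields, then keep the wanted columns.
clean = [r for r in records if isinstance(r, dict)]
tweets = pd.json_normalize(clean).reindex(columns=wanted)
print(tweets)
```

Here created_at and coordinates appear in no record, so they come back as all-NaN columns rather than raising an error.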

Skipping attribute errors when importing Twitter data into pandas
