Question

I am new to Spark. I fetched some JSON data from an API link.

import urllib2
test=urllib2.urlopen('url') 
print test

I got this

I want to save it as a table or a DataFrame. How can I do that? I am using Spark 2.0. Please guide me.

Kalyan

Answer 1

This is how I managed to import .json data from the web into a DataFrame:

from pyspark.sql import SparkSession
from urllib.request import urlopen

spark = SparkSession.builder.getOrCreate()

url = 'https://web.url'
# Download the response body and decode it to a string
jsonData = urlopen(url).read().decode('utf-8')
# Wrap the string in a one-element RDD so spark.read.json can parse it
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)
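
The fetch step above assumes the response is UTF-8. A slightly more defensive sketch could read the charset from the response headers and fall back to UTF-8. Here `fetch_json_text` is a hypothetical helper name, and the `data:` URL only stands in for a real endpoint so that the snippet runs offline.

```python
import json
from urllib.request import urlopen


def fetch_json_text(url, timeout=10):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in Content-Type, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# A data: URL stands in for the real API endpoint (assumption for the demo).
demo_url = "data:application/json;charset=utf-8,%7B%22id%22%3A%201%7D"
json_text = fetch_json_text(demo_url)

# Parse once in plain Python so a non-JSON payload fails early.
parsed = json.loads(json_text)
print(parsed)

# With an existing SparkSession named `spark` (assumed, not created here):
# df = spark.read.json(spark.sparkContext.parallelize([json_text]))
```

Checking the payload with `json.loads` before handing it to Spark tends to give a clearer error than a DataFrame containing only a `_corrupt_record` column.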

Answer 2

For this you can do some research and try using sqlContext. Here is sample code:

>>> df2 = sqlContext.jsonRDD(test)
>>> df2.first()

Also, see here for more: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html

Answer 3

Adding to Rakesh Kumar's answer, the way to do this in Spark 2.0 is:

http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources

As an example, the following creates a DataFrame based on the content of a JSON file:

    # spark is an existing SparkSession
    df = spark.read.json("examples/src/main/resources/people.json")
    # Displays the content of the DataFrame to stdout
    df.show()

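Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. For more information, see the JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.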
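
When the payload comes from a URL rather than a file, one way to satisfy that restriction is to re-serialize the response into one compact JSON string per record before handing it to Spark. The following is a minimal sketch of that idea. The helper name `to_json_lines` and the sample payload are illustrative assumptions, not part of any answer above.

```python
import json


def to_json_lines(text):
    """Turn a JSON document into a list of single-line JSON strings.

    A top-level array yields one string per element; any other
    top-level value yields a single string.
    """
    parsed = json.loads(text)
    records = parsed if isinstance(parsed, list) else [parsed]
    # json.dumps without indent never emits raw newlines, so each
    # string is one self-contained JSON value on a single line.
    return [json.dumps(record) for record in records]


# Hypothetical pretty-printed payload, standing in for the URL response.
sample = """
[
  {"name": "Michael"},
  {"name": "Andy", "age": 30}
]
"""
lines = to_json_lines(sample)
print(lines)

# With an existing SparkSession named `spark` (assumed, not created here):
# df = spark.read.json(spark.sparkContext.parallelize(lines))
```

Passing one string per record also lets Spark spread the records across partitions instead of parsing a single large string.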

How to save JSON data fetched from a URL in PySpark

3 answers: