In PySpark

Posted: 2018-05-03 21:06:53

Tags: python-2.7 apache-spark pyspark apache-spark-sql pyspark-sql

I am trying to find a specific line in a very large log file, and I am able to locate that line.

Now I want to split that line on its whitespace and build a DataFrame from the pieces, but I cannot get this to work. I tried the code below without success.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import *

conf = SparkConf().setMaster("local").setAppName("invparsing")
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

def f(x): print(x)

data_frame_schema = StructType([
    StructField("Typeof", StringType()),
    # StructField("Produt_mod", StringType()),
    # StructField("Col2", StringType()),
    # StructField("Col3", StringType()),
    # StructField("Col4", StringType()),
    # StructField("Col5", StringType()),
])

path = "C:/rk/IBMS/inv.log"

lines = sc.textFile(path)
NodeStr = lines.filter(lambda x: 'Node :RBS6301' in x).map(lambda x: x.split(" +"))
NodeStr.foreach(f)
Nodedf = sql.createDataFrame(NodeStr, data_frame_schema)
Nodedf.show(truncate=False)

Right now the output I get is just a single string. I want to split the value on whitespace.

[u'Node: RBS6301         XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)']

+--------------------------------------------------------------------+
|Typeof                                                              |
+--------------------------------------------------------------------+
|Node: RBS6301         XP10521/26   R30F   L17A.4-6   (C17.0_LSV_PS4)|
+--------------------------------------------------------------------+

Expected output:

Typeof   Produt_mod   Col2         Col3   Col4       Col5
Node     RBS6301      XP10521/26   R30F   L17A.4-6   C17.0_LSV_PS4

1 Answer:

Answer 0: (score: 2)

The first mistake you made is here:



lambda x: x.split(" +")

str.split takes a constant string, not a regular expression. To split on whitespace, you should simply omit the separator:

lines = sc.parallelize(["Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)"])

lines.map(lambda s: s.split()).first()
# ['Node:', 'RBS6301', 'XP10521/26', 'R30F', 'L17A.4-6', '(C17.0_LSV_PS4)']
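(A side note, not part of the original answer: if you actually want the regex semantics that " +" suggests, Python's re module splits on a pattern; a minimal sketch assuming the lines RDD above.)

import re

# re.split treats " +" as a regular expression, unlike str.split
lines.map(lambda s: re.split(" +", s)).first()
# ['Node:', 'RBS6301', 'XP10521/26', 'R30F', 'L17A.4-6', '(C17.0_LSV_PS4)']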

Once the line is split correctly, you just filter and convert to a DataFrame:

df = lines.map(lambda s: s.split()).filter(lambda x: len(x) == 6).toDF(
    ["col1", "col2", "col3", "col4", "col5", "col6"]
)

df.show()
# +-----+-------+----------+----+--------+---------------+
# | col1|   col2|      col3|col4|    col5|           col6|
# +-----+-------+----------+----+--------+---------------+
# |Node:|RBS6301|XP10521/26|R30F|L17A.4-6|(C17.0_LSV_PS4)|
# +-----+-------+----------+----+--------+---------------+
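(Editor's sketch, not part of the original answer: to match the exact expected output from the question, one further cleanup step could strip the trailing colon and the parentheses and rename the columns to the question's schema names. The regex patterns here are assumptions about the log format.)

from pyspark.sql.functions import regexp_replace

# Assumed cleanup: drop the ":" after "Node" and the "(...)" around the last
# field, then rename the columns to the names used in the question.
clean = (df
    .withColumn("col1", regexp_replace("col1", ":$", ""))
    .withColumn("col6", regexp_replace("col6", r"[()]", ""))
    .toDF("Typeof", "Produt_mod", "Col2", "Col3", "Col4", "Col5"))

clean.show()
# +------+----------+----------+----+--------+-------------+
# |Typeof|Produt_mod|      Col2|Col3|    Col4|         Col5|
# +------+----------+----------+----+--------+-------------+
# |  Node|   RBS6301|XP10521/26|R30F|L17A.4-6|C17.0_LSV_PS4|
# +------+----------+----------+----+--------+-------------+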