I have a simple text file containing "transactions".
The first line holds the column names, e.g. "START_TIME", "END_TIME", "SIZE"... roughly ~100 column names.
The column names in the file are without quotes.
I want to use Spark to turn this file into a dataframe, with the column names taken from the header,
and then drop all of the columns from the file except certain specific ones.
I'm having some trouble converting the text file into a dataframe.
Here's my code so far:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# Load relevant objects
sc = SparkContext('local')
log_txt = sc.textFile("/path/to/text/file.txt")
sqlContext = SQLContext(sc)
# Construct fields with names from the header, for creating a DataFrame
header = log_txt.first()
fields = [StructField(field_name, StringType(), True)
          for field_name in header.split(',')]
# Only columns/fields 2, 3, 13, 92 are relevant; set them to the relevant types
fields[2].dataType = TimestampType() # START_TIME in yyyymmddhhmmss format
fields[3].dataType = TimestampType() # END_TIME in yyyymmddhhmmss
fields[13].dataType = IntegerType() # DOWNSTREAM_SIZE, in bytes
fields[92].dataType = BooleanType() # IS_CELL_CONGESTED, 0 or 1
schema = StructType(fields) # Create a schema object
# Build the DataFrame
log_txt = log_txt.filter(lambda line: line != header) # Remove header from the txt file
temp_var = log_txt.map(lambda k: k.split("\t"))
log_df = sqlContext.createDataFrame(temp_var, schema) # PROBLEMATIC LINE
The problem I'm having is with that last line, and I'm worried I'm missing some steps before the final one.
Can you help me figure out which steps are missing?
The last line of code produces a lot of errors. I'll add them to the post if needed.
The file format is (a 2-row example):
TRANSACTION_URL,RESPONSE_CODE,START_TIME,END_TIME,.... <more names>
http://www.google.com<\t separator>0<\t separator>20160609182001<\t separator>20160609182500.... <more values>
http://www.cnet.com<\t separator>0<\t separator>20160609192001<\t separator>20160609192500.... <more values>
Also, can someone help me remove the unneeded columns from the dataframe once it's built?
Thanks
Answer 0 (score: 8)
I think you're overthinking it a little. Imagine we have something less complicated, for example the file below
`cat sample_data.txt`
field1\tfield2\tfield3\tfield4
0\tdog\t20160906182001\tgoogle.com
1\tcat\t20151231120504\tamazon.com
sc.setLogLevel("WARN")
#setup the same way you have it
log_txt=sc.textFile("/path/to/data/sample_data.txt")
header = log_txt.first()
#filter out the header, make sure the rest looks correct
log_txt = log_txt.filter(lambda line: line != header)
log_txt.take(10)
[u'0\\tdog\\t20160906182001\\tgoogle.com', u'1\\tcat\\t20151231120504\\tamazon.com']
temp_var = log_txt.map(lambda k: k.split("\\t"))
#here's where the changes take place
#this creates a dataframe using whatever pyspark feels like using (I think string is the default). the header.split is providing the names of the columns
log_df=temp_var.toDF(header.split("\\t"))
log_df.show()
+------+------+--------------+----------+
|field1|field2| field3| field4|
+------+------+--------------+----------+
| 0| dog|20160906182001|google.com|
| 1| cat|20151231120504|amazon.com|
+------+------+--------------+----------+
#note log_df.schema
#StructType(List(StructField(field1,StringType,true),StructField(field2,StringType,true),StructField(field3,StringType,true),StructField(field4,StringType,true)))
# now lets cast the columns that we actually care about to dtypes we want
log_df = log_df.withColumn("field1Int", log_df["field1"].cast(IntegerType()))
log_df = log_df.withColumn("field3TimeStamp", log_df["field1"].cast(TimestampType()))
log_df.show()
+------+------+--------------+----------+---------+---------------+
|field1|field2| field3| field4|field1Int|field3TimeStamp|
+------+------+--------------+----------+---------+---------------+
| 0| dog|20160906182001|google.com| 0| null|
| 1| cat|20151231120504|amazon.com| 1| null|
+------+------+--------------+----------+---------+---------------+
log_df.schema
StructType(List(StructField(field1,StringType,true),StructField(field2,StringType,true),StructField(field3,StringType,true),StructField(field4,StringType,true),StructField(field1Int,IntegerType,true),StructField(field3TimeStamp,TimestampType,true)))
#now let's filter out the columns we want
log_df.select(["field1Int","field3TimeStamp","field4"]).show()
+---------+---------------+----------+
|field1Int|field3TimeStamp| field4|
+---------+---------------+----------+
| 0| null|google.com|
| 1| null|amazon.com|
+---------+---------------+----------+
A dataframe needs a type for every field it encounters; whether you actually use that field is up to you. You'll have to use one of the spark.sql functions to convert the date strings into actual timestamps, but that shouldn't be too hard.
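As a minimal sketch, one way to do that conversion is unix_timestamp (available since Spark 1.5; on Spark 2.2+ to_timestamp would also work), which can parse the yyyyMMddHHmmss strings so the result can then be cast to a real timestamp. The column name field3 just matches the toy example above; adapt it to your own columns.
from pyspark.sql import functions as F
# parse the yyyyMMddHHmmss string into seconds since the epoch, then cast to a timestamp
log_df = log_df.withColumn("field3TimeStamp",
                           F.unix_timestamp(log_df["field3"], "yyyyMMddHHmmss").cast("timestamp"))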
Hope this helps.
PS: For your specific situation, to build the initial dataframe, try: log_df=temp_var.toDF(header.split(','))
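Putting the pieces together for your file (header row split on ',', data rows split on a real tab character), a rough sketch might look like the following. The column names and the yyyyMMddHHmmss format are taken from your post; the rest (in particular turning the 0/1 flag into a boolean via a comparison rather than a cast) is an assumption, not a definitive implementation.
from pyspark.sql import functions as F
header = log_txt.first()
log_txt = log_txt.filter(lambda line: line != header)   # drop the header row
temp_var = log_txt.map(lambda k: k.split("\t"))          # data rows are tab separated
log_df = temp_var.toDF(header.split(","))                # header row is comma separated
# keep only the columns you care about, casting them on the way
log_df = log_df.select(
    F.unix_timestamp("START_TIME", "yyyyMMddHHmmss").cast("timestamp").alias("START_TIME"),
    F.unix_timestamp("END_TIME", "yyyyMMddHHmmss").cast("timestamp").alias("END_TIME"),
    log_df["DOWNSTREAM_SIZE"].cast("integer").alias("DOWNSTREAM_SIZE"),
    (log_df["IS_CELL_CONGESTED"] == "1").alias("IS_CELL_CONGESTED"))  # 0/1 flag -> boolean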