Question

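I have a dataset of 100k unique records that I use for benchmarking code. I now need to test with 5 million unique records, and I don't want to generate purely random data. I'd like to use my 100k records as a base dataset and generate the remaining data so that it resembles the base, with unique values in certain columns. How can I do this with Python or Scala?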

Here is some sample data:

latitude   longitude  step count
25.696395   -80.297496  1   1
25.699544   -80.297055  1   1
25.698612   -80.292015  1   1
25.939942   -80.341607  1   1
25.939221   -80.349899  1   1
25.944992   -80.346589  1   1
27.938951   -82.492018  1   1
27.944691   -82.48961   1   3
28.355484   -81.55574   1   1

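Each latitude/longitude pair should be unique in the generated data, and I should be able to set minimum and maximum values for these columns.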

Answer 1

You can easily generate normally distributed data with R by following these steps:

#Read the data into a dataframe
library(data.table)
data = fread("data.csv", sep=",", select = c("latitude", "longitude"))

#Remove duplicate and null values
df = data.frame("Lat"=data$"latitude", "Lon"=data$"longitude")
df1 = unique(df[1:2])
df2  <- na.omit(df1)

#Determine the mean and standard deviation of latitude and longitude values
meanLat = mean(df2$Lat)
meanLon = mean(df2$Lon)
sdLat = sd(df2$Lat)
sdLon = sd(df2$Lon)

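#Use a normal distribution to generate 1 million new records.
#sum(runif(12)) - 6 has mean 0 and variance 1, so it approximates a standard
#normal draw, which is then scaled by the standard deviation and shifted by the mean.

newData = list()
newData$Lat = sapply(rep(0, 1000000), function(x) (sum(runif(12))-6) * sdLat + meanLat)
newData$Lon = sapply(rep(0, 1000000), function(x) (sum(runif(12))-6) * sdLon + meanLon)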

finalData = rbind(df2,newData)

finalData now contains both the old records and the new records.

Write the finalData data frame to a CSV file, which you can then read from Scala or Python.
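Since the question asks for Python, here is a rough sketch of the same idea there: fit a mean and standard deviation to the base coordinates, sample from the fitted normal distributions, and keep only pairs that are unique and inside the requested bounds. The function name `generate_similar`, its parameters, and the rounding to 6 decimal places are my own assumptions for illustration, not part of the R answer above. Note that the loop never ends if the bounds exclude practically everything the fitted distribution produces.

```python
import random
import statistics

# The sample (latitude, longitude) pairs from the question.
base_pairs = [
    (25.696395, -80.297496), (25.699544, -80.297055), (25.698612, -80.292015),
    (25.939942, -80.341607), (25.939221, -80.349899), (25.944992, -80.346589),
    (27.938951, -82.492018), (27.944691, -82.48961), (28.355484, -81.55574),
]

def generate_similar(base, n_new, lat_bounds=(-90.0, 90.0),
                     lon_bounds=(-180.0, 180.0), seed=0):
    """Return n_new unique (lat, lon) pairs sampled from normal distributions
    fitted to `base`, none of which repeats a pair already in `base`."""
    rng = random.Random(seed)
    lats = [p[0] for p in base]
    lons = [p[1] for p in base]
    mean_lat, sd_lat = statistics.mean(lats), statistics.stdev(lats)
    mean_lon, sd_lon = statistics.mean(lons), statistics.stdev(lons)

    seen = set(base)   # pairs that must not be produced again
    out = []
    while len(out) < n_new:
        # Assumption: 6 decimal places, matching the sample data.
        lat = round(rng.gauss(mean_lat, sd_lat), 6)
        lon = round(rng.gauss(mean_lon, sd_lon), 6)
        if not (lat_bounds[0] <= lat <= lat_bounds[1]):
            continue   # outside the requested latitude range
        if not (lon_bounds[0] <= lon <= lon_bounds[1]):
            continue   # outside the requested longitude range
        pair = (lat, lon)
        if pair in seen:
            continue   # duplicate pair, draw again
        seen.add(pair)
        out.append(pair)
    return out

generated = generate_similar(base_pairs, 1000,
                             lat_bounds=(25.0, 29.0), lon_bounds=(-83.0, -80.0))
```

For 5 million rows the `seen` set holds every pair in memory, which should be manageable on a typical machine; a vectorised library would be faster, but this keeps the logic visible.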

Answer 2

If you only want to generate the data in Scala, try it this way.

val r = new scala.util.Random   // create a Scala random-number generator
val new_val = r.nextFloat()     // each call returns the next random float in [0.0, 1.0)

Then add this new_val to the maximum latitude in your data. A unique latitude makes the whole pair unique anyway.
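For anyone who prefers Python, the same trick might look roughly like this. It is only a sketch: the helper name `latitudes_above_max` and the rounding to 6 decimal places are assumptions of mine. An explicit `seen` set is added because random draws alone do not guarantee uniqueness.

```python
import random

def latitudes_above_max(existing_lats, n_new, seed=0):
    """Return n_new distinct latitudes, all strictly greater than the
    largest latitude in existing_lats."""
    rng = random.Random(seed)
    max_lat = max(existing_lats)
    seen = set()
    out = []
    while len(out) < n_new:
        # Shift past the current maximum by a random offset in [0, 1).
        # Assumption: 6 decimal places, matching the sample data.
        lat = round(max_lat + 0.000001 + rng.random(), 6)
        if lat in seen:
            continue   # already generated, draw again
        seen.add(lat)
        out.append(lat)
    return out

existing = [25.696395, 25.699544, 25.698612, 27.944691, 28.355484]
new_lats = latitudes_above_max(existing, 10_000)
```

With 6 decimal places and an offset below 1 there are only about a million distinct values available, so this form cannot reach 5 million rows on its own; you would need a wider offset range, more decimal places, or varied longitudes as well.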

You can try this option using Spark with Scala.

val latLongDF = ss.read.option("header", true).option("delimiter", ",").format("csv").load(mypath)   // loaded your sample data in your question as Dataframe
+---------+----------+----+-----+
| latitude| longitude|step|count|
+---------+----------+----+-----+
|25.696395|-80.297496|   1|    1|
|25.699544|-80.297055|   1|    1|
|25.698612|-80.292015|   1|    1|


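import org.apache.spark.sql.functions.{col, max, udf} // needed for max, udf and col below

// The CSV is read without a schema, so cast to double to get the numeric (not lexicographic) maximum
val max_lat = latLongDF.select(max(col("latitude").cast("double"))).first.get(0).toString().toDouble // max latitude value

val r = new scala.util.Random // scala random object to get random numbers

val nextLat = udf(() => (max_lat + 0.000001 + r.nextFloat()).toFloat) // udf to give a random latitude greater than the existing maximum latitude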

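In the line above, toFloat rounds the value to single precision, which can produce duplicate values. If your latitudes have more than 6 decimal places, remove the toFloat call to keep the full random value. Alternatively, apply the same method to the longitude as well for better uniqueness.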
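To make the rounding concern concrete, here is a small Python experiment (my own illustration, not part of the Spark code). It rounds 64-bit values near the sample's maximum latitude to 32-bit floats and counts how many stay distinct. A 32-bit float near 28 has a spacing of 2^-19 (about 0.0000019), so an interval of width 1 holds only about 524,000 representable values, and drawing 600,000 samples is guaranteed to produce collisions.

```python
import random
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

rng = random.Random(0)
n = 600_000

# Random latitudes just above the sample maximum, as in the udf above.
as_double = [28.355484 + 0.000001 + rng.random() for _ in range(n)]
as_float = [to_float32(x) for x in as_double]

unique_double = len(set(as_double))
unique_float = len(set(as_float))
print("distinct as double:", unique_double)
print("distinct as float: ", unique_float)
```

Keeping the value as a double, or checking for duplicates after generation, avoids this loss.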

val new_df = latLongDF.withColumn("new_lat", nextLat()).select(col("new_lat").alias("latitude"),$"longitude",$"step",$"count").union(latLongDF) // creating new dataframe and Union with existing dataframe

Sample of the newly generated data:

+----------+----------+----+-----+
|  latitude| longitude|step|count|
+----------+----------+----+-----+
| 28.446129|-80.297496|   1|    1|
| 28.494934|-80.297055|   1|    1|
| 28.605234|-80.292015|   1|    1|
| 28.866316|-80.341607|   1|    1|

Generating data using an existing dataset as the base dataset
