Question

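I'm a beginner in the Spark world, so my problem may look simple, but I can't solve it. I loaded a file into a pandas DataFrame and the result is correct. However, when I load the same file into a Spark DataFrame, I get many more rows! It looks as if the records are being split or mixed up. Can anyone help me?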

My code:

 # File location and type
 file_location = "listings.csv"
 # Obtain dataset
 df = spark.read.csv(file_location, header='true',
 inferSchema='false', sep=',')
 df.count()

 157849

 import pandas as pd
 ls = pd.read_csv('listings.csv',low_memory=False)
 ls.count()

 id                                              65493
 listing_url                                     65493
 scrape_id                                       65493
 last_scraped                                    65493
 name                                            65426
                                            ...  
 calculated_host_listings_count                  65493
 calculated_host_listings_count_entire_homes     65493
 calculated_host_listings_count_private_rooms    65493
 calculated_host_listings_count_shared_rooms     65493
 reviews_per_month                               52602  
 Length: 106, dtype: int64

 ls.shape
 (65493, 106)

Answer 1

Can you share listings.csv? Also, try changing the code to:

 # File location and type
 file_location = "listings.csv"
 # Obtain dataset
 df = spark.read.csv(file_location, header=True,
 inferSchema=True)
 df.count()
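
One possible cause worth checking, assuming (this is not confirmed by the question) that listings.csv contains quoted text fields with embedded line breaks: a reader that splits the file on every newline will count more rows than a quote-aware parser such as pandas. The sketch below uses a small made-up sample rather than the real file, and the helper names are hypothetical. The last function shows the Spark reader options that are usually tried in this situation; it is not called here because it needs a running Spark session.

```python
import csv
import io

# Hypothetical sample: 2 data records, the second one has a line break
# inside a quoted field (an assumption about what listings.csv looks like).
SAMPLE = (
    'id,name,description\n'
    '1,Flat A,"Cozy, central"\n'
    '2,Flat B,"Bright room.\nClose to the metro."\n'
)


def count_physical_lines(text):
    """Count newline-delimited lines, excluding the header line."""
    return len(text.splitlines()) - 1


def count_csv_records(text):
    """Count records with a quote-aware CSV parser, excluding the header."""
    reader = csv.reader(io.StringIO(text, newline=''))
    next(reader)  # skip the header record
    return sum(1 for _ in reader)


def read_csv_multiline(spark, path):
    """Sketch: ask Spark to keep quoted line breaks inside one record.

    `spark` is an existing SparkSession; escape='"' assumes quotes inside
    fields are escaped by doubling them, which may not match your file.
    """
    return spark.read.csv(path, header=True, multiLine=True,
                          quote='"', escape='"')


if __name__ == "__main__":
    print(count_physical_lines(SAMPLE))  # 3
    print(count_csv_records(SAMPLE))     # 2
```

If the two counts differ on the real file as well, calling `read_csv_multiline(spark, file_location).count()` and comparing the result with `ls.shape[0]` from pandas would show whether embedded line breaks explain the extra rows.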
