PySpark does not read CSV correctly

Asked: 2019-05-22 14:06:29

Tags: python pandas csv pyspark

I am saving data from a Pandas dataframe with 318477 rows to a csv file using df.to_csv("preprocessed_data.csv"). When I load the file in another notebook with:

import pandas as pd

df = pd.read_csv("preprocessed_data.csv")
len(df)

# out: 318477

the row count is as expected. However, when I try to load the dataset with PySpark:

# Parentheses let the method chain span several lines
spark_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("preprocessed_data.csv")
)
spark_df.count()

# out: 6422020

df_test = spark.sql("SELECT * FROM csv.`preprocessed_data.csv`")
df_test.count()

# out: 6422020

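the row count is wrong. The 6422020 that Spark reports is the number of physical lines in the csv file, because some rows have content that spans multiple lines (see https://imgur.com/a/qWd9jtq).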
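
To illustrate the mismatch, here is a minimal sketch that uses only Python's standard csv module. The three-column sample is made up for illustration and is not the real file; it only shows that counting physical lines and counting parsed records give different numbers when a quoted field contains newlines.

```python
import csv
import io

# Hypothetical sample: one header line and two records. The second record
# has a quoted field with two embedded newlines and a doubled quote ("").
sample = (
    "id,title,description\n"
    "120,teacher,plain text\n"
    '121,fabricator,"<p>line one</p>\n'
    "<p>line two</p>\n"
    '<p>say ""hi""</p>"\n'
)

# Naive count: every newline-terminated line, which is what a
# line-by-line reader effectively sees.
physical_lines = len(sample.splitlines())

# Quote-aware count: csv.reader keeps a quoted field together even when
# it spans several lines.
records = list(csv.reader(io.StringIO(sample)))
logical_rows = len(records) - 1  # exclude the header

print("physical lines:", physical_lines)
print("logical rows:", logical_rows)
```

A reader that splits on newlines before looking at quotes sees every physical line as a candidate row, which is consistent with the inflated count above.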

How can I fix this? Do I need to save the csv so that no text field contains a newline, or can I configure the PySpark csv reader more specifically?

This is a follow-up to my previous question (link), now that I understand the problem better.

Rows from the CSV file:

120,teacher industrial design technology mabel park state high school,teach queensland,2018-10-07,brisbane,southern suburbs logan,education training,teaching secondary,mabel park state high school invites applications for a industrial design and technology teacher,,0,30,,0.0,0.03003003003003003
121,fabricatorinstaller,workplace access safety,2018-10-07,melbourne,bayside south eastern suburbs,trades services,welders boilermakers,trade qualified person with skills in welding and fabrication to assist in the manufacturing and installation of our custom height safety products,"<p>&nbsp;</p>
        <p><strong><em>*&nbsp; Secure long term role with genuine career path to supervisor</em></strong></p>
        <p><strong><em>*&nbsp; Competitive hourly rate with regular opportunity for overtime</em></strong></p>
        <p><strong><em>*&nbsp; Full on-the-job training</em></strong></p>
        <p><strong>About the&nbsp;role</strong></p>
        <p>Having recently won a significant new national contract we are looking for another trade qualified person with welding and fabrication skills to help manage increased demands on our production and installation departments.&nbsp; This role will
          see you involved in both manufacturing and on-site installation and there is a genuine career path to supervisor if that is your goal.&nbsp; Initially your role will require you to:-</p>
        <ul>
          <li>read and interpret drawings&nbsp;</li>
          <li>fabricate and assemble orders as required</li>
          <li>provide input to enhance factory processes</li>
          <li>pack&nbsp;and dispatch orders</li>
          <li>perform on-site installations (full training will be given)</li>
        </ul>
        <p><strong>About you</strong></p>
        <p>This role is ideal for a trade qualified person&nbsp;(welder, boilermaker, fabricator etc) with good hands-on skills who will enjoy&nbsp;dividing their time between&nbsp;factory/manufacturing and on-site installations.&nbsp; Because installations
          invariably take place on the roof, physical fitness is&nbsp;essential.</p>
        <p><strong>What we offer</strong></p>
        <ul>
          <li>A secure, long-term role with a successful, well-established organisation</li>
          <li>Full, ongoing on-the-job training</li>
          <li>Opportunity for career progression to supervisor for the right person</li>
          <li>Opportunity to work&nbsp;in a safe, supportive and friendly environment</li>
          <li>Competitive hourly rate with regular opportunities for overtime</li>
          <li>Occasional regional and interstate travel in response to major projects</li>
        </ul>
        <p><strong>How to apply</strong></p>
        <p>Please copy and paste the URL below into your browser (it is <em>not</em> a live link so&nbsp;must be copied and pasted).&nbsp; This will take you to our custom online application form which includes a number of screening questions&nbsp;and a
          profiling checklist which is an essential part of our application process.</p>
        <p><strong>https://exenet.expr3ss.com/jobDetails?selectJob=296&amp;</strong></p>
        <p>If you have any difficulties or would like more information please email <a class=""_2L3qcJ0"" data-contact-match=""true"" href=""mailto:gayle@exhr.com.au"">gayle@exhr.com.au</a> or phone <a class=""_2hhDNI-"" data-contact-match=""true"" href=""tel:0468 336 224"">0468 336 224</a>.</p>",0,30,full time,0.0,0.03003003003003003
122,boilermaker,rpm contracting qld pl,2018-10-07,brisbane,southern suburbs logan,trades services,welders boilermakers,perm rate 30 structural steel fab weld out located southside full time hours ongoing work ot modern clean facility offering great conditions,"<p>One of Australia's best engineering workshops is hiring!</p>
        <p>They have ongoing, rolling projects and need good people now.</p>
        <p>They are partnered with state and federal governments, international minerals and energy companies, and other market leading entities.</p>
        <p>The workshop is state of the art, clean, and well-managed. There is a genuine focus on the safety and wellbeing of their people.</p>
        <p>The facility and conditions are truly exceptional.</p>
        <p>Secure and long term positions are on offer for forward-thinking, cooperative and professional tradesmen.</p>
        <p>We are looking for qualified and/or ticketed boilermakers and 1st class welders that can offer high level trade skills.</p>
        <p>Equally important is a cooperative, team-orientated attitude and a willingness to become involved and take ownership of their important role in this company.</p>
        <p>They are building on a stable, permanent team, so candidates who step up can look forward to a secure future.</p>
        <p>The position is ongoing, offering full-time hours, exceptional conditions, and penalties.</p>
        <p>You require own car and licence, PPE and tools, relevant experience and to be available for an immediate start.</p>
        <p>Good luck and kind regards,</p>
        <p>RPM</p>",0,30,full time,0.0,0.03003003003003003



2 Answers:

Answer 0 (score: 1)

Based on the sample you provided, I tried the following code, which returned 3 rows:

>>> df = spark.read.csv('file:///tmp/test.csv', sep=',', multiLine=True)
>>> df.count()
3

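If that still does not work for you, I would try forcing pandas to use explicit quoting and an explicit delimiter when writing the file.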
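
A sketch of what that could look like, assuming the file is written by pandas and read back with Spark 2.2 or later (where the multiLine option is available). The helper names save_for_spark and load_multiline_csv are hypothetical, and spark is assumed to be an existing SparkSession. The escape='"' setting is an addition that is not in the code above: pandas escapes a quote inside a quoted field by doubling it, whereas Spark's CSV reader uses a backslash as its escape character by default.

```python
import csv


def save_for_spark(pdf, path):
    """Write a pandas DataFrame with every field quoted (hypothetical helper)."""
    # QUOTE_ALL wraps every field in quotes, so embedded newlines and commas
    # are always inside a quoted field. Quotes inside a field are doubled,
    # which is the pandas default (doublequote=True).
    pdf.to_csv(path, sep=",", quoting=csv.QUOTE_ALL)


def load_multiline_csv(spark, path):
    """Read a CSV whose quoted fields may span lines (hypothetical helper)."""
    return spark.read.csv(
        path,
        header=True,
        sep=",",
        quote='"',
        escape='"',      # assumption: treat "" inside a quoted field as a literal quote
        multiLine=True,  # let a quoted field continue across physical lines
    )
```

With real objects this would be called as save_for_spark(df, "preprocessed_data.csv") followed by load_multiline_csv(spark, "preprocessed_data.csv").count(). If the multi-line fields are the only cause of the mismatch, that count should match len(df), but I have not verified it against the full dataset.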

Answer 1 (score: 1)

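This is caused by the PySpark installation on a Windows machine. The problem occurs when more than one PySpark instance is installed on your system. Reinstalling PySpark should resolve it.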