Question

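I am trying to read a CSV into an RDD (Spark) using Python. The problem is that I am using the split function with a comma as the delimiter. This works fine as long as no column contains a comma. If a column does contain one, the comma splits that column into multiple columns.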

For example:

empid, emp title, emp desc, college
123, developer, the role of developer is to develop softwares using languages such as C, C++ etc, college1


In the example above, emp desc is split so that part of it ends up in college. How can I handle commas inside a column when reading the dataset?

Answer 1

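Without additional information, it is impossible to know which commas are delimiters and which are not. Your best option is probably to change the delimiter, or to make sure that all non-delimiter commas are "escaped" in some way.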
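One common way of escaping is to wrap any field that contains a comma in double quotes. The sketch below assumes the file can be produced that way, which the question does not state; the helper name `parse_quoted` is made up for illustration. It relies on Python's standard `csv` module rather than on `split`.

```python
import csv

def parse_quoted(line):
    """Parse one CSV line whose comma-containing fields are double-quoted."""
    # skipinitialspace=True ignores the blank after each delimiter,
    # so a quote that follows ", " is still recognised as a quote.
    return next(csv.reader([line], skipinitialspace=True))

# Hypothetical input: the same record as in the question, with emp desc quoted.
line = '123, developer, "the role of developer is to develop softwares using languages such as C, C++ etc", college1'
print(parse_quoted(line))
```

If the data cannot be quoted, the same function also covers the "change the delimiter" option by passing a different `delimiter` argument to `csv.reader`.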

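Solution using escaping:

Provided that every non-delimiter comma is prefixed with some escape sequence, for example "\,", you can split on commas and then re-join any piece that ends with the escape character \.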

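line = '123, developer, the role of developer is to develop softwares using languages such as C\\, C++ etc, college1'

# Split on every comma, then re-join the pieces that were cut at an escaped comma.
temp = line.strip().split(',')

i = 0
while i < len(temp):
    if temp[i].endswith('\\') and i + 1 < len(temp):
        # The comma after this piece was escaped: merge it with the next piece.
        temp[i:i+2] = [','.join(temp[i:i+2])]
    else:
        # The piece is complete: drop the escape characters and move on.
        temp[i] = temp[i].replace('\\,', ',')
        i += 1

empid, emp_title, emp_desc, college = temp
print('empid: '+empid+'\nemp_title: '+emp_title+'\nemp_desc: '+emp_desc+'\ncollege: '+college)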

Output:

empid: 123
emp_title:  developer
emp_desc:  the role of developer is to develop softwares using languages such as C, C++ etc
college:  college1
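The same escaping idea can be packaged as a function so that it can be applied to every line of an RDD. This is a minimal sketch, not part of the original answer: the name `split_escaped` and its parameters are made up for illustration, it uses a regular-expression lookbehind instead of the merge loop, and unlike the loop it also trims the whitespace around each field.

```python
import re

def split_escaped(line, sep=',', esc='\\'):
    """Split line on sep, skipping any sep that is preceded by esc."""
    # Negative lookbehind: match sep only when the previous character is not esc.
    pattern = r'(?<!' + re.escape(esc) + r')' + re.escape(sep)
    parts = re.split(pattern, line.strip())
    # Remove the escape characters and trim surrounding whitespace.
    return [p.replace(esc + sep, sep).strip() for p in parts]

line = '123, developer, the role of developer is to develop softwares using languages such as C\\, C++ etc, college1'
fields = split_escaped(line)
print(fields)
```

In Spark, a function like this could be passed to `map` on the RDD returned by `sc.textFile`, for example `sc.textFile(path).map(split_escaped)`, where `path` stands in for the location of your file.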

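Solution using additional information:

If, on the other hand, you cannot escape the non-delimiter commas for some reason, your next best option is to impose extra information. For example, if you have reason to believe that only the emp_desc variable will contain non-delimiter commas, you can always do the following: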

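temp = line.strip().split(",")
empid = temp[0]
emp_title = temp[1]
# Everything between the first two fields and the last one belongs to emp_desc;
# join the pieces back together so emp_desc is a string rather than a list.
emp_desc = ",".join(temp[2:-1])
college = temp[-1]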
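The positional approach can also be written with the `maxsplit` argument of `str.split` and `str.rsplit`, which avoids splitting emp_desc in the first place. The sketch below assumes, as above, that only emp_desc may contain commas and that every line has at least four fields; the function name `split_with_free_text` is hypothetical.

```python
def split_with_free_text(line):
    """Split a 4-column line in which only the third column may contain commas."""
    # Take the first two fields from the left; the remainder is left unsplit.
    empid, emp_title, rest = line.strip().split(',', 2)
    # Take the last field from the right; what remains is emp_desc.
    emp_desc, college = rest.rsplit(',', 1)
    return [f.strip() for f in (empid, emp_title, emp_desc, college)]

line = '123, developer, the role of developer is to develop softwares using languages such as C, C++ etc, college1'
print(split_with_free_text(line))
```

A line with fewer than four fields raises a ValueError during unpacking, so malformed rows would need to be filtered out or handled separately.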
