Printing non-ASCII column values in Python (Spark)

Time: 2018-11-10 14:22:54

Tags: python pyspark ascii

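I'm very new to Python and Spark. I've written a UDF that removes the non-ASCII characters present in a string.

What is the most efficient way to make it also print the offending values as part of that operation? (An offending value is a cell that contains non-ASCII characters.)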

Code:

for(country in unique(Plot_df$country)) {

  # YOU NEVER *REALLY* USE THIS VECTOR JUST ONE ELEMENT FROM IT

  # Color settings: colorblind-friendly palette
  c(
    "#999999", "#E69F00", "#56B4E9", "#009E73", 
    "#F0E442", "#0072B2", "#D55E00", "#CC79A7"
  ) -> cols

  # carve out the data for the plot
  country_df <- Plot_df[Plot_df$country == country,]

  # Plotting code where DATA, YEAR, etc need to be handed the right vectors
  ggplot() +
    geom_line(
      data = country_df, # THIS IS WHAT YOU FORGOT
      aes(year, emissions, group = sector), 
      color = cols[1] # WHY [1] IF YOU DEFINED A VECTOR
    ) +
    xlim(1990, 2016) + # SHOULD LIKELY USE scale_x_… and set limits there + expand=c(0,0) instead
    ylim(-50, 50) + # SAME
    labs(
      x = "Year", y = "CO2 emissions", 
      title = sprintf("Emissions for %s", country)
    ) + 
    theme(plot.margin = margin(.5, .5, .5, .5, "cm")) -> p # THERE IS A margin() function

  print(p) # it won't print without print()

  # Save plot, where the file name automatically gets a country name suffix
  ggsave(
    plot = p, 
    filename = sprintf("./FILENAME-%s.png", country), # I PREFER sprintf
    width = 6.5, 
    height = 6
  )
}

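1 Answer:

Answer 0: (score: -1)

A simple solution that works in most cases is to run a computation that keeps only the values containing non-ASCII characters: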

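from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Python 3.7+: str.isascii() is available
def check_ascii(string):
    # Return the value only if it contains a non-ASCII character, else None
    if string is not None and not string.isascii():
        return string
    return None

# Python 2 (or Python 3 before 3.7): compare code points manually
def check_ascii_in_python_2(string):
    if string is not None and not all(ord(char) < 128 for char in string):
        return string
    return None

# A plain Python function cannot be applied to a Column; wrap it as a UDF first
check_ascii_udf = udf(check_ascii, StringType())

all_strings_with_non_ascii_chars = (
    df.select('_c1', '_c2')
    .withColumn('check', check_ascii_udf(df._c1))
    .filter('check is not null')
    .select('check')
)
all_strings_with_non_ascii_chars.show()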
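
The same check can also be written as a regular expression. The sketch below is plain Python with no Spark dependency. The names `NON_ASCII`, `values_with_non_ascii`, and `sample` are hypothetical and exist only for illustration. It shows which values the pattern `[^\x00-\x7F]` flags. The same pattern could probably be passed to `Column.rlike` in PySpark (for example `df.filter(df._c1.rlike(r'[^\x00-\x7F]'))`) to avoid the overhead of a Python UDF, though that variant is untested here.

```python
import re

# Matches any single character outside the 7-bit ASCII range (code points 0-127).
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def values_with_non_ascii(values):
    """Return the values that contain at least one non-ASCII character."""
    return [v for v in values if v is not None and NON_ASCII.search(v)]

# Hypothetical sample column values, for illustration only.
sample = ['plain text', 'café', None, 'naïve', '12345']
bad_values = values_with_non_ascii(sample)
print(bad_values)
```

None values are skipped explicitly, mirroring how a null cell should not be reported as an offending value.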