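I am quite new to Python and Spark. I have written a UDF to remove the non-ASCII characters present in a string.
What is the most efficient way to make it also print the erroneous values along with the operation? (The erroneous values are the cells that contain non-ASCII characters.)
Code: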
for(country in unique(Plot_df$country)) {
  # YOU NEVER *REALLY* USE THIS VECTOR JUST ONE ELEMENT FROM IT
  # Color settings: colorblind-friendly palette
  c(
    "#999999", "#E69F00", "#56B4E9", "#009E73",
    "#F0E442", "#0072B2", "#D55E00", "#CC79A7"
  ) -> cols
  # carve out the data for the plot
  country_df <- Plot_df[Plot_df$country == country,]
  # Plotting code where DATA, YEAR, etc need to be handed the right vectors
  ggplot() +
    geom_line(
      data = country_df, # THIS IS WHAT YOU FORGOT
      aes(year, emissions, group = sector),
      color = cols[1] # WHY [1] IF YOU DEFINED A VECTOR
    ) +
    xlim(1990, 2016) + # SHOULD LIKELY USE scale_x_… and set limits there + expand=c(0,0) instead
    ylim(-50, 50) + # SAME
    labs(
      x = "Year", y = "CO2 emissions",
      title = sprintf("Emissions for %s", country)
    ) +
    theme(plot.margin = margin(.5, .5, .5, .5, "cm")) -> p # THERE IS A margin() function
  print(p) # it won't print without print()
  # Save plot, where the file name automatically gets a country name suffix
  ggsave(
    plot = p,
    filename = sprintf("./FILENAME-%s.png", country), # I PREFER sprintf
    width = 6.5,
    height = 6
  )
}
Answer 0: (score: -1)
A simple solution that works in most cases is to run a check along the following lines:
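# in python 3 (str.isascii() requires Python 3.7+)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def check_ascii(string):
    # Return the string if it contains a non-ASCII character, otherwise None
    if string is not None and not string.isascii():
        return string
    else:
        return None

def check_ascii_in_python_2(string):
    # Same check for Python 2, where str.isascii() is not available
    if string is not None and not all(ord(char) < 128 for char in string):
        return string
    else:
        return None

# A plain Python function cannot be applied to a Column directly; wrap it as a UDF first
check_ascii_udf = udf(check_ascii, StringType())
all_strings_with_non_ascii_chars = df.select('_c1','_c2').withColumn('check', check_ascii_udf(df._c1)).filter('check is not null').select('check')
all_strings_with_non_ascii_chars.show()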
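
Since the question asks for both the cleaned string and the offending values, the two steps can be combined in one pass. The following is only a minimal sketch in plain Python, not the asker's actual UDF: the helper name split_non_ascii, the NON_ASCII pattern and the sample rows are all made up for illustration.

```python
import re

# Matches any character outside the 7-bit ASCII range (hypothetical helper pattern).
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def split_non_ascii(value):
    """Return (cleaned, offending); offending is None when value is pure ASCII."""
    if value is None:
        return None, None
    cleaned = NON_ASCII.sub("", value)
    # Keep the original value only if something was actually removed.
    offending = value if cleaned != value else None
    return cleaned, offending

# Illustrative sample data, standing in for one DataFrame column.
rows = ["plain", "café", None, "naïve text"]
results = [split_non_ascii(r) for r in rows]

for cleaned, offending in results:
    if offending is not None:
        print("non-ASCII value:", offending, "->", cleaned)
```

In Spark, a function like this could be wrapped with udf and a struct return type. If efficiency is the main concern, the built-in column functions may be a better fit, since they avoid the Python UDF overhead: Column.rlike with the same pattern to filter the erroneous rows, and regexp_replace to strip the characters.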