Question

First, please download my dataset from http://alexandervanloon.nl/survey_oss.csv, then run the following script to get some scatter plots:

# read data and attach it
survey <- read.table("survey_oss.csv", header=TRUE)
attach(survey)

# plot for inhabitants
png("scatterINHABT.png")
plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1)
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

# plot for inhabitants divided by 1000
png("scatterINHABT_divided.png")
plot(INHABT/1000, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1)
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

# plot for inhabitants in logarithmic scale
png("scatterINHABT_log.png")
plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1, log="x")
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

# plot for inhabitants in logarithmic scale and divided by 1000
png("scatterINHABT_log_divided.png")
plot(INHABT/1000, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", las=1, log="x")
abline(lm(OSSADP~INHABT)) # regression line (y~x)
dev.off()

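As you can see, the problem in the first scatter plot is that R decides to use scientific notation, and the data looks odd because of the outliers. That is why I want the x-axis to show inhabitants in thousands and to use a logarithmic scale.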

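The problem is twofold. First, I can get rid of the scientific notation by simply dividing the inhabitants by 1000, but this produces a flat, horizontal regression line, unlike the one in the first plot. I know there are other ways around this, such as those in "Do not want scientific notation on plot axis", but I could not adapt that code to my situation.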

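Second, switching the x-axis to a logarithmic scale also makes the regression line flat. Google gives https://stat.ethz.ch/pipermail/r-help/2006-January/086500.html as the first result for a possible solution, and I tried the suggested abline(lm(OSSADP~log10(INHABT))), but this produces a vertical regression line. If I both divide by 1000 and use a logarithmic scale, the line is horizontal as well.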

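I am a social scientist without any background in mathematics or statistics, so I fear I may have missed something obvious; if so, I apologize. Many thanks in advance for any help.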

Answer 1

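Scientific notation was covered on the R mailing list a while ago; you can control when R switches to scientific notation through the scipen option, which is set with options().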

options(scipen=10)
plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS")
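To see what scipen does without needing the dataset, here is a minimal sketch using format() on a single made-up number (the value 100000 is an assumption for illustration, not taken from survey_oss.csv). R compares the width of the fixed and scientific representations, and scipen acts as a penalty added against the scientific one.

```r
# Hypothetical population value; not taken from survey_oss.csv.
x <- 100000

old <- options(scipen = 0)    # R's default: no penalty either way
sci <- format(x)              # "1e+05", because it is shorter than "100000"

options(scipen = 10)          # penalise scientific notation by 10 characters
fixed <- format(x)            # "100000"

options(old)                  # restore whatever setting was active before

print(c(sci, fixed))
```

Restoring the old value with options(old) keeps the change from leaking into the rest of the session.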

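Second, the problem with dividing by 1000 is that plot and abline were not both dividing by a thousand: the plot used INHABT/1000, but the model was still fitted on INHABT. This works: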

plot(INHABT/1000, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS")
abline(lm(OSSADP~I(INHABT/1000))) # Fixed regression line.

The I() is necessary because the / symbol has a different meaning inside a formula.
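The reason the unscaled line looks flat can be checked numerically. The sketch below uses synthetic data (inhabt and ossadp are made-up stand-ins for INHABT and OSSADP, and the coefficients 2 and 1e-5 are arbitrary assumptions): rescaling x by 1000 leaves the intercept alone and multiplies the slope by 1000, so a slope fitted per inhabitant is 1000 times too shallow for an axis in thousands.

```r
set.seed(1)

# Synthetic stand-ins for INHABT and OSSADP; not the real survey data.
inhabt <- runif(50, min = 1000, max = 500000)
ossadp <- 2 + 1e-5 * inhabt + rnorm(50)

fit_raw <- lm(ossadp ~ inhabt)             # slope per inhabitant
fit_k   <- lm(ossadp ~ I(inhabt / 1000))   # slope per 1000 inhabitants

# I() makes R evaluate inhabt / 1000 arithmetically; without it,
# "/" is read as the formula nesting operator.
slope_raw <- unname(coef(fit_raw)[2])
slope_k   <- unname(coef(fit_k)[2])

print(coef(fit_raw))
print(coef(fit_k))
print(all.equal(slope_k, slope_raw * 1000))
```

Drawing fit_raw on a plot whose x-axis is in thousands therefore gives a line that rises 1000 times too slowly, which is why it appears horizontal.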

Also, your las argument is unnecessary.

Answer 2

I solved the horizontal-line problem by using log="x" in plot() together with log10(INHABT) in the regression formula, rather than log10 alone:

plot(INHABT, OSSADP, xlab="Inhabitants", ylab="Adoption of OSS", log="x")
abline(lm(OSSADP~log10(INHABT)))
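As a rough, self-contained illustration of why the two pieces have to go together, the sketch below uses synthetic data (inhabt and ossadp are hypothetical stand-ins, and the true slope of 0.5 is an arbitrary assumption). On a plot drawn with log="x", abline() by default (untf = FALSE) draws its line in the log10-transformed coordinates, so the coefficients must come from a model fitted on log10(x).

```r
set.seed(1)

# Synthetic stand-ins for INHABT and OSSADP; not the real survey data.
inhabt <- 10^runif(50, min = 3, max = 6)           # 1,000 to 1,000,000
ossadp <- 1 + 0.5 * log10(inhabt) + rnorm(50, sd = 0.2)

# Fit on the same scale the axis will use.
fit_log <- lm(ossadp ~ log10(inhabt))

# Null PDF device so the sketch writes no file; use png("...") as in
# the question to save an image instead.
pdf(file = NULL)
plot(inhabt, ossadp, log = "x",
     xlab = "Inhabitants", ylab = "Adoption of OSS")
abline(fit_log)   # intercept and slope are interpreted against log10(x)
dev.off()

print(coef(fit_log))
```

The same reasoning explains the vertical line reported in the question: coefficients from a log10 model drawn on a plot without log="x" are interpreted against the raw inhabitant counts, where they no longer describe the data.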
