Question

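I have a set of APIs that I need to pull data from, covering four different categories of data. The data will then be used for reporting in Excel.

I originally created web queries in Excel, but my laptop crashes because there are too many queries to refresh. Does anyone know a smart workaround?

Here is an example of one of the APIs I will be pulling from (there are 40 different ones in total): https://api.similarweb.com/SimilarWebAddon/id.priceprice.com/all The data points I need are:

EstimatedMonthlyVisits, TopOrganicKeywords, OrganicSearchShare, TrafficSources

How can I create an automated report that queries the data above on request?

Many thanks.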

Answer 1

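If Excel is crashing under the load, which does not surprise me, you should consider using Python or R for this task. The R walkthrough below works through a sample PubMed XML file; first install and load the packages it uses.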

# Install the packages once per machine
install.packages(c("XML", "plyr", "ggplot2", "gridExtra"))

# Load them in each session
library(XML)
library(plyr)
library(ggplot2)
library(gridExtra)

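Next, we set the working directory and parse the XML file so that R can access the data in it. This is essentially reading the file into R. Then, to confirm that R recognizes the file as XML, we check its class. Indeed, R sees it as XML.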

setwd("C:/Users/Tobi/Documents/R/InformIT") # change this path to match your machine
xmlfile <- xmlParse("pubmed_sample.xml")
class(xmlfile) # "XMLInternalDocument" "XMLAbstractDocument"

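Now we can start exploring the XML. We may want to confirm that our HTTP query to Entrez returned the right results, just as if we had queried the PubMed website. We start by looking at the content of the first node, the root, PubmedArticleSet. We can also find out how many child nodes the root has and what they are called. This corresponds to checking how many entries the XML file contains. The root's children are all named PubmedArticle.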

xmltop = xmlRoot(xmlfile) #gives content of root
class(xmltop)#"XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) #give name of node, PubmedArticleSet
xmlSize(xmltop) #how many children in node, 19
xmlName(xmltop[[1]]) #name of root's children

To look at the first two entries, we can do the following.

# have a look at the content of the first child entry
xmltop[[1]]
# have a look at the content of the 2nd child entry
xmltop[[2]]

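We continue exploring by looking at the nodes beneath the root's children. As with the root, we can list the names and sizes of these child nodes as well as their attributes. In this case, the child nodes are MedlineCitation and PubmedData.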

#Root Node's children
xmlSize(xmltop[[1]]) #number of nodes in each child
xmlSApply(xmltop[[1]], xmlName) #name(s)
xmlSApply(xmltop[[1]], xmlAttrs) #attribute(s)
xmlSApply(xmltop[[1]], xmlSize) #size

We can also use these child nodes to split up each of the 19 entries. Here we do so for the first and second entries:

#take a look at the MedlineCitation subnode of 1st child
xmltop[[1]][[1]]
#take a look at the PubmedData subnode of 1st child
xmltop[[1]][[2]]

#subnodes of 2nd child
xmltop[[2]][[1]]
xmltop[[2]][[2]]

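Separating the entries is really just indexing into the XML tree structure. We can keep doing this until we exhaust a path or, in XML terms, reach the end of a branch. We can index either by the position of a child node or by its actual name: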

#we can keep going till we reach the end of a branch
xmltop[[1]][[1]][[5]][[2]] #title of first article
xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']] #same command, but more readable

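Finally, we can convert the XML into a more familiar structure, a data frame. Because the data and nodes are not uniformly formatted, our command completes with errors. We therefore have to check that all the data from the XML made it into the data frame correctly. In fact, there are duplicated rows, because separate rows were created for tag attributes. For example, the ELocationID node has two attributes, ValidYN and EIDType. Take a moment to note how this separation produces the duplicates.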

# Turning XML into a data frame
Madhu2012 = ldply(xmlToList("pubmed_sample.xml"), data.frame) # completes with errors: "row names were found from a short variable and have been discarded"
View(Madhu2012) # for easy checking that the data is properly formatted
Madhu2012.Clean = Madhu2012[Madhu2012[25] == 'Y', ] # gets rid of duplicated rows
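A possible alternative to filtering out the duplicated rows afterwards is to pull only the fields you want with XPath, so that attributes never turn into extra rows. This is a minimal sketch and not part of the linked article: the inline document is made-up data that only mimics the nesting of the PubMed sample, and the column names pmid and title are chosen here for illustration.

```r
library(XML)

# A tiny inline document with the same nesting as the PubMed sample (made-up content).
doc <- xmlParse(
  "<PubmedArticleSet>
     <PubmedArticle><MedlineCitation><PMID>1</PMID>
       <Article><ArticleTitle>First title</ArticleTitle></Article>
     </MedlineCitation></PubmedArticle>
     <PubmedArticle><MedlineCitation><PMID>2</PMID>
       <Article><ArticleTitle>Second title</ArticleTitle></Article>
     </MedlineCitation></PubmedArticle>
   </PubmedArticleSet>",
  asText = TRUE
)

# xpathSApply returns one value per matching node, giving one row per article.
articles <- data.frame(
  pmid  = xpathSApply(doc, "//MedlineCitation/PMID", xmlValue),
  title = xpathSApply(doc, "//Article/ArticleTitle", xmlValue),
  stringsAsFactors = FALSE
)
```

This only lines up cleanly when every article has exactly one of each requested node; if some entries lack a field, the vectors will differ in length and the rows need to be built per article instead.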

Here is a link to help you get started.

http://www.informit.com/articles/article.aspx?p=2215520
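The walkthrough above deals with XML, while the endpoint in the question appears to return JSON. Assuming that it does, and that the four requested names are top-level keys in the response (neither is confirmed here), the same idea could be sketched with the jsonlite package. The helper names extract_fields, api_url and fetch_all are hypothetical, made up for this example.

```r
library(jsonlite)

# The four data points named in the question.
fields <- c("EstimatedMonthlyVisits", "TopOrganicKeywords",
            "OrganicSearchShare", "TrafficSources")

# Keep only the requested fields from one parsed response.
# Assumption: the fields are top-level keys; missing ones come back as NA.
extract_fields <- function(parsed, wanted = fields) {
  out <- lapply(wanted, function(f) if (is.null(parsed[[f]])) NA else parsed[[f]])
  names(out) <- wanted
  out
}

# Build the URL for one domain, following the pattern shown in the question.
api_url <- function(domain) {
  paste0("https://api.similarweb.com/SimilarWebAddon/", domain, "/all")
}

# Fetch each domain in turn; a failed request yields NULL instead of stopping the run.
# The fetch argument can be swapped out, for example to test without network access.
fetch_all <- function(domains,
                      fetch = function(u) fromJSON(u, simplifyVector = FALSE)) {
  results <- lapply(domains, function(d) {
    tryCatch(extract_fields(fetch(api_url(d))), error = function(e) NULL)
  })
  names(results) <- domains
  results
}
```

Calling fetch_all(c("id.priceprice.com")) would then return a named list with one entry per domain. How to flatten that into a table for Excel depends on the shape of each field, so it is worth inspecting one result with str() before writing anything out with write.csv or jsonlite's write_json.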

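If you have never used R before, it takes a little getting used to, but it is worth it. I have been using it for a few years now, and compared with Excel I have seen R run anywhere from a few hundred percent to many thousands of percent faster. Good luck.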
