I recently discovered that Wikipedia's WikiProjects are categorized by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As the link shows, there are 34 disciplines. I would like to know whether it is possible to obtain all the Wikipedia articles associated with each of these disciplines.

For example, consider WikiProject Computer science. Can I use the WikiProject Computer science category to get all the Wikipedia articles related to computer science? If so, is there a data dump for this, or is there some other way to obtain the data?

I am currently using Python (i.e. pywikibot and pymediawiki), but answers in other languages are welcome too.

Happy to provide more details if needed.
Answer 0 (score: 2)
You can use API:Categorymembers to get the list of subcategories and pages. Set the "cmtype" parameter to "subcat" to get subcategories, and set the "cmnamespace" parameter to "0" to get articles.
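To illustrate those parameters, here is a minimal Python sketch of a single request using only the standard library (the question mentions pywikibot/pymediawiki, but a plain GET against the API shows the parameters most directly; the category names in the usage comments are only examples):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def fetch_members(category, cmtype="page", cmnamespace=None):
    """Return up to 500 member titles of `category` from a single request."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,  # "page" for pages, "subcat" for subcategories
        "cmlimit": 500,
    }
    if cmnamespace is not None:
        params["cmnamespace"] = cmnamespace  # 0 restricts results to articles
    req = urllib.request.Request(
        API + "?" + urllib.parse.urlencode(params),
        headers={"User-Agent": "category-fetch-example/0.1"},  # identify the client
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [m["title"] for m in data["query"]["categorymembers"]]

# Subcategories of the disciplines category:
#   fetch_members("Category:WikiProjects by discipline", cmtype="subcat")
# Articles only (namespace 0) in a content category:
#   fetch_members("Category:Computer science", cmnamespace=0)
```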
You can also get the list from the database (the category hierarchy is in the categorylinks table, and article information is in the page table).
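As a sketch of the join involved, here is a self-contained example using an in-memory SQLite database with a simplified version of those two tables (the column names follow the MediaWiki schema, but the rows are made up; against a real replica the same shape of query runs on the full categorylinks and page tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
-- toy rows for illustration only
INSERT INTO page VALUES (1, 0, 'Algorithm'), (2, 0, 'Compiler'), (3, 14, 'Sorting_algorithms');
INSERT INTO categorylinks VALUES (1, 'Computer_science'), (2, 'Computer_science'), (3, 'Computer_science');
""")

# cl_from is the page_id of the member page; cl_to is the category name
rows = conn.execute("""
    SELECT p.page_title
    FROM categorylinks AS c
    JOIN page AS p ON p.page_id = c.cl_from
    WHERE c.cl_to = 'Computer_science'
      AND p.page_namespace = 0   -- articles only (filters out the subcategory)
    ORDER BY p.page_title
""").fetchall()
print([r[0] for r in rows])  # prints ['Algorithm', 'Compiler']
```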
Answer 1 (score: 2)
As I suggested, and to add to @arash's answer, you can use the Wikipedia API to get the Wikipedia data. Here is a link with instructions on how to do that: API:Categorymembers#GET_request
As you mentioned in the comments, you need to fetch the data programmatically, so below is sample code in JavaScript. It fetches the first 500 names from Category:WikiProject_Computer_science_articles and prints them as output. You can port this example to the language of your choice:
// Importing the module
const fetch = require('node-fetch');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop.ids=1&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Iterating over all the response data
    for (let i = 0; i < t.query.categorymembers.length; i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});
To write the data to a file, you can do the following:
// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop.ids=1&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Collecting the names into an array
    let titles = [];
    for (let i = 0; i < t.query.categorymembers.length; i++) {
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    // Joining the array into a comma-separated string before writing
    fs.writeFileSync('pathtotitles\\titles.txt', titles.join(','));
});
The code above stores the data comma-separated, because we write out a joined JavaScript array. If you want one title per line, without the commas, you need to do this:
// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop.ids=1&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Building a newline-separated string of names
    let titles = '';
    for (let i = 0; i < t.query.categorymembers.length; i++) {
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});
Because of the cmlimit cap, we cannot fetch more than 500 titles per request, so we need to use cmcontinue to check for and fetch the next pages...
Try the code below, which fetches all the titles of a specific category, prints them, and appends the data to a file:
// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Method to fetch one page of results and append the titles to a file;
// returns the 'cmcontinue' token, or undefined when there are no more pages
const fetchTheData = async (url) => {
    const data = await fetch(url).then(res => res.json());
    // Building a newline-separated string of names
    let titles = '';
    for (const member of data.query.categorymembers) {
        console.log(member.title);
        titles += member.title + "\n";
    }
    // Appending to the file
    fs.appendFileSync('pathtotitles\\titles.txt', titles);
    // The 'continue' object is absent on the last page
    return data.continue && data.continue.cmcontinue;
};

// Method which follows the continuation tokens until the category is exhausted
const fetchAllPages = async (url) => {
    let nextPage = await fetchTheData(url);
    while (nextPage) {
        console.log("=> The next page URL is : " + url + '&cmcontinue=' + nextPage);
        // Constructing the next page URL with the next page token and fetching it
        nextPage = await fetchTheData(url + '&cmcontinue=' + nextPage);
    }
    console.log("===>>> Finished fetching...");
};

// Calling to begin extraction
fetchAllPages(url);
Hope this helps...
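Since the question mentions using Python, the same continuation loop can be sketched there too, using only the standard library; it follows the cmcontinue token until the category is exhausted rather than assuming a fixed page count (the category name in the usage comment is just an example):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def all_category_members(category):
    """Yield every member title of `category`, following cmcontinue."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": 500,
    }
    while True:
        req = urllib.request.Request(
            API + "?" + urllib.parse.urlencode(params),
            headers={"User-Agent": "category-fetch-example/0.1"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break  # last page reached
        params["cmcontinue"] = data["continue"]["cmcontinue"]

# Writing one title per line to a file:
#   with open("titles.txt", "w") as f:
#       for title in all_category_members("Category:WikiProject Computer science articles"):
#           f.write(title + "\n")
```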