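I wrote a regular expression to parse bibliography entries in scientific articles. We use the AMA citation style, where a journal reference can look like this: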
"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24(7): 3057-3067."
or without an issue number:
"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24: 3057-3067."
or with only a first page (an electronic article number):
"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24(7): 3057."
or with only a volume number (when the article is cited ahead of print):
"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24."
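My pattern matches all of these cases and captures each field in its own group (the backslashes are doubled because it is written as a Java string):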
(.*?)\\.(.*?)\\.(.*?)(?<year>\\d+)\\s*?;?\\s*?(?:(?<volume>\\d+))?(?:\\((?<issue>\\d+)\\))?\\s*?(?::\\s*?(?<fpage>\\d+|[A-Za-z]+\\d+))?(?:[\\-\\–](?<lpage>\\d+))?\\.
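The problem is that authors keep putting spaces around the dash between the first and last page. Can the pattern be changed to match this as well?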
"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24(7): 3057 - 3067."
Here is an example where you can see that the pattern does not match this.
Answer 0 (score: 1)
The corrected regular expression is:
(.*?)\.(.*?)\.(.*?)(?<year>\d+)\s*?;?\s*?(?:(?<volume>\d+))?(?:\((?<issue>\d+)\))?\s*?(?::\s*?(?<fpage>\d+|[A-Za-z]+\d+))?(?:[ ]*[\-\–][ ]*(?<lpage>\d+))?\.
This one, https://regex101.com/r/RAdNgb/2, solves your problem. Please check it.
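Since the question mentions Java, here is a minimal sketch of how the adjusted pattern might be used there. The class name `AmaCitationParser` and the `parse` helper are made up for illustration and are not from the original post. The sketch also makes two small assumptions of its own: it uses `\\s*` instead of `[ ]*` around the dash so that tabs or other whitespace are tolerated, and it writes the en dash as `\\u2013` so the source file's encoding does not matter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class AmaCitationParser {

    // Same structure as the pattern above, written as a Java string.
    // Only the last optional group differs from the question's pattern:
    // it allows whitespace on both sides of a hyphen or en dash (\u2013).
    private static final Pattern CITATION = Pattern.compile(
            "(.*?)\\.(.*?)\\.(.*?)(?<year>\\d+)\\s*?;?\\s*?"
            + "(?:(?<volume>\\d+))?(?:\\((?<issue>\\d+)\\))?\\s*?"
            + "(?::\\s*?(?<fpage>\\d+|[A-Za-z]+\\d+))?"
            + "(?:\\s*[-\\u2013]\\s*(?<lpage>\\d+))?\\.");

    // Returns the named groups in a map, or null if nothing matches.
    // Groups that did not take part in the match map to null.
    static Map<String, String> parse(String citation) {
        Matcher m = CITATION.matcher(citation);
        if (!m.find()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : new String[] {"year", "volume", "issue", "fpage", "lpage"}) {
            fields.put(name, m.group(name));
        }
        return fields;
    }

    public static void main(String[] args) {
        String spaced = "Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. "
                + "Psychological distress, health, and socio-economic factors in caregivers "
                + "of terminally ill patients: a nationwide population-based cohort study. "
                + "Support Care Cancer. 2016; 24(7): 3057 - 3067.";
        System.out.println(parse(spaced));
    }
}
```

It is worth checking the captured groups and not just whether the pattern matches. Every group after the year is optional, so with `find()` the regex engine can backtrack and still report a match with the wrong number captured as the year.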