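I need to convert all the arguments in a character vector of text into a format that is easy to refer to: an R list with three fields (presenter, time, and text).
For example, the presenter should be HARPER'S, the time should be [Day 1, 9:00 A.M.], and the text should be the rest of the argument.
I need to count the number of arguments in the text (each line that begins with a header such as HARPER'S [Day 1, 9:00 A.M.] starts a new argument). I want to create a new list object named "arguments", in which each element is a sub-list with three elements: "presenter", "time", and "text".
I then want to extract the presenter names and times into two character vectors (removing any indentation), while keeping the "presenter" and "time" elements in each argument's sub-list.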
This is the text:
[1] "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was"
[2] "used to describe the work of brilliant students who explored and expanded the"
[3] "uses to which this new technology might be employed. There was even talk of a"
[4] "\"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark"
[5] "connotations, suggestion the actions of a criminal. What is the hacker ethic,"
[6] "and does it survive?"
[7] ""
[8] "ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud. It"
[9] "survives in anyone excited by technology's power to turn many small,"
[10] "insignificant things into one vast, beautiful thing. It is a fraud because"
[11] "there is nothing magical about computers that causes a user to undergo"
[12] "religious conversion and devote himself to the public good. Early automobile"
[13] "inventors were hackers too. At first the elite drove in luxury. Later"
[14] "practically everyone had a car. Now we have traffic jams, drunk drivers, air"
[15] "pollution, and suburban sprawl. The old magic of an automobile occasionally"
[16] "surfaces, but we possess no delusions that it automatically invades the"
[17] "consciousness of anyone who sits behind the wheel. Computers are power, and"
[18] "direct contact with power can bring out the best or worst in a person. It's"
[19] "tempting to think that everyone exposed to the technology will be grandly"
[20] "inspired, but, alas, it just ain't so."
[21] ""
[22] "BRAND [Day 1, 9:54 A.M.]: The hacker ethic involves several things. One is"
[23] "avoiding waste; insisting on using idle computer power -- often hacking into a"
[24] "system to do so, while taking the greatest precautions not to damage the"
[25] "system. A second goal of many hackers is the free exchange of technical"
[26] "information. These hackers feel that patent and copyright restrictions slow"
[27] "down technological advances. A third goal is the advancement of human"
[28] "knowledge for its own sake. Often this approach is unconventional. People we"
[29] "call crackers often explore systems and do mischief. The are called hackers by"
[30] "the press, which doesn't understand the issues."
[31] ""
[32] "KK [Day 1, 11:19 A.M.]: The hacker ethic went unnoticed early on because the"
[33] "explorations of basement tinkerers were very local. Once we all became"
[34] "connected, the work of these investigations rippled through the world. today"
[35] "the hacking spirit is alive and kicking in video, satellite TV, and radio. In"
[36] "some fields they are called chippers, because the modify and peddle altered"
[37] "chips. Everything that was once said about \"phone phreaks\" can be said about"
[38] "them too."
I tried to count the number of arguments:
length(grep("^([A-Z]+'*[A-Z]*)", text_data))
arguments = list(presenters = regmatches(text_data, regexpr("^([A-Z]+'*[A-Z]*)", text_data)), time = regmatches(text_data, regexpr("(\\[.*\\])", text_data)), text = regmatches(paste(unlist(text_data), collapse =" ")), regexpr("(:\\s.*)", regmatches(paste(unlist(text_data), collapse =" "))))
text_data
The list "arguments" should have length 55.
Example output for the first argument:
$presenter
[1] "HARPER'S"
$time
[1] "[Day 1, 9:00 A.M.]"
$text
[1] ": When the computer was young, the word hacking was used to describe the work of brilliant students who explored and expanded the uses to which this new technology might be employed. There was even talk of a \"hacker ethic.\" Somehow, in the succeeding years, the word has taken on dark connotations, suggestion the actions of a criminal. What is the hacker ethic, and does it survive?"
Thank you very much for your help.
Answer 0 (score: 1)
I suggest
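library(stringr)
# str_match_all() returns a list with one match matrix per input string
data <- str_match_all(paste(lines, collapse="\n"), "(?sm)^([A-Z]+(?:'[A-Z]+)?)\\s+(\\[[^\\]\\[]*\\]):\\s*(.*?)(?=\n{2}|\\z)")
presenterCol <- data[[1]][,2]  # capture group 1: presenter
timeCol <- data[[1]][,3]       # capture group 2: time
textCol <- data[[1]][,4]       # capture group 3: text
The point here is to join the lines with newlines so that a single regex can run over one multiline string and capture 1) the presenter, 2) the date and time inside the square brackets, and 3) the rest of the text, up to a blank line or the end of the whole string. Here lines stands for the input character vector (text_data in the question).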
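The question asks for a list of sub-lists rather than three separate vectors. A minimal sketch of that last step, assuming the stringr approach above: the sample input, the variable names m and arguments, and the choice to replace embedded line breaks with spaces are illustrative assumptions, not part of the original answer.

```r
library(stringr)

# Hypothetical sample input: shortened versions of two arguments from the question.
lines <- c(
  "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was",
  "used to describe the work of brilliant students.",
  "",
  "ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud. It",
  "survives in anyone excited by technology's power."
)

pattern <- "(?sm)^([A-Z]+(?:'[A-Z]+)?)\\s+(\\[[^\\]\\[]*\\]):\\s*(.*?)(?=\n{2}|\\z)"

# One row per argument; columns are: full match, presenter, time, text.
m <- str_match_all(paste(lines, collapse = "\n"), pattern)[[1]]

# Build one sub-list per argument. The line breaks inside each text are
# replaced with spaces (an assumption about the desired output).
arguments <- lapply(seq_len(nrow(m)), function(i) {
  list(
    presenter = m[i, 2],
    time      = m[i, 3],
    text      = gsub("\n", " ", m[i, 4], fixed = TRUE)
  )
})

length(arguments)  # number of arguments found
arguments[[1]]     # sub-list for the first argument
```

Note that this pattern consumes the colon and the spaces after it, so text starts at "When ..." rather than at ": When ..." as in the question's sample output; moving the colon inside the third group would keep it.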
See the regex demo.
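Regex details (the pattern is applied to the single string built by paste(lines, collapse="\n")):
- (?sm) - inline modifiers: s makes "." match line-break characters, and m makes "^" match at the start of every line
- ^ - start of a line
- ([A-Z]+(?:'[A-Z]+)?) - Group 1: one or more uppercase letters, then an optional sequence of "'" and one or more uppercase letters
- \\s+ - one or more whitespace characters
- (\\[[^\\]\\[]*\\]) - Group 2: "[", then zero or more characters other than "[" and "]", then "]"
- : - a colon
- \\s* - zero or more whitespace characters
- (.*?) - Group 3: any zero or more characters, as few as possible, up to the first...
- (?=\n{2}|\\z) - (a positive lookahead that requires its pattern to match immediately to the right of the current position) two newlines or the end of the whole string.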
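For comparison, the same three pieces can be extracted in base R without stringr. This is a swapped-in alternative, not the answer's method: it assumes that arguments are always separated by blank lines, groups the lines into blocks first, and then applies a simpler pattern to each block. The names sample_lines, block_id, blocks, rx, and arguments_base are hypothetical.

```r
# Hypothetical sample input standing in for the question's text_data.
sample_lines <- c(
  "HARPER'S [Day 1, 9:00 A.M.]: When the computer was young, the word hacking was",
  "used to describe the work of brilliant students.",
  "",
  "ADELAIDE [Day 1, 9:25 A.M.]: the hacker ethic survives, and it is a fraud. It",
  "survives in anyone excited by technology's power."
)

# Number the blocks: the counter increases at every blank line.
block_id <- cumsum(sample_lines == "")
keep <- sample_lines != ""

# Glue the (trimmed) lines of each block into one string.
blocks <- unname(vapply(
  split(trimws(sample_lines[keep]), block_id[keep]),
  paste, character(1), collapse = " "
))

# Same three capture groups as in the details above; no lookahead is needed
# because each block already holds exactly one argument.
rx <- "^([A-Z]+(?:'[A-Z]+)?)\\s+(\\[[^\\]\\[]*\\]):\\s*(.*)$"
parts <- regmatches(blocks, regexec(rx, blocks, perl = TRUE))
parts <- parts[lengths(parts) == 4]  # drop blocks without a presenter header

arguments_base <- lapply(parts, function(p) {
  list(presenter = p[2], time = p[3], text = p[4])
})

length(arguments_base)  # number of arguments found
```

Splitting on blank lines first keeps the regex short, at the cost of relying on the blank-line separator; the single-pattern version relies on it as well, through the \n{2} lookahead.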