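I have a fairly long string (~50k characters), shown here:
https://gist.github.com/anonymous/9de31de2e6fc9888f3debeda4698b739
I want to extract the numbers (always 1 or 2 digits) that always sit between "'>" and "<" and add them to a vector, in the correct order.
For example:
><td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td>
would output the vector [13, 9].
When I try to do this as shown below, I can't even get the string into R.
mystring <- "text here"
When I press Enter, a + appears next to the command line, so I think some symbols in the text are messing things up.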
Answer 0: (score: 3)
Since what you are trying to parse is HTML, it is best to use an HTML parsing package such as rvest:
[1] 13 9 8 8 1 2 0 8 11 2 13 5 13 4 4 5 4 7 3 8 10 13 1 7 14 13 10 2 0 8
[31] 13 0 10 5 11 9 3 1 4 3 5 12 4 14 1 9 13 5 9 7 12 10 2 10 14 4 11 11 13 8
[61] 8 10 10 12 12 6 8 13 7 2 2 9 10 9 13 3 14 14 0 14 4 11 14 6 10 2 0 0 10 14
[91] 2 8 3 6 14 6 1 9 11 12 1 12 4 0 7 9 2 10 1 12 0 8 0 9 3 11 11 0 8 5
[121] 0 6 1 9 8 10 7 4 7 0 3 12 10 11 11 8 4 11 1 5 12 2 14 9 12 8 1 9 14 13
[151] 8 2 1 5 7 9 14 14 12 3 6 3 9 0 6 9 3 3 10 3 8 6 9 2 4 12 2 2 14 7
[181] 12 8 0 8 12 2 12 9 6 8 9 9 3 7 9 0 6 13 0 12 3 14 12 4 8 9 14 4 5 9
[211] 6 3 2 5 1 2 0 5 0 5 9 0 12 14 11 11 7 4 12 1 14 2 13 3 13 2 0 12 13 6
[241] 5 3 13 9 12 2 11 6 8 12 9 6 13 9 0 0 4 2 1 0 0 3 0 3 7 9 11 1 8 10
[271] 11 13 12 9 10 8 10 3 7 12 4 9 0 4 14 1 7 0 7 1 2 6 0 6 6 1 0 9 4 8
[301] 0 7 13 8 11 4 1 12 1 14 11 13 9 12 8 2 8 7 12 13 12 5 8 5 10 2 7 5 9 12
[331] 12 13 8 7 6 4 12 13 4 9 12 2 0 11 8 9 1 10 5 10 9 11 10 1 8 1 12 10 9 5
[361] 7 10 5 2 7 12 4 10 6 9 0 6 0 4 13 7 0 8 3 3 11 8 4 12 10 5 7 1 11 3
[391] 1 11 7 14 13 13 14 4 2 11 2 12 3 6 14 10 6 13 9 12 4 13 10 3 9 11 8 4 8 10
[421] 9 6 3 6 7 5 11 0 2 7 6 11 11 13 13 12 7 9 6 9 5 12 14 3 13 10 1 2 7 1
[451] 14 1 0 7 8 13 6 3 9 12 2 2 2 7 11 1 2 14 6 13 11 3 6 11 5 9 0 9 13 10
[481] 11 13 3 12 12 3 7 6 5 14 3 9 10 6 13 5 7 4 5 12 8 14 5 6 8 7 0 0 2 1
[511] 1 9 13 13 5 6 10 8 0 2 3 4 4 5 14 13 5 2 2 4 6 5 9 6 14 8 4 12 4 6
[541] 9 1 4 2 4 9 1 7 1 10 0 1 1 8 6 5 8 4 9 11 14 2 3 8 2 11 3 7 11 2
[571] 4 9 5 3 4 1 4 8 13 4 8 8 1 7 2 7 3 11 13 1 13 7 9 3 7 7 4 12 9 14
[601] 11 9 2 12 12 14 10 4 12 11 12 10 14 3 11 6 12 3 6 3 11 8 10 2 6 3 1 11 2 6
[631] 0 8 12 5 5 3 6 2 14 11 7 14 14 8 11 2 7 0 10 2 0 4 8 9 8 3 2 13 4 10
[661] 2 5 13 2 2 12 12 0 10 4 1 5 13 3 10 3 11 2 5 3 9 6 11 0 8 12 0 11 2 11
[691] 7 8 1 3 4 14 4 4 9 5 12 7 6 9 12 13 2 11 1 11 12 0 4 6 10 8 5 14 7 6
[721] 4 7 2 5 2 14 3 8 10 6 14 7 14 3 2 6 5 0 3 0 12 0 12 3 5 5 8 5 14 6
[751] 10 14 5 2 3 11 3 4 3 11 4 2 0 11 11 13 4 0 6 14 2 6 9 10 4 9 5 7 1 13
[781] 8 3 13 3 10 4 8 1 3 11 2 8 5 10 7 6 10 14 14 2 2 12 8 4 13 7 11 13 4 5
[811] 7 2 3 8 14 3 9 12 6 2 6 0 3 5 8 8 0 14 13 13 7 10 9 6 1 0 4 8 6 8
[841] 14 1 9 0 9 2 7 10 8 5 10 7 1 8 2 13 3 1 8 12 12 2 5 6 3 9 4 5 4 13
[871] 6 3 10 7 9 2 1 12 1 11 0 10 0 11 8 8 0 7 0 11 10 3 14 6 9 11 11 0 12 1
[901] 10 13 1 7 7 2 0 3 13 9 2 4 12 3 0 11 1 8 8 13 12 6 8 13 8 1 13 11 2 9
[931] 11 8 10 8 3 14 6 14 7 6 7 10 3 11 3 13 11 3 9 13 8 10 8 7 12 4 11 12 12 9
[961] 6 10 2 8 13 7 11 5 7 12 10 14 1 6 7 6 7 2 3 5 13 6 10 9 5 2 0 1 11 8
[991] 9 5 1 3 3 1 12 1 13 2 14 5 7 1 10 9 0 9 11 10 6 2 7 12 10 6 2 10 13 4
[1021] 9 9 14 4 4 5 7 13 13 13 6 7 12 1 6 11 12 14 4 11 6 4 10 0 9 12 10 10 13 8
[1051] 3 3 0 8 5 14 10 3 7 5 0 14 5 6 10 14 7 4 8 9 1 6 14 1 14 5 5 14 4 11
[1081] 12 14 9 13 14 13 2 13 11 9 14 2 1 9 8 11 13 11 14 13 3 4 9 6 9 6 10 13 1 12
[1111] 10 14 11 5 8 9 3 5 6 14 1 11 10 12 7 7 2 13 13 12 12 4 3 14 6 4 2 5 9 4
[1141] 14 11 6 4 11 6 4 4 8 2 2 5 14 1 7 11 8 9 11 11 10 6 14 3 0 3 8 8 14 13
[1171] 10 6 10 4 9 12 0 9 2 9 13 12 1 12 3 5 5 3 12 2 1 5 1 0 10 7 3 10 14 13
[1201] 11 8 0 10 12 9 4 5 4 8 5 6 2 11 7 5 5 8 4 9 9 10 14 3 7 9 1 9 9 8
[1231] 1 8 11 5 2 4 9 14 14 6 10 7 4 14 6 5 1 4 3 8 13 10 5 1 8 8 6 8 7 1
[1261] 14 4 4 7 2 12 10 8 10 5 6 7 2 3 5 13 1 2 9 8 5 14 1 11 9 5 8 12 13 0
[1291] 4 2 0 8 8 2 5 3 13 11 5 11 14 14 9 12 4 5 9 3 13 14 1 5 10 4 9 6 5 8
[1321] 7 5 7 3 14 8 4 8 4 6 5 8 11 0 14 13 2 13 12 13 3 4 7 8 11 4 14 12 3 6
[1351] 11 8 8 9 6 7 4 3 10 9 2 9 12 12 0 1 10 9 8 0 12 9 3 14 13 7 8 12 10 9
[1381] 10 10 2 11
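A minimal sketch of the rvest approach, assuming every value sits in a `td` cell carrying the `td-val` class as in the question's example. The two-cell `html` string is an illustrative stand-in for the full gist, and the `td.td-val` selector is an assumption about the markup rather than something the answer states.

```r
library(rvest)

# Illustrative stand-in for the full ~50k-character string (assumption).
html <- "<table><tr><td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td></tr></table>"

# Parse the markup and select the value cells in document order.
# html_elements() is the current name; older rvest versions call it html_nodes().
cells <- html_elements(read_html(html), "td.td-val")

# Read each cell's text and convert it to an integer vector.
nums <- as.integer(html_text(cells))
print(nums)
```

Because the parser works on the document tree instead of on raw characters, quoting differences or extra attributes in the tags should not affect the result.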
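Answer 1: (score: 2)
You can import the string straight from the URL with readLines; clicking the Raw button on the gist page gives the URL of the raw text.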
mystring <- readLines("https://gist.githubusercontent.com/anonymous/9de31de2e6fc9888f3debeda4698b739/raw/c07c2d6c6f00060806b15ec57ed06d4a4e0d9d74/gistfile1.txt")
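Reading the text from a file or URL sidesteps the `+` prompt from the question, which means R is still waiting for the expression to end, for example because a quote character inside the pasted text unbalanced the string literal. A small sketch of the same idea for a locally saved copy, assuming a temporary file stands in for it; `local_path`, `file_lines`, and `local_string` are illustrative names. Since `readLines` returns one element per line, the lines are collapsed back into a single string:

```r
# Temporary file standing in for a locally saved copy of the HTML (assumption).
local_path <- tempfile(fileext = ".txt")
writeLines(c("<td class='td-val ball-8'>13</td>",
             "<td class=\"td-val ball-8\">9</td>"), local_path)

# readLines() gives one element per line, with no quoting issues.
file_lines <- readLines(local_path)

# Collapse the lines into a single string.
local_string <- paste(file_lines, collapse = "")
print(nchar(local_string))
```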
A regular expression like the one below should then give you all the numbers you need:
library(stringr)
num <- gsub(">|<", "", str_extract_all(mystring, ">\\d+<", simplify = TRUE))
head(as.vector(num))
[1] "13" "9" "8" "8" "1" "2"
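The result above is a character vector. Also, if `mystring` has more than one element, `simplify = TRUE` yields a matrix that `as.vector` flattens column by column, which would not preserve document order. A hedged base-R sketch that returns integers in order; the two-element `html_lines` vector is made-up sample input, not the gist's content:

```r
# Made-up sample input: two lines, as readLines() might return them (assumption).
html_lines <- c("<td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td>",
                "<td class='td-val ball-8'>8</td>")

# Match runs of 1-2 digits that sit directly between '>' and '<'.
# perl = TRUE enables the lookbehind/lookahead, so only the digits are captured.
matches <- regmatches(html_lines,
                      gregexpr("(?<=>)\\d{1,2}(?=<)", html_lines, perl = TRUE))

# unlist() walks the lines in order, so the matches stay in document order.
nums <- as.integer(unlist(matches))
print(nums)
```

The lookarounds remove the need for a separate `gsub` step, and `as.integer` gives the numeric vector the question asks for.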