Extracting numbers from a very long string into a vector

Time: 2016-06-03 02:51:20

Tags: r parsing html-parsing

I have a rather long string (~50k characters) that looks like this:

https://gist.github.com/anonymous/9de31de2e6fc9888f3debeda4698b739

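I want to extract the numbers (always 1 or 2 digits) that always sit between "'>" and "<", and add them to a vector (they must stay in the correct order).

For example: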

><td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td>

would output the vector [13, 9].

When I try to do this, I cannot even get the string into R in the form:

mystring <- "text here"

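When I press Enter, a + shows up next to the command line, so I think some of the symbols in the text are messing things up.

2 answers:

Answer 0 (score: 3)

Since what you are trying to parse is HTML, you are better off using an HTML-parsing package such as rvest: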

   [1] 13  9  8  8  1  2  0  8 11  2 13  5 13  4  4  5  4  7  3  8 10 13  1  7 14 13 10  2  0  8
  [31] 13  0 10  5 11  9  3  1  4  3  5 12  4 14  1  9 13  5  9  7 12 10  2 10 14  4 11 11 13  8
  [61]  8 10 10 12 12  6  8 13  7  2  2  9 10  9 13  3 14 14  0 14  4 11 14  6 10  2  0  0 10 14
  [91]  2  8  3  6 14  6  1  9 11 12  1 12  4  0  7  9  2 10  1 12  0  8  0  9  3 11 11  0  8  5
 [121]  0  6  1  9  8 10  7  4  7  0  3 12 10 11 11  8  4 11  1  5 12  2 14  9 12  8  1  9 14 13
 [151]  8  2  1  5  7  9 14 14 12  3  6  3  9  0  6  9  3  3 10  3  8  6  9  2  4 12  2  2 14  7
 [181] 12  8  0  8 12  2 12  9  6  8  9  9  3  7  9  0  6 13  0 12  3 14 12  4  8  9 14  4  5  9
 [211]  6  3  2  5  1  2  0  5  0  5  9  0 12 14 11 11  7  4 12  1 14  2 13  3 13  2  0 12 13  6
 [241]  5  3 13  9 12  2 11  6  8 12  9  6 13  9  0  0  4  2  1  0  0  3  0  3  7  9 11  1  8 10
 [271] 11 13 12  9 10  8 10  3  7 12  4  9  0  4 14  1  7  0  7  1  2  6  0  6  6  1  0  9  4  8
 [301]  0  7 13  8 11  4  1 12  1 14 11 13  9 12  8  2  8  7 12 13 12  5  8  5 10  2  7  5  9 12
 [331] 12 13  8  7  6  4 12 13  4  9 12  2  0 11  8  9  1 10  5 10  9 11 10  1  8  1 12 10  9  5
 [361]  7 10  5  2  7 12  4 10  6  9  0  6  0  4 13  7  0  8  3  3 11  8  4 12 10  5  7  1 11  3
 [391]  1 11  7 14 13 13 14  4  2 11  2 12  3  6 14 10  6 13  9 12  4 13 10  3  9 11  8  4  8 10
 [421]  9  6  3  6  7  5 11  0  2  7  6 11 11 13 13 12  7  9  6  9  5 12 14  3 13 10  1  2  7  1
 [451] 14  1  0  7  8 13  6  3  9 12  2  2  2  7 11  1  2 14  6 13 11  3  6 11  5  9  0  9 13 10
 [481] 11 13  3 12 12  3  7  6  5 14  3  9 10  6 13  5  7  4  5 12  8 14  5  6  8  7  0  0  2  1
 [511]  1  9 13 13  5  6 10  8  0  2  3  4  4  5 14 13  5  2  2  4  6  5  9  6 14  8  4 12  4  6
 [541]  9  1  4  2  4  9  1  7  1 10  0  1  1  8  6  5  8  4  9 11 14  2  3  8  2 11  3  7 11  2
 [571]  4  9  5  3  4  1  4  8 13  4  8  8  1  7  2  7  3 11 13  1 13  7  9  3  7  7  4 12  9 14
 [601] 11  9  2 12 12 14 10  4 12 11 12 10 14  3 11  6 12  3  6  3 11  8 10  2  6  3  1 11  2  6
 [631]  0  8 12  5  5  3  6  2 14 11  7 14 14  8 11  2  7  0 10  2  0  4  8  9  8  3  2 13  4 10
 [661]  2  5 13  2  2 12 12  0 10  4  1  5 13  3 10  3 11  2  5  3  9  6 11  0  8 12  0 11  2 11
 [691]  7  8  1  3  4 14  4  4  9  5 12  7  6  9 12 13  2 11  1 11 12  0  4  6 10  8  5 14  7  6
 [721]  4  7  2  5  2 14  3  8 10  6 14  7 14  3  2  6  5  0  3  0 12  0 12  3  5  5  8  5 14  6
 [751] 10 14  5  2  3 11  3  4  3 11  4  2  0 11 11 13  4  0  6 14  2  6  9 10  4  9  5  7  1 13
 [781]  8  3 13  3 10  4  8  1  3 11  2  8  5 10  7  6 10 14 14  2  2 12  8  4 13  7 11 13  4  5
 [811]  7  2  3  8 14  3  9 12  6  2  6  0  3  5  8  8  0 14 13 13  7 10  9  6  1  0  4  8  6  8
 [841] 14  1  9  0  9  2  7 10  8  5 10  7  1  8  2 13  3  1  8 12 12  2  5  6  3  9  4  5  4 13
 [871]  6  3 10  7  9  2  1 12  1 11  0 10  0 11  8  8  0  7  0 11 10  3 14  6  9 11 11  0 12  1
 [901] 10 13  1  7  7  2  0  3 13  9  2  4 12  3  0 11  1  8  8 13 12  6  8 13  8  1 13 11  2  9
 [931] 11  8 10  8  3 14  6 14  7  6  7 10  3 11  3 13 11  3  9 13  8 10  8  7 12  4 11 12 12  9
 [961]  6 10  2  8 13  7 11  5  7 12 10 14  1  6  7  6  7  2  3  5 13  6 10  9  5  2  0  1 11  8
 [991]  9  5  1  3  3  1 12  1 13  2 14  5  7  1 10  9  0  9 11 10  6  2  7 12 10  6  2 10 13  4
[1021]  9  9 14  4  4  5  7 13 13 13  6  7 12  1  6 11 12 14  4 11  6  4 10  0  9 12 10 10 13  8
[1051]  3  3  0  8  5 14 10  3  7  5  0 14  5  6 10 14  7  4  8  9  1  6 14  1 14  5  5 14  4 11
[1081] 12 14  9 13 14 13  2 13 11  9 14  2  1  9  8 11 13 11 14 13  3  4  9  6  9  6 10 13  1 12
[1111] 10 14 11  5  8  9  3  5  6 14  1 11 10 12  7  7  2 13 13 12 12  4  3 14  6  4  2  5  9  4
[1141] 14 11  6  4 11  6  4  4  8  2  2  5 14  1  7 11  8  9 11 11 10  6 14  3  0  3  8  8 14 13
[1171] 10  6 10  4  9 12  0  9  2  9 13 12  1 12  3  5  5  3 12  2  1  5  1  0 10  7  3 10 14 13
[1201] 11  8  0 10 12  9  4  5  4  8  5  6  2 11  7  5  5  8  4  9  9 10 14  3  7  9  1  9  9  8
[1231]  1  8 11  5  2  4  9 14 14  6 10  7  4 14  6  5  1  4  3  8 13 10  5  1  8  8  6  8  7  1
[1261] 14  4  4  7  2 12 10  8 10  5  6  7  2  3  5 13  1  2  9  8  5 14  1 11  9  5  8 12 13  0
[1291]  4  2  0  8  8  2  5  3 13 11  5 11 14 14  9 12  4  5  9  3 13 14  1  5 10  4  9  6  5  8
[1321]  7  5  7  3 14  8  4  8  4  6  5  8 11  0 14 13  2 13 12 13  3  4  7  8 11  4 14 12  3  6
[1351] 11  8  8  9  6  7  4  3 10  9  2  9 12 12  0  1 10  9  8  0 12  9  3 14 13  7  8 12 10  9
[1381] 10 10  2 11
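
The vector above is the kind of result an HTML parser gives once the table cells are selected and converted to integers. A minimal sketch of that approach with rvest follows. It assumes the values live in td cells carrying the td-val class, as in the question's example; the two-cell fragment and the table wrapper around it are hypothetical stand-ins for the real string, not the gist's actual markup.

```r
library(rvest)

# Hypothetical fragment standing in for the real ~50k-character string.
# The <table><tr> wrapper is an assumption so the cells parse as a table.
mystring <- "<table><tr><td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td></tr></table>"

# Parse the literal HTML text into a document.
doc <- read_html(mystring)

# Select the cells by CSS class; document order is preserved.
cells <- html_nodes(doc, "td.td-val")

# Take the text of each cell and convert it to an integer vector.
nums <- as.integer(html_text(cells))

print(nums)  # integer vector holding 13 and 9
```

Selecting by tag and class rather than by surrounding characters means stray ">" or "<" elsewhere in the markup should not produce false matches.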

Answer 1 (score: 2)

You can import the string with readLines from the raw URL, which you get by clicking the Raw button on the gist.

mystring <- readLines("https://gist.githubusercontent.com/anonymous/9de31de2e6fc9888f3debeda4698b739/raw/c07c2d6c6f00060806b15ec57ed06d4a4e0d9d74/gistfile1.txt")
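
Reading the text from a file or URL also sidesteps the + prompt from the question, because quote characters in the input are read as plain data and never close an R string literal. A small sketch of that behaviour, using a hypothetical temporary file in place of the gist URL:

```r
# Hypothetical local file standing in for the downloaded gist text.
path <- tempfile(fileext = ".txt")
writeLines(
  c("<td class='td-val ball-8'>13</td>",
    "<td class=\"td-val ball-8\">9</td>"),
  path
)

# readLines() returns one character element per line of the file;
# single and double quotes in the file are just ordinary characters.
lines <- readLines(path)

# Collapse into a single string if later steps expect one.
mystring <- paste(lines, collapse = "")

length(lines)  # number of lines read from the file
```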

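A regular expression along the following lines should then give you all the numbers you need:

library(stringr)
# Extract every ">digits<" match, then strip the surrounding ">" and "<".
num <- gsub(">|<", "", str_extract_all(mystring, ">\\d+<", simplify = TRUE))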

head(as.vector(num))
[1] "13" "9"  "8"  "8"  "1"  "2"
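
The result above is a character vector, while the question asks for numbers. As a hedged alternative, here is a base R sketch that matches only the digits (using lookarounds, so no gsub step is needed) and converts them with as.integer. The short fragment is the example from the question, used here as a stand-in for the full string.

```r
# Example fragment from the question, standing in for the full string.
mystring <- "><td class='td-val ball-8'>13</td><td class='td-val ball-8'>9</td>"

# Match runs of digits that sit directly between ">" and "<".
# perl = TRUE enables the lookbehind (?<=>) and lookahead (?=<) syntax.
matches <- regmatches(
  mystring,
  gregexpr("(?<=>)\\d+(?=<)", mystring, perl = TRUE)
)

# regmatches() returns one list element per input string;
# unlist() concatenates them in input order, then convert to integers.
nums <- as.integer(unlist(matches))

print(nums)  # integer vector holding 13 and 9
```

Because unlist() walks the list element by element, the order of the numbers follows the order of the input lines even when mystring has more than one element.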