Question

I am doing some web scraping, and this is the format of the data:

Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade

The actual string I receive looks like this:

1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M

I am interested in Course_Code, Course_Name and Grade. In this example the values would be:

Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A

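Is there a way to extract this information easily, using a regular expression or some other technique, instead of parsing the string by hand? I am using JRuby in 1.9 mode.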

Answer 1

Let's use Ruby's named captures and a self-describing regex!

course_line = /
    ^                  # Starting at the front of the string
    (?<SrNo>\d+)       # Capture one or more digits; call the result "SrNo"
    \s+                # Eat some whitespace
    (?<Code>\S+)       # Capture all the non-whitespace you can; call it "Code"
    \s+                # Eat some whitespace
    (?<Name>.+\S)      # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
    \s+                # Eat some whitespace
    (?<Credit>\S+)     # Capture all the non-whitespace you can; call it "Credit"
    \s+                # Eat some whitespace
    (?<Grade>\S+)      # Capture all the non-whitespace you can; call it "Grade"
    \s+                # Eat some whitespace
    (?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
    $                  # Make sure that we're at the end of the line now
/x

str = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
parts = str.match(course_line)

puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
      Grade: #{parts['Grade']}".strip

#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=>       Grade: A
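
When scraping, some lines (such as the header row) will not fit the pattern; `String#match` then returns nil, and indexing nil raises an error. A minimal sketch of guarding against that, assuming a compact one-line version of the same regex; the names `COURSE_LINE` and `parse_course` are hypothetical and not part of the original answer:

```ruby
# Compact form of the named-capture regex shown above (illustrative).
COURSE_LINE = /\A(?<SrNo>\d+)\s+(?<Code>\S+)\s+(?<Name>.+\S)\s+(?<Credit>\S+)\s+(?<Grade>\S+)\s+(?<Attendance>\S+)\z/

# Hypothetical helper: returns the three fields of interest as a hash,
# or nil when the line does not look like a course row.
def parse_course(line)
  m = COURSE_LINE.match(line.strip)
  return nil unless m
  { code: m['Code'], name: m['Name'], grade: m['Grade'] }
end

p parse_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
p parse_course("Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade")
```

The second call returns nil because the header row does not start with digits, so the caller can simply skip such lines.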

Answer 2

Just for fun:

str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
tok = str.split(/\s+/)
# Take the fixed fields off both ends; whatever remains in the middle is the name.
data = {'Sr.No.' => tok.shift, 'Course_Code' => tok.shift,
        'Attendance_Grade' => tok.pop, 'Grade' => tok.pop, 'Credit' => tok.pop,
        'Course_Name' => tok.join(' ')}

Answer 3

Am I seeing correctly that the delimiter is always 3 spaces? Then it is just:

serial_number, course_code, course_name, credit, grade, attendance_grade = 
  the_string.split('   ')

Answer 4

Assuming that everything except the course name consists of a single word, and that there is no leading or trailing whitespace:

/^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/

Your example string will produce the following match groups:

1.  1
2.  CA727
3.  PRINCIPLES OF COMPILER DESIGN
4.  3
5.  A
6.  M
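
A minimal sketch of applying this pattern in Ruby, assuming the single-space sample string from the question; the variable names are illustrative only:

```ruby
pattern = /^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/
str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"

if (m = pattern.match(str))
  # MatchData#captures returns the six groups in order.
  _sr_no, code, name, _credit, grade, _attendance = m.captures
  puts "Course_Code : #{code}"
  puts "Course_Name : #{name}"
  puts "Grade : #{grade}"
end
```

Because `[\w\s]+` is greedy, with wider separators the name group can pick up trailing whitespace, so calling `strip` on it may be worthwhile.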

Answer 5

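This answer is not very idiomatic Ruby, because in this case I think clarity is better than cleverness. All you really need to do to solve the problem you described is split your line on whitespace: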

line = '1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M'
array = line.split(/\t|\s{2,}/)    # split on a tab or on a run of 2+ whitespace characters
puts array[1], array[2], array[4]  # Course_Code, Course_Name, Grade

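This assumes your data is regular. If it is not, you will need to work harder on the regex, and possibly handle edge cases where a line does not have the expected number of fields.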
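
One way to handle lines without the expected number of fields is to check the count before indexing. A minimal sketch, assuming six fields per row and the same tab-or-multiple-spaces delimiter as above; `EXPECTED_FIELDS` and `extract_fields` are hypothetical names introduced for illustration:

```ruby
EXPECTED_FIELDS = 6 # assumption: Sr.No., code, name, credit, grade, attendance grade

# Hypothetical helper: returns the fields of interest, or nil for a
# line that does not split into exactly EXPECTED_FIELDS pieces.
def extract_fields(line)
  fields = line.strip.split(/\t|\s{2,}/)
  return nil unless fields.size == EXPECTED_FIELDS
  { code: fields[1], name: fields[2], grade: fields[4] }
end

good = extract_fields('1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M')
bad  = extract_fields('1   CA727   3   A   M') # course name missing

puts good[:code], good[:name], good[:grade]
puts "skipped malformed line" if bad.nil?
```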

Note for posterity

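The OP changed the input string and modified the delimiter to a single space between fields. I will leave my answer to the original question as it is (including the original input string, for reference), since it may help others besides the OP who have a similar problem.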
