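How to extract certain text from a string

Time: 2013-12-29 16:05:18

Tags: vb.net string pdf for-loop contains

How do I filter/extract text from a string?

I have converted a PDF file to a String using iTextSharp, and I display the text in RichTextBox1.

However, the RichTextBox contains a lot of irrelevant text that I don't need. Is there a way to display only the text I want, based on keywords, showing the full token that contains each keyword?

Example of the text displayed in RichTextBox1 after the PDF-to-text conversion: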

**774**
**Bos00232940
Bos00320491
Das1234
Das3216**
RAGE

So the keywords would be "Bos", "Das", and "774". RichTextBox1 should then show only the new text below, rather than the entire text above.

*Bos00232940
Bos00320491
Das1234
Das3216
774*

Here is what I have so far. It doesn't work: it still displays the entire PDF text in the RichTextBox.

Public Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim pdffilename As String
    pdffilename = TextBox1.Text
    Dim filepath = "c:\temp\" & TextBox1.Text & ".pdf"
    Dim thetext As String
    thetext = GetTextFromPDF(filepath)
    Dim lines() As String = System.Text.RegularExpressions.Regex.Split(thetext, Environment.NewLine)
    Dim keywords As New List(Of String)
    keywords.Add("Bos")
    keywords.Add("Das")
    keywords.Add("774")
    Dim newTextLines As New List(Of String)
    For Each line As String In lines
        For Each keyw As String In thetext

            If line.Contains(keyw) Then
                newTextLines.Add(line)
                Exit For
            End If
        Next
    Next
    RichTextBox1.Text = String.Join(Environment.NewLine, newTextLines.ToArray)
End Sub

Thanks for the help, everyone. Below is the working code, which does exactly what I wanted.

Public Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim pdffilename As String
    pdffilename = TextBox1.Text
    Dim filepath = "c:\temp\" & TextBox1.Text & ".pdf"
    Dim thetext As String
    thetext = GetTextFromPDF(filepath)

    Dim re As New Regex("[\t ](?<w>((774)|(Bos)|(Das))[a-z0-9]*)[\t ]", RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase Or RegexOptions.Compiled)
    Dim Lines() As String = {thetext}
    Dim words As New List(Of String)
    For Each s As String In Lines
        Dim mc As MatchCollection = re.Matches(s)
        For Each m As Match In mc
            words.Add(m.Groups("w").Value)
        Next
    Next
    RichTextBox1.Text = String.Join(Environment.NewLine, words.ToArray)
End Sub

3 answers:

Answer 0 (score: 1)

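' Assumes keywords and newTextLines are declared as in the question.
For Each Word As String In thetext.Split(" "c)
    For Each key As String In keywords
        If Word.StartsWith(key) Then
            newTextLines.Add(Word)
            Exit For ' Stop at the first matching keyword so the word is added only once.
        End If
    Next
Next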

Or using LINQ:

Dim q = From word In thetext.Split(" "c)
        Where keywords.Any(Function(s) word.StartsWith(s))
        Select word

RichTextBox1.Text = String.Join(Environment.NewLine, q.ToArray())
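For reference, here is a minimal self-contained sketch of the prefix-filter approach. It also points at the likely cause of the original problem: the question's inner loop iterates over thetext (that is, over its characters, assuming Option Strict is off) instead of over keywords, so every non-empty line passes the Contains test. The module name KeywordFilter, the function FilterByPrefix, and the sample string are illustrative assumptions, not part of the original posts. The text is split on any whitespace because a PDF extractor may not emit Environment.NewLine between lines.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module KeywordFilter
    ' Returns every whitespace-separated token that starts with one of the keywords,
    ' in the order the tokens appear in the input.
    Function FilterByPrefix(input As String, keywords As IEnumerable(Of String)) As List(Of String)
        ' A Nothing separator array makes Split break on any whitespace
        ' (spaces, tabs, line breaks); empty entries are dropped.
        Dim tokens = input.Split(CType(Nothing, Char()), StringSplitOptions.RemoveEmptyEntries)
        Return tokens.
            Where(Function(t) keywords.Any(Function(k) t.StartsWith(k, StringComparison.Ordinal))).
            ToList()
    End Function

    Sub Main()
        ' Hypothetical sample modelled on the text quoted in this thread.
        Dim sample = "SHEET 1 OF 1 774 SIZE EX Bos00232940 Bos00320491" &
                     Environment.NewLine &
                     "Das1234 Das3216 DETAILS 1 2 RAGE"
        For Each token In FilterByPrefix(sample, {"Bos", "Das", "774"})
            Console.WriteLine(token)
        Next
    End Sub
End Module
```

In the form's click handler, the result could be assigned with RichTextBox1.Text = String.Join(Environment.NewLine, FilterByPrefix(thetext, keywords)). The comparison here is case-sensitive; StringComparison.OrdinalIgnoreCase would relax that.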

Answer 1 (score: 1)

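If you don't know the keywords in advance but do know the context in which they appear, you can find them with a regular expression. Two very handy regex constructs let you find a match that follows or precedes another pattern:

(?<=prefix)find finds a pattern that follows another pattern.

find(?=suffix) finds a pattern that precedes another pattern.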
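A small sketch of the two constructs in isolation may help. The module name, the function names, and the "width=120px" sample are made up for illustration. The point it shows is that the prefix or suffix must be present for a match but is not included in the returned value.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaroundDemo
    ' Lookbehind: digits that directly follow the literal prefix.
    Function DigitsAfter(input As String, prefix As String) As String
        Return Regex.Match(input, "(?<=" & Regex.Escape(prefix) & ")\d+").Value
    End Function

    ' Lookahead: digits that directly precede the literal suffix.
    Function DigitsBefore(input As String, suffix As String) As String
        Return Regex.Match(input, "\d+(?=" & Regex.Escape(suffix) & ")").Value
    End Function

    Sub Main()
        Const sample As String = "width=120px"
        Console.WriteLine(DigitsAfter(sample, "width="))  ' prints 120
        Console.WriteLine(DigitsBefore(sample, "px"))     ' prints 120
    End Sub
End Module
```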

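If your numeric keyword (774) always comes right before "SIZE", you can find it like this: \w+(?=\sSIZE)

If the other keywords are always between "EX" and "DETAILS", you can find them like this: (?<=EX\s)(\w+\s)+(?=DETAILS)

You can put the whole thing together: \w+(?=\sSIZE)|(?<=EX\s)(\w+\s)+(?=DETAILS)

The downside is that the keywords between "EX" and "DETAILS" are returned as a single match. But you can split the match afterwards, like this: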

Const input As String = "2 3 3 4 4 A A B B SHEET 1 OF 1 774 SIZE SCALE 24.000-47.999 12.000-23.999 CON BAG WIRE 90in. EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"

Dim matches = Regex.Matches(input, "\w+(?=\sSIZE)|(?<=EX\s)(\w+\s)+(?=DETAILS)")
For Each m As Match In matches
    Dim words = m.Value.Split(" "c)
    For Each word As String In words
        If word.Length > 0 Then ' Suppress the last empty word.
            Console.WriteLine(word)
        End If
    Next
Next

Output:

774
Bos00232940
Bos00320491
Das1234
Das3216

Answer 2 (score: 0)

How about using a regular expression...

    Dim re As New Regex("[\t ](?<w>((774)|(Bos)|(Das))[a-z0-9]*)[\t ]", RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase Or RegexOptions.Compiled)

    Private Sub test()
        Dim Lines() As String = {"2 3 3 4 4 A A B B SHEET 1 OF 1 774 SIZE SCALE 24.000-47.999 12.000-23.999 CON BAG WIRE 90in. EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"}
        Dim words As New List(Of String)
        For Each s As String In Lines
            Dim mc As MatchCollection = re.Matches(s)
            For Each m As Match In mc
                words.Add(m.Groups("w").Value)
            Next
        Next
    End Sub

The regex broken down:

[\t ]     A single tab or space (\s could be used instead to match any whitespace)

(?<w>     Start of the capture group named "w". This is the text returned later via m.Groups("w")

((774)|(Bos)|(Das))     One of the three keyword prefixes

[a-z0-9]*        Any a-z or 0-9 character; * means zero or more of them

)         End of capture group "w" from above

[\t ]     A single trailing tab or space
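One caveat with this pattern: the trailing [\t ] is consumed as part of each match. When two keywords are separated by a single space, the second one has no leading space left to match and is skipped (in the sample line, Bos00320491 and Das3216). A possible variant, sketched here as an assumption rather than as the answer's original code, replaces both [\t ] delimiters with \b word boundaries, which match a position without consuming any characters. The module and function names are illustrative.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module BoundaryRegexDemo
    ' Same alternation as above, but delimited by zero-width \b instead of [\t ].
    Private ReadOnly Re As New Regex("\b(?<w>(774|Bos|Das)[a-z0-9]*)\b",
                                     RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase)

    ' Collects the named group "w" from every match, in order of appearance.
    Function ExtractKeywords(input As String) As List(Of String)
        Dim words As New List(Of String)
        For Each m As Match In Re.Matches(input)
            words.Add(m.Groups("w").Value)
        Next
        Return words
    End Function

    Sub Main()
        Dim sample = "1 774 SIZE EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS"
        ' Prints 774, Bos00232940, Bos00320491, Das1234 and Das3216, one per line.
        Console.WriteLine(String.Join(Environment.NewLine, ExtractKeywords(sample)))
    End Sub
End Module
```

Note that \b also treats punctuation as a boundary, so this variant is slightly more permissive than requiring a tab or space on each side.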