How do I filter/extract strings?
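I used iTextSharp to convert a PDF file to a String, and I display the text in RichTextBox1.
However, most of that text is irrelevant and I don't need it in the RichTextBox. Is there a way to show only the text I want, selected by keyword, out of the full text?
Example of the text displayed in RichTextBox1 after the PDF-to-text conversion: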
**774**
**Bos00232940
Bos00320491
Das1234
Das3216**
RAGE*
So the keywords would be "Bos", "Das", and "774". The new text shown in RichTextBox1 should be the following, instead of the whole text above.
*Bos00232940
Bos00320491
Das1234
Das3216
774*
This is what I have so far. But it doesn't work; it still shows the entire PDF in the RichTextBox.
Public Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim pdffilename As String
pdffilename = TextBox1.Text
Dim filepath = "c:\temp\" & TextBox1.Text & ".pdf"
Dim thetext As String
thetext = GetTextFromPDF(filepath)
Dim lines() As String = System.Text.RegularExpressions.Regex.Split(thetext, Environment.NewLine)
Dim keywords As New List(Of String)
keywords.Add("Bos")
keywords.Add("Das")
keywords.Add("774")
Dim newTextLines As New List(Of String)
For Each line As String In lines
For Each keyw As String In thetext
If line.Contains(keyw) Then
newTextLines.Add(line)
Exit For
End If
Next
Next
RichTextBox1.Text = String.Join(Environment.NewLine, newTextLines.ToArray)
End Sub
Thanks, everyone, for the help. Below is the working code, which does exactly what I want.
Public Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
Dim pdffilename As String
pdffilename = TextBox1.Text
Dim filepath = "c:\temp\" & TextBox1.Text & ".pdf"
Dim thetext As String
thetext = GetTextFromPDF(filepath)
Dim re As New Regex("[\t ](?<w>((774)|(Bos)|(Das))[a-z0-9]*)[\t ]", RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase Or RegexOptions.Compiled)
Dim Lines() As String = {thetext}
Dim words As New List(Of String)
For Each s As String In Lines
Dim mc As MatchCollection = re.Matches(s)
For Each m As Match In mc
words.Add(m.Groups("w").Value)
Next
Next
RichTextBox1.Text = String.Join(Environment.NewLine, words.ToArray)
End Sub
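Answer 0 (score: 1)
' Split the text into words and keep each word that starts with a keyword.
For Each Word As String In thetext.Split(" "c)
For Each key As String In keywords
If Word.StartsWith(key) Then
newTextLines.Add(Word)
' Stop checking further keywords so the word is added only once.
Exit For
End If
Next
Next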
Or use LINQ:
Dim q = From word In thetext.Split(" ")
Where keywords.Any(Function(s) word.StartsWith(s))
Select word
RichTextBox1.Text = String.Join(Environment.NewLine, q.ToArray())
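For reference, the loop in the question reads `For Each keyw As String In thetext`, so it iterates over the characters of the PDF text rather than over `keywords`, which would explain why every line passed the filter. Below is a minimal, self-contained sketch of the word-based approach. The module name `KeywordFilterSketch`, the function `FilterWords`, and the shortened sample string are made up for illustration. It also assumes the extracted text may contain tabs and line breaks, so it splits on those as well as on spaces.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports Microsoft.VisualBasic

Module KeywordFilterSketch
    ' Hypothetical helper: returns the whitespace-separated words
    ' that start with any of the given keywords, in text order.
    Function FilterWords(text As String, keywords As IEnumerable(Of String)) As List(Of String)
        ' Split on space, tab, CR and LF; drop empty entries caused by repeated separators.
        Dim separators As Char() = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        Dim tokens = text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
        Return tokens.Where(Function(w) keywords.Any(Function(k) w.StartsWith(k, StringComparison.Ordinal))).ToList()
    End Function

    Sub Main()
        ' Shortened sample, with a line break added to mimic multi-line PDF text.
        Dim sample As String = "1 774 SIZE EX Bos00232940 Bos00320491" & Environment.NewLine &
                               "Das1234 Das3216 DETAILS 1 2 RAGE"
        Dim keywords As New List(Of String) From {"Bos", "Das", "774"}
        For Each word As String In FilterWords(sample, keywords)
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

In the button handler, the result could be assigned with `RichTextBox1.Text = String.Join(Environment.NewLine, FilterWords(thetext, keywords).ToArray())`.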
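Answer 1 (score: 1)
If you don't know the keywords in advance but do know the context in which they appear, you can find them with a regular expression. Two very handy regex constructs let you find a pattern that follows or precedes another pattern: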
(?<=prefix)find
Finds a pattern that follows another one (lookbehind).
find(?=suffix)
Finds a pattern that precedes another one (lookahead).
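A small sketch of each construct on its own may help. The input strings are shortened fragments of the sample text, and the module name `LookaroundSketch` and helper `FirstMatch` are made up for illustration. The point is that the prefix and suffix are checked but are not part of the returned match.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LookaroundSketch
    ' Hypothetical helper: returns the text of the first match, or "" if there is none.
    Function FirstMatch(input As String, pattern As String) As String
        Return Regex.Match(input, pattern).Value
    End Function

    Sub Main()
        ' Lookbehind: the word right after "EX ", without "EX " itself.
        Console.WriteLine(FirstMatch("EX Bos00232940 DETAILS", "(?<=EX\s)\w+"))
        ' Lookahead: the word right before " SIZE", without " SIZE" itself.
        Console.WriteLine(FirstMatch("1 774 SIZE SCALE", "\w+(?=\sSIZE)"))
    End Sub
End Module
```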
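If your numeric keyword (774) always comes before "SIZE", you can find it like this: \w+(?=\sSIZE)
If the other keywords are always between "EX" and "DETAILS", you can find them like this: (?<=EX\s)(\w+\s)+(?=DETAILS)
You can combine the two into one pattern: \w+(?=\sSIZE)|(?<=EX\s)(\w+\s)+(?=DETAILS)
The drawback is that the keywords between "EX" and "DETAILS" are returned as a single match. But you can split that match afterwards, like this: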
Const input As String = "2 3 3 4 4 A A B B SHEET 1 OF 1 774 SIZE SCALE 24.000-47.999 12.000-23.999 CON BAG WIRE 90in. EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"
Dim matches = Regex.Matches(input, "\w+(?=\sSIZE)|(?<=EX\s)(\w+\s)+(?=DETAILS)")
For Each m As Match In matches
Dim words = m.Value.Split(" "c)
For Each word As String In words
If word.Length > 0 Then ' Suppress the last empty word.
Console.WriteLine(word)
End If
Next
Next
Output:
774
Bos00232940
Bos00320491
Das1234
Das3216
Answer 2 (score: 0)
How about using a regular expression...
Dim re As New Regex("[\t ](?<w>((774)|(Bos)|(Das))[a-z0-9]*)[\t ]", RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase Or RegexOptions.Compiled)
Private Sub test()
Dim Lines() As String = {"2 3 3 4 4 A A B B SHEET 1 OF 1 774 SIZE SCALE 24.000-47.999 12.000-23.999 CON BAG WIRE 90in. EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"}
Dim words As New List(Of String)
For Each s As String In Lines
Dim mc As MatchCollection = re.Matches(s)
For Each m As Match In mc
words.Add(m.Groups("w").Value)
Next
Next
End Sub
Regex breakdown...
[\t ] A single tab or space (\s would match any whitespace character instead)
(?<w> Start of the capture group named "w". This is the text returned later via m.Groups("w")
((774)|(Bos)|(Das)) One of the three keyword prefixes
[a-z0-9]* Any a-z or 0-9 character; * means zero or more of them
) End of capture group "w" from above
[\t ] A single trailing tab or space, which is consumed as part of the match
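One thing to watch for: the pattern consumes the space on both sides of a word, so when two keywords are separated by a single space, the second one has no leading space left to match and appears to be skipped. Here is a hedged variant that swaps the surrounding `[\t ]` for the zero-width word boundary `\b`. This is a substitute technique, not the pattern from the answer. Note that `\b` also matches next to punctuation, so a token like "x-774" would yield "774". The names `BoundarySketch` and `ExtractKeywords` are made up for illustration.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module BoundarySketch
    ' \b is zero-width, so the spaces between adjacent keywords are not consumed.
    Private ReadOnly KeywordRegex As New Regex("\b(?<w>(774|Bos|Das)[a-z0-9]*)\b",
        RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase)

    ' Hypothetical helper: returns every word that starts with one of the keyword prefixes.
    Function ExtractKeywords(text As String) As List(Of String)
        Dim words As New List(Of String)
        For Each m As Match In KeywordRegex.Matches(text)
            words.Add(m.Groups("w").Value)
        Next
        Return words
    End Function

    Sub Main()
        Const sample As String = "SHEET 1 OF 1 774 SIZE SCALE EX Bos00232940 Bos00320491 Das1234 Das3216 DETAILS 1 2 RAGE"
        For Each word As String In ExtractKeywords(sample)
            Console.WriteLine(word)
        Next
    End Sub
End Module
```

If the tab-or-space requirement matters, lookarounds such as `(?<=^|[\t ])` and `(?=[\t ]|$)` would be another way to test for it without consuming the separator.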