Question

I want to extract some data, such as email addresses, from tables in PDF files, and then use the extracted addresses to send emails to those people.

What I have found so far by searching the web:

I have to convert the PDF file to Excel so that I can read the data easily and use it as needed.
I found some free DLLs such as iTextSharp and PDFsharp.

But I have not found any code snippet that shows how to do this in C#. Is there a way to do it?

Answer 1

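You definitely do not have to convert the PDF to Excel. First, determine whether your PDF contains text data or scanned images. If it contains text data, then you are on the right track with "some free DLLs". I recommend iTextSharp because it is popular and easy to use.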

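Now the controversial part. If you do not need a rock-solid solution, the simplest approach is to read the whole PDF into a string and then pull out the emails with a regular expression.
Here is an (imperfect) example that reads a PDF with iTextSharp and extracts the emails: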

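using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Reads every page of the PDF and returns the text as one string.
public string PdfToString(string fileName)
{
    var sb = new StringBuilder();
    var reader = new PdfReader(fileName);
    try
    {
        // iTextSharp page numbers start at 1.
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            var strategy = new SimpleTextExtractionStrategy();
            string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
            // Round-trips the text through the system default encoding into UTF-8.
            text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
            sb.Append(text);
        }
    }
    finally
    {
        reader.Close(); // release the file even if extraction throws
    }
    return sb.ToString();
}

// Adjust the expression as needed: this one assumes each email sits
// between the labels "Email Address" and "Passport No".
Regex emailRegex = new Regex("Email Address (?<email>.+?) Passport No");

public IEnumerable<string> ExtractEmails(string content)
{
    foreach (Match m in emailRegex.Matches(content))
    {
        yield return m.Groups["email"].Value;
    }
}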
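
The expression above depends on the labels that surround each address in one particular PDF. If your layout differs, a layout-independent pattern may be easier to start from. The following is only a sketch: the class name `EmailExtractor`, the method `Extract`, and the deliberately loose pattern are illustrative assumptions, not part of iTextSharp, and the pattern does not attempt full RFC-compliant validation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of any library): finds email-like tokens in plain text.
public static class EmailExtractor
{
    // Loose pattern: local part, '@', one or more domain labels, then a TLD of 2+ letters.
    private static readonly Regex GenericEmail = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}",
        RegexOptions.Compiled);

    // Returns each distinct address once (case-insensitive), in order of first appearance.
    public static List<string> Extract(string content)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (Match m in GenericEmail.Matches(content))
        {
            if (seen.Add(m.Value))
            {
                result.Add(m.Value);
            }
        }
        return result;
    }
}
```

It could be combined with the method above as `EmailExtractor.Extract(PdfToString(fileName))`. Text extraction sometimes inserts line breaks or spaces inside a table cell, so some addresses may still be split and need manual checking.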

Answer 2

With the Bytescout PDF Extractor SDK, we can extract an entire page to CSV as shown below.

CSVExtractor extractor = new CSVExtractor();
extractor.RegistrationName = "demo";
extractor.RegistrationKey = "demo";

TableDetector tdetector = new TableDetector();
tdetector.RegistrationKey = "demo";
tdetector.RegistrationName = "demo";

// Load the document
extractor.LoadDocumentFromFile("C:\\sample.pdf");
tdetector.LoadDocumentFromFile("C:\\sample.pdf");

int pageCount = tdetector.GetPageCount();

for (int i = 1; i <= pageCount; i++)
{
    int j = 1;

    do
    {
        // Use the whole page rectangle as the extraction area
        extractor.SetExtractionArea(
            tdetector.GetPageRect_Left(i),
            tdetector.GetPageRect_Top(i),
            tdetector.GetPageRect_Width(i),
            tdetector.GetPageRect_Height(i)
        );

        // and finally save the table into CSV file
        extractor.SavePageCSVToFile(i, "C:\\page-" + i + "-table-" + j + ".csv");
        j++;
    } while (tdetector.FindNextTable()); // search next table
}

How to convert a PDF file to Excel in C#
