Question

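I have 278 HTML files containing essays from different students. Each file contains the student's ID, first name, and last name in the following format: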

<p>Student ID: 000000</p>
<p>First Name: John</p>
<p>Last Name: Doe</p>

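I am trying to extract the student ID from all of these files. Is there a way to extract the data between X and Y? With X being "<p>Student ID:" and Y being "</p>", that should give us the ID.

What method/language/concept/software would you suggest for getting this done?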

Answer 1

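I suggest using a Python script. If this is your first time using Python, that's fine: Python is a simple scripting language, and there are plenty of references to be found on Google.

1) Language: Python (version 2.7)

2) Library: beautifulsoup (you can download it with pip; pip is a package manager that can be installed together with Python by the Python installer)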

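Iterate over the files one by one, open each local file, and parse the HTML content with beautifulsoup (see this section: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag).

Then extract the content of the <p> tag. It returns "Student ID: 000000".

Split this string on ":". This returns str[0] and str[1].

str[1] is the student ID you want (you may want to remove the surrounding whitespace: calling ' Hello '.strip() -> 'Hello').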
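The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions, not the answerer's exact code: it is written for Python 3 rather than 2.7, and to stay dependency-free it uses the standard library's html.parser in place of beautifulsoup (with beautifulsoup, the paragraph collection would be a loop over soup.find_all("p")). The names ParagraphCollector, extract_student_id, and extract_all are made up for this example, and the files are assumed to end in .html and to be readable as UTF-8.

```python
from html.parser import HTMLParser
from pathlib import Path


class ParagraphCollector(HTMLParser):
    """Collects the text content of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf))
            self._in_p = False


def extract_student_id(html):
    """Return the ID from the first 'Student ID: ...' paragraph, or None."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    for text in parser.paragraphs:
        # Split on the first ":" only, then strip whitespace from both parts.
        label, sep, value = text.partition(":")
        if sep and label.strip() == "Student ID":
            return value.strip()
    return None


def extract_all(directory):
    """Map each *.html file name in `directory` to its student ID (or None)."""
    ids = {}
    for path in sorted(Path(directory).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        ids[path.name] = extract_student_id(html)
    return ids


if __name__ == "__main__":
    sample = "<p>Student ID: 000000</p>\n<p>First Name: John</p>"
    print(extract_student_id(sample))
```

Run as a script, this prints 000000 for the sample markup from the question. Calling extract_all on the folder that holds the 278 files would return a dictionary from file name to ID, with None for any file that has no matching paragraph.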

If you need help, just reply.

Answer 2

Using Java:

import java.io.File;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StudentIDsRetriever {

    public static void main(String[] args) throws IOException {
        File dir = new File("htmldir");
        String[] htmlFiles = dir.list();
        List<String> studentIds = new ArrayList<>();
        List<String> emailDs = new ArrayList<>();
        for (String htmlFile : htmlFiles) {
            Path path = FileSystems.getDefault().getPath("htmldir", htmlFile);
            List<String> lines = Files.readAllLines(path);
            for (String str : lines) {
                if (str.contains("<p>Student ID:")) {
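                    // Cut the line down to the part starting at the opening marker,
                    // then take everything up to the closing </p> and trim whitespace.
                    String idTag = str.substring(str.indexOf("<p>Student ID:"));
                    String id = idTag.substring("<p>Student ID:".length(), idTag.indexOf("</p>")).trim();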
                    System.out.println("Id is "+id);
                    studentIds.add(id);
                }

                if (str.contains("@") && (str.contains(".com") || str.contains(".co.in"))) {
                    String[] words = str.split(" ");
                    for (String word : words) 
                        if (word.contains("@") && (word.contains(".com") || word.contains(".co.in"))) 
                            emailDs.add(word);
                }

            }
        }
        System.out.println("Student list is "+studentIds);
        System.out.println("Student email list is "+emailDs);
    }
}

P.S.: This works with Java 8+ (the single-argument Files.readAllLines(Path) used above was added in Java 8).

Extracting data from multiple files
