Question

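I have a few million custom txt files, each generated with content like the following. I was previously using Ruby (Nokogiri) to parse these files one by one, extract their content, and store it in a database.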

    <doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">
     ...
     ...
     ...
      few hundred lines of text
     ...

     </doc>

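However, Ruby seems to be too slow: running this as a single process would take more than two weeks to get through the vast majority of these article files. So I am trying to extract the required data directly with shell commands and skip Ruby entirely. I am still naively using regular expressions, though.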

So far, I have been able to extract this data:

     informations=`grep -E '<doc' F1.txt`
     id=`echo $informations | grep -Po '\bid="[0-9]+"' | grep -Eo '[0-9]+'`
     url=`echo $informations | grep -Po 'https?:\/\/(.*?)([A-Za-z]|[.]|[\/]|[?]|[=]|[0-9])*'`
     title=`echo $informations | grep -Po '(?<=title=").*(?=">)'`

But I also need to capture everything between the doc tags.

     body=`a command to take those few hundred lines between the two doc tags`

I tried using the regex /(?<=">)(.)*(?=<\/doc>)/m with grep, as grep -Po '(?<=">)(.)*(?=<\/doc>)' F1.txt, but it does not return any matches. Any suggestions on how to get this done?

Answer 1

Use this:

<doc.*?</doc>

Update:

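 grep -Pzo '<doc(.|\n)*?</doc>' file.txt

Use the -P option to enable Perl-compatible regular expressions. With GNU grep, -z is also needed: grep normally matches one line at a time, so \n can never match unless the whole file is read as a single record. -o prints only the matched part.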
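
If only the text between the tags is wanted, without the tags themselves, the same idea can be sketched as follows. This assumes GNU grep built with PCRE support, one doc element per file, and no > character inside the opening tag's attributes; the sample file and the variable names are made up for illustration.

```shell
# Hypothetical sample file standing in for one of the article files.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">
first line of text
second line of text
</doc>
EOF

# -z    : read the file as one record so the pattern can span newlines
# (?s)  : let . match newline characters
# \K    : discard the opening tag from the reported match
# (?=…) : stop before the closing tag without including it
# tr removes the NUL byte that -z appends to each match.
body=$(grep -Pzo '(?s)<doc[^>]*>\s*\K.*?(?=\s*</doc>)' "$sample" | tr -d '\0')

printf '%s\n' "$body"
rm -f "$sample"
```

The \s* on both sides trims the whitespace around the body, so an indented closing tag or a blank line before it does not end up in the result.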

Answer 2

awk '/<doc/,/<\/doc>/' YourFile

This will stop at the first match, that is, each printed range ends at the first line containing </doc>.
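
The range pattern prints the tag lines as well. A minimal sketch of a variant that prints only the lines between them is shown here; it assumes the opening and closing tags each sit on their own line, and the sample file and the names inside and body_awk are made up for illustration.

```shell
# Hypothetical sample file standing in for one of the article files.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<doc id="12" url="http://en.wikipedia.org/wiki?curid=12" title="Anarchism">
first line of text
second line of text
</doc>
EOF

# Rule order matters:
#   1. the closing tag switches printing off before the print rule runs
#   2. lines are printed only while inside a doc element
#   3. the opening tag switches printing on after the print rule has run
# so neither tag line reaches the output.
body_awk=$(awk '/<\/doc>/ {inside = 0} inside {print} /<doc / {inside = 1}' "$sample")

printf '%s\n' "$body_awk"
rm -f "$sample"
```

This reads the file once and needs no PCRE support, so it should work with any POSIX awk.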

Regex to extract everything between two tags
