Parsing a text file to extract selected sections of text

Date: 2018-07-31 20:25:39

Tags: ruby regex parsing

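I have a text file from which I want to pull out a block of text and split it into two arrays, one for the ingredients and one for the directions.

For the ingredients I can do something like the following, but there is no guarantee that the result is complete.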

ingredients = []
list.each_line do |l|
  ingredients << l if l =~ /\d\s?\w.*/
end
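To illustrate why this pattern is unreliable, here is a minimal sketch that runs it over a few lines copied from the blob. The `sample` string is a shortened, hypothetical stand-in for the real file. The pattern only asks for a digit followed by a word character somewhere in the line, so it also picks up the numeric IDs, the timestamp, and any direction that mentions a quantity, and it would miss an ingredient written without a number.

```ruby
# Hypothetical sample: a few lines taken from the blob in this question.
sample = <<~TEXT
  635860
  2011-03-21T13:50:10Z
  == Ingredients ==
  1lb black beans
  2 tsp salt
  == Directions ==
  Soak the black beans for 2 hours and drain.
  Reduce to a simmer.
TEXT

# Same pattern as above: a digit, an optional space, then a word character.
matched = sample.each_line.select { |l| l =~ /\d\s?\w.*/ }.map(&:chomp)

puts matched
```

Only the two ingredient lines are wanted here, but the ID, the timestamp, and the "Soak ... 2 hours" direction are selected as well.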

Here is the text blob:

635860
581543
2011-03-21T13:50:10Z

Image:black bean soup.jpg|right|Mexican Black Bean Soup

== Ingredients ==
1lb black beans
2 tbsp extra-virgin olive oil
2 onions, large, diced
6 cloves garlic, minced
1 cup tomato, peeled, seeded, and chopped (fresh or canned)
1 sprig epazote, fresh or dried (optional)
1 tbsp chipotle pepper|chipotle chiles, canned, chopped (or ¼ tsp cayenne)
1 tsp cumin, ground
1 tsp coriander seed|coriander, ground
2 tsp salt

== Directions ==
Soak the black beans for 2 hours and drain.
In a deep pot, heat the olive oil over medium heat.
Add the onions and cook about 5 minutes.
Until translucent.
Add the black beans|beans, garlic, and 6 cups cold water.
Bring to a boil, skimming any foam that rises to the surface.
Reduce to a simmer.
In an hour or when the black beans|beans are soft, add the tomato, epazote, chipotle chile peppers|chile, cumin, coriander, and salt.
Continue cooking until the black beans|beans start to break down and the broth begins to thicken.
Taste for seasoning and add salt and pepper if needed.
If you’re serving this soup immediately, you may want to thicken it by puréeing a cup or two of the black beans|beans in a blender or food processor and then recombining them with the rest of the soup.
The soup will thicken on its own if refrigerated overnight.

Category:Black bean Recipes
Category:Chile pepper Recipes
Category:Chipotle pepper Recipes
Category:Epazote Recipes
Category:Mexican Soups
Category:Tomato Recipes
bx0ztz9xbf8qr9z4gwkad26u6q3hly3

1 Answer:

Answer 0 (score: 1)

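What I would do here is, rather than trying to match the data you probably don't control, match the data it looks like you do control. Specifically, it looks to me like the == Ingredients == and == Directions == lines, and lines such as Category:Tomato Recipes, are probably part of the file format rather than something a user typed in. So I would split the text whenever you see a line like that: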

sections = list.each_line.slice_before do |line|
  line.match?(/\A(==|[a-zA-Z]+:)/)
end.entries

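Then you can use assoc to pull a group out by its header line (assoc compares its argument with the first element of each group):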

puts sections.assoc("== Ingredients ==\n")
puts '---'
puts sections.assoc("== Directions ==\n")
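To get from these groups to the two arrays the question asks for, one option is to drop the header line and the blank lines from each group. The following is a minimal sketch, not the only way to do it: `section_items` is a hypothetical helper name, and `list` here is a shortened copy of the blob rather than the real file contents.

```ruby
# Shortened copy of the blob from the question (assumed to be in `list`).
list = <<~TEXT
  635860
  Image:black bean soup.jpg|right|Mexican Black Bean Soup

  == Ingredients ==
  1lb black beans
  2 tsp salt

  == Directions ==
  Soak the black beans for 2 hours and drain.
  Reduce to a simmer.

  Category:Mexican Soups
TEXT

# Start a new group at every "==" header line or "Word:" prefix line.
sections = list.each_line.slice_before do |line|
  line.match?(/\A(==|[a-zA-Z]+:)/)
end.entries

# Hypothetical helper: find a group by its header line, drop the header,
# then strip whitespace and discard blank lines from the rest.
def section_items(sections, header)
  _header, *body = sections.assoc("#{header}\n") || []
  body.map(&:strip).reject(&:empty?)
end

ingredients = section_items(sections, "== Ingredients ==")
directions  = section_items(sections, "== Directions ==")

p ingredients
p directions
```

Because assoc matches the header string exactly, including the trailing newline, a file with trailing spaces or Windows line endings on the header lines would need the headers normalized first.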

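This still has some flaws (if a user enters something like Note: Preheat oven first as part of the directions, the text will be split there as if that line were metadata), but it should be a big step forward, and it can be tweaked from here.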
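One possible tweak for the Note: problem is to split only on prefixes that are known to be metadata. The sketch below assumes that Image and Category are the only such prefixes, which is a guess based on the sample blob and not something the file format is known to guarantee. `BOUNDARY` is a hypothetical constant name.

```ruby
# Assumption: "Image:" and "Category:" are the only metadata prefixes.
BOUNDARY = /\A(?:==|(?:Image|Category):)/

# Hypothetical input with a user-written "Note:" line inside the directions.
list = <<~TEXT
  == Directions ==
  Note: Preheat oven first.
  Reduce to a simmer.
  Category:Mexican Soups
TEXT

# Split only at "==" headers and the known metadata prefixes.
sections = list.each_line.slice_before { |line| line.match?(BOUNDARY) }.entries

p sections.assoc("== Directions ==\n")
```

With the narrower pattern the Note: line stays inside the directions group. The trade-off is that any new metadata prefix has to be added to the list by hand.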