I'm working on a project that needs a large dataset. I found one that is big enough (an editions dump, about 5 GB), formatted like this:
/type/edition /books/OL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
/type/edition /books/OL10000179M 4 2010-04-24T17:54:01.503315 {"publishers": ["Stationery Office"], "physical_format": "Hardcover", "subtitle": "26 January - 4 February 1998", "title": "Parliamentary Debates, House of Lords, 1997-98", "isbn_10": ["0107805855"], "identifiers": {"goodreads": ["2862283"]}, "isbn_13": ["9780107805852"], "edition_name": "5th edition", "languages": [{"key": "/languages/eng"}], "number_of_pages": 124, "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "latest_revision": 4, "key": "/books/OL10000179M", "authors": [{"key": "/authors/OL2645811A"}], "publish_date": "January 1999", "works": [{"key": "/works/OL7925994W"}], "type": {"key": "/type/edition"}, "subjects": ["Bibliographies, catalogues, discographies", "POLITICS & GOVERNMENT", "Reference works", "Bibliographies & Indexes", "Reference"], "revision": 4}
etc...
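I want to extract the JSON part (the fifth field).
I'm trying to use str.replace() (on a 50-line subset of the large file), but it's being finicky. I thought something like this would work, but it doesn't (nothing gets changed or replaced):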
with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(".*({.*})$", "\1"), end="")
I then tried to parse it column by column (one regex identifying each column) and ran into something that confuses me. The following code
with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace("/type/edition\t/books/", "WORK PLZ"), end="")
produces
WORK PLZOL10000135M 4 2010-04-24T17:54:01.503315 {"publishers": ["Bernan Press"], "physical_format": "Hardcover", "subtitle": "9th November - 3rd December, 1992", "key": "/books/OL10000135M", "title": "Parliamentary Debates, House of Lords, Bound Volumes, 1992-93", "identifiers": {"goodreads": ["6850240"]}, "isbn_13": ["9780107805401"], "languages": [{"key": "/languages/eng"}], "number_of_pages": 64, "isbn_10": ["0107805405"], "publish_date": "December 1993", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "authors": [{"key": "/authors/OL2645777A"}], "latest_revision": 4, "works": [{"key": "/works/OL7925046W"}], "type": {"key": "/type/edition"}, "subjects": ["Government - Comparative", "Politics / Current Events"], "revision": 4}
WORK PLZOL10000179M 4 2010-04-24T17:54:01.503315 {"publishers": ["Stationery Office"], "physical_format": "Hardcover", "subtitle": "26 January - 4 February 1998", "title": "Parliamentary Debates, House of Lords, 1997-98", "isbn_10": ["0107805855"], "identifiers": {"goodreads": ["2862283"]}, "isbn_13": ["9780107805852"], "edition_name": "5th edition", "languages": [{"key": "/languages/eng"}], "number_of_pages": 124, "last_modified": {"type": "/type/datetime", "value": "2010-04-24T17:54:01.503315"}, "latest_revision": 4, "key": "/books/OL10000179M", "authors": [{"key": "/authors/OL2645811A"}], "publish_date": "January 1999", "works": [{"key": "/works/OL7925994W"}], "type": {"key": "/type/edition"}, "subjects": ["Bibliographies, catalogues, discographies", "POLITICS & GOVERNMENT", "Reference works", "Bibliographies & Indexes", "Reference"], "revision": 4}
but
with fileinput.input(files=("testData.txt"), inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace("/type/edition\t/books/\w+", "WORK PLZ"), end="")
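does nothing. It seems that \w+ is not picking up the alphanumeric string after /books/.
Am I doing something wrong with the regex? Is there a better way to approach this?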
Answer 0 (score: 1)
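(As mentioned in the comments) str.replace does not understand regular expressions; it only replaces literal substrings. That explains why your code fails: the text "\w+" never appears literally in the line, so nothing is replaced.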
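For comparison, the kind of pattern the question passes to str.replace does work with re.sub. This is only a minimal sketch, assuming the JSON object is the last thing on each line; the sample line is shortened and made up for illustration, and the pattern is a slightly adjusted version of the one in the question (braces escaped, trailing newline allowed):

```python
import re

# Shortened, made-up sample line in the same layout as the dump (assumption).
line = '/type/edition\t/books/OL1M\t4\t2010-04-24T17:54:01.503315\t{"title": "Example", "revision": 4}\n'

# str.replace looks for the pattern as literal text, so nothing matches
# and the line comes back unchanged.
unchanged = line.replace(r".*({.*})$", r"\1")

# re.sub treats the pattern as a regular expression. The raw-string
# replacement r"\1" refers to the first capture group (the JSON object).
json_part = re.sub(r"^.*?(\{.*\})\s*$", r"\1", line)
```

Note the raw strings: in a normal string literal, "\1" is the control character with code 1 and not a backreference.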
I would partition the string (assuming there is no { character before the JSON string) and then parse the result as JSON:
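import json

# Iterate over the file line by line so the whole dump is never held in memory.
with open("test.txt") as f:
    for line in f:
        # Everything after the first "{" is the JSON payload; put the brace back.
        json_expr = "{" + line.partition("{")[2]
        the_dict = json.loads(json_expr)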
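The partition approach depends on the assumption stated above. A small sketch of how a line without any { could be detected, using the separator that str.partition returns; the helper name json_from_line and the sample lines are hypothetical:

```python
import json

def json_from_line(line):
    """Return the parsed JSON object from a dump line, or None if it has no '{'."""
    _, sep, rest = line.partition("{")
    if not sep:
        # partition returns an empty separator when "{" is absent
        return None
    return json.loads(sep + rest)

# Made-up sample lines (assumption): one well-formed, one without a JSON field.
record = json_from_line(
    '/type/edition\t/books/OL1M\t4\t2010-04-24T17:54:01.503315\t{"title": "Example", "revision": 4}\n'
)
missing = json_from_line("/type/edition\t/books/OL1M\t4\n")
```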
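Or split on whitespace, using the maxsplit parameter to limit the number of splits, and take the last element (the JSON data). Since the JSON expression is the last item, this works even though the JSON itself contains spaces:
json_expr = line.split(None, 4)[-1]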
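Put together, the maxsplit variant could look roughly like the sketch below. It assumes every data line has exactly five whitespace-separated fields with the JSON last; the generator name iter_json_records and the in-memory sample are made up for illustration:

```python
import io
import json

def iter_json_records(lines):
    """Yield the parsed JSON object from each dump line."""
    for line in lines:
        # At most 4 splits, so the fifth field keeps its inner spaces intact.
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        yield json.loads(fields[4])

# Stand-in for the real file: two made-up records and one blank line.
sample = io.StringIO(
    '/type/edition\t/books/OL1M\t4\t2010-04-24T17:54:01.503315\t{"title": "First Example", "revision": 4}\n'
    '\n'
    '/type/edition\t/books/OL2M\t2\t2010-04-24T17:54:01.503315\t{"title": "Second Example", "revision": 2}\n'
)
records = list(iter_json_records(sample))
```

For the real dump, the same generator can be given an open file object in place of the StringIO, which keeps only one line in memory at a time.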