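I wrote the code below. It does remove some URLs from the list, but many of the URLs that remain still contain the parameters I am trying to filter out.
I added
row[0].lower()
to try to fix this, but it still doesn't work.
The URLs with parameters look like this:
?currentPage=2&Nrpp=24&No=24
?pagination=1&currentPage=2
Does it have something to do with the "?"?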
import csv
values = [
    "/blog",
    "nrpp",
    "pagination"
]
added_vals = []
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        for value in values:
            if value not in row[0].lower() and row[0] not in added_vals:
                writer.writerow(row)
                added_vals.append(row[0])
The output should be essentially the same file, but with far fewer rows. Here are some sample URLs:
/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672
/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y
/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2
/category/dresses-5699972/tan/N-ihuZbyyZc1y?currentPage=10&Nrpp=24&No=216
Answer 0 (score: 0)
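Here is the problem: you loop over the three values, so you first test whether the first value is in row[0]. If it isn't, the row is written straight away and row[0] is added to added_vals, so the remaining values are never checked against that row. A row is therefore dropped only if it contains the first value that gets tested.
What you should do is something like: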
for row in csv.reader(inp):
    # row is a list of columns; the URL is in the first one
    if not any(v.lower() in row[0].lower() for v in values):
        writer.writerow(row)
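To see the difference without touching any files, here is a minimal in-memory sketch that runs both versions of the logic over the sample URLs from the question. The function names buggy_filter and fixed_filter are made up for this illustration, and plain strings stand in for row[0].

```python
VALUES = ["/blog", "nrpp", "pagination"]

SAMPLE_URLS = [
    "/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672",
    "/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y",
    "/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2",
    "/category/dresses-5699972/tan/N-ihuZbyyZc1y?currentPage=10&Nrpp=24&No=216",
]


def buggy_filter(urls, values):
    """Same logic as the question: keep a URL as soon as ONE value is absent."""
    kept = []
    added_vals = []
    for url in urls:
        for value in values:
            if value not in url.lower() and url not in added_vals:
                kept.append(url)
                added_vals.append(url)
    return kept


def fixed_filter(urls, values):
    """Keep a URL only if NONE of the values occurs in it."""
    return [
        url for url in urls
        if not any(v.lower() in url.lower() for v in values)
    ]


buggy_result = buggy_filter(SAMPLE_URLS, VALUES)
fixed_result = fixed_filter(SAMPLE_URLS, VALUES)
print(len(buggy_result), len(fixed_result))
```

None of the sample URLs contains "/blog", so the original logic keeps every one of them, while the any() version keeps only the URL without a query string.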
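Also, a plain in test matches the substring anywhere in the URL, so it can match in places you don't intend (for example inside the path rather than the query string). A regular expression that targets the query string would be better: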
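import re
# Match either of the two query-string shapes shown in the question
rx = re.compile(r"\?(?:currentPage=\d+&Nrpp=\d+&No=\d+|pagination=\d+&currentPage=\d+)", re.IGNORECASE)
for row in csv.reader(inp):
    # search() scans the whole URL; row[0] is the URL column
    if not rx.search(row[0]):
        writer.writerow(row)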
More about regular expressions: https://docs.python.org/3.7/library/re.html
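If the goal is to react only to real query parameters, another option (a different technique from the regex above, offered only as a sketch) is to parse the URL with the standard library's urllib.parse and compare parameter names. The names should_drop, BLOCKED_PARAMS and BLOCKED_PATH_PART are hypothetical, and the sketch assumes the first CSV column holds a path plus an optional query string, as in the sample URLs.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical configuration, mirroring the values list from the question
BLOCKED_PARAMS = {"nrpp", "pagination"}
BLOCKED_PATH_PART = "/blog"


def should_drop(url):
    """Return True if the URL is a blog page or carries a blocked parameter."""
    parts = urlsplit(url)
    # Collect the parameter names only, lower-cased for comparison
    param_names = {
        name.lower()
        for name in parse_qs(parts.query, keep_blank_values=True)
    }
    in_blog = BLOCKED_PATH_PART in parts.path.lower()
    return in_blog or bool(param_names & BLOCKED_PARAMS)


print(should_drop("/category/tan/N-ihuZbyyZc1y?currentPage=10&Nrpp=24&No=216"))
print(should_drop("/category/nrpp-dresses/N-abc"))
```

Inside the CSV loop this would be used as `if not should_drop(row[0]): writer.writerow(row)`. Unlike a substring test, a path segment that merely contains "nrpp" is not treated as a match.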
Answer 1 (score: 0)
I'm not sure what your added_vals variable is for, but I think you are overcomplicating things.
It should be easy to fix:
import csv
values = [
    "/blog",
    "nrpp",
    "pagination"
]
# Open input and output files
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)
    # Iterate through the rows in the file
    for row in csv.reader(inp):
        url = row[0].lower()
        # Iterate through the values, and see if one matches
        for value in values:
            # If we find a match, cancel the current `for` loop
            if value in url:
                break
        else:
            # This will only run if we finished the `for` loop without a `break`.
            # So, if we reached this code, no match was found
            writer.writerow(row)
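In case the for/else construct is unfamiliar, here is a tiny standalone sketch of how it behaves. The helper name contains_any is made up for this illustration.

```python
def contains_any(url, values):
    """Return True if any of the values occurs in url, using for/else."""
    for value in values:
        if value in url:
            # Leaving the loop with break skips the else branch
            break
    else:
        # Reached only when the loop ran to completion without a break,
        # which includes the case where values is empty
        return False
    return True


print(contains_any("/blog/post-1", ["/blog", "nrpp"]))
print(contains_any("/category/tan", ["/blog", "nrpp"]))
```

The else branch belongs to the for loop, not to the if, which is why its indentation lines up with the for statement.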
With a regular expression, the code becomes more compact:
import csv
import re
rx = re.compile(r"^[^?]*/blog|[?&](currentPage|nrpp)=", re.IGNORECASE)
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(inp):
        if not rx.search(row[0]):
            writer.writerow(row)
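To check what each branch of that pattern does, here is a small sketch that applies it to a few strings. The first three are sample URLs from the question; the last two are invented for illustration.

```python
import re

rx = re.compile(r"^[^?]*/blog|[?&](currentPage|nrpp)=", re.IGNORECASE)

urls = [
    "/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672",
    "/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y",
    "/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2",
    "/blog/some-post",            # invented: matched by the first branch
    "/category/tan?ref=/blog",    # invented: "/blog" appears only after the "?"
]

# True means the row would be dropped, False means it would be written
dropped = [bool(rx.search(url)) for url in urls]
print(dropped)
```

The first branch, `^[^?]*/blog`, only looks for "/blog" before the first "?", and the second branch only matches a parameter name that directly follows "?" or "&". The pagination URL is caught because it also carries currentPage.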
An alternative version, closer to your original code:
import csv
values = [
    "/blog",
    "nrpp",
    "pagination"
]
# Open input and output files
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)
    # Iterate through the rows in the file
    for row in csv.reader(inp):
        url = row[0].lower()
        # Iterate through the values, and see if one matches
        matches = False
        for value in values:
            if value in url:
                matches = True
                break
        # If none match, write to output csv
        if not matches:
            writer.writerow(row)