Question

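I wrote the following code, which definitely removes some URLs from the list, but I can see that many URLs still contain the parameters I am trying to filter out.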

I added

row[0].lower()

to try to remedy this, but it still does not work properly.

The URLs carry parameters like these:

?currentPage=2&Nrpp=24&No=24
?pagination=1&currentPage=2

Does it have something to do with the "?"?

import csv

values =  [
   "/blog",
   "nrpp",
   "pagination"
]  

added_vals = []

with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
  writer = csv.writer(out)
  for row in csv.reader(inp):
     for value in values:
         if value not in row[0].lower() and row[0] not in added_vals:
            writer.writerow(row)
         added_vals.append(row[0])

The output should be essentially the same file, but with far fewer rows. Here are some example URLs:

/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672
/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y
/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2
/category/dresses-5699972/tan/N-ihuZbyyZc1y?currentPage=10&Nrpp=24&No=216

Answer 1

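Here is the problem: you loop over the three values, so you first test whether the first value is in row[0]. Whether or not it is, row[0] is then appended to added_vals, so the row is never tested against the remaining values and can no longer be written. In effect, only the first value, "/blog", is ever filtered out.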
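To make that behaviour concrete, here is a minimal sketch that reproduces the question's loop in memory. The sample rows and the helper name original_filter are assumptions made for illustration; rows are collected in a list instead of being written to a file.

```python
values = ["/blog", "nrpp", "pagination"]

# Hypothetical sample rows, modelled on the URLs in the question
rows = [
    ["/blog/some-post"],
    ["/category/dresses/N-abc?currentPage=29&Nrpp=24&No=672"],
    ["/category/dresses/N-def?pagination=1&currentPage=2"],
    ["/category/dresses/purple/N-ghi"],
]

def original_filter(rows, values):
    """Reproduce the question's loop, collecting rows instead of writing them."""
    added_vals = []
    written = []
    for row in rows:
        for value in values:
            if value not in row[0].lower() and row[0] not in added_vals:
                written.append(row)
            # Appended on every pass, so later values can never trigger a write
            added_vals.append(row[0])
    return written

kept = [row[0] for row in original_filter(rows, values)]
print(kept)
```

Only the /blog row is dropped here; the rows containing Nrpp and pagination still get through, which matches what the question describes.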

What you should do is something like this:

for row in csv.reader(inp):
    url = row[0].lower()  # row is a list; the URL is in the first column
    if not any(v.lower() in url for v in values):
        writer.writerow(row)
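As a self-contained check of the any() approach, the sketch below runs the same loop over in-memory text instead of real files. The sample rows and the second column are assumptions; io.StringIO simply stands in for internal_all_dup_facets.csv and dupfacets.csv.

```python
import csv
import io

values = ["/blog", "nrpp", "pagination"]

# Hypothetical input standing in for internal_all_dup_facets.csv
inp = io.StringIO(
    "/blog/some-post,200\n"
    "/category/dresses/N-abc?currentPage=29&Nrpp=24&No=672,200\n"
    "/category/dresses/purple/N-def,200\n"
)
out = io.StringIO()  # stands in for dupfacets.csv

writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(inp):
    url = row[0].lower()
    # Keep the row only if none of the values occurs in the URL
    if not any(v.lower() in url for v in values):
        writer.writerow(row)

result = out.getvalue()
print(result)
```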

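In addition, a plain substring test with in can produce a lot of false matches (a value such as "nrpp" would also match inside a path segment), so a regular expression that targets the query string may work better: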

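import re

# Match either paginated query-string shape shown in the question
rx = re.compile(r"\?(currentPage=\d+&Nrpp=\d+&No=\d+|pagination=\d+&currentPage=\d+)", re.IGNORECASE)

for row in csv.reader(inp):
    # row is a list, so search the URL in the first column
    if not rx.search(row[0]):
        writer.writerow(row)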

More information about regular expressions: https://docs.python.org/3.7/library/re.html
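Another way to avoid accidental substring matches, not taken from the answer above, is to parse each URL with the standard library's urllib.parse and compare parameter names directly. This is only a sketch: the helper name should_drop and the constants are hypothetical, and it assumes "/blog" should be matched anywhere in the path.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed filter settings, mirroring the values list from the question
BLOCKED_PARAMS = {"nrpp", "pagination"}
BLOCKED_PATH_PART = "/blog"

def should_drop(url):
    """Return True if the URL is a blog page or carries a blocked query parameter."""
    parts = urlsplit(url)
    if BLOCKED_PATH_PART in parts.path.lower():
        return True
    # parse_qs returns a dict keyed by parameter name; compare names case-insensitively
    names = {name.lower() for name in parse_qs(parts.query, keep_blank_values=True)}
    return bool(names & BLOCKED_PARAMS)

print(should_drop("/category/dresses/N-abc?currentPage=29&Nrpp=24&No=672"))
print(should_drop("/category/nrpp-dresses/N-abc"))
```

Inside the CSV loop this would be used as: if not should_drop(row[0]): writer.writerow(row).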

Answer 2

I am not sure what your added_vals variable is for, but I think you are overcomplicating things.

It should be easy to fix:

import csv

values =  [
   "/blog",
   "nrpp",
   "pagination"
]

# Open input and output files
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)

    # Iterate through the rows in the file
    for row in csv.reader(inp):
        url = row[0].lower()

        # Iterate through the values, and see if one matches
        for value in values:
            # If we find a match, cancel the current `for` loop
            if value in url:
                break
        else:
            # This will only run if we finished the `for` loop without a `break`.
            # So, if we reached this code, no match was found
            writer.writerow(row)
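The for ... else construct used above can be unfamiliar, so here is a minimal sketch of it in isolation. The function name matches_any and the sample URLs are assumptions made for illustration.

```python
def matches_any(url, values):
    """Return True if any of the values occurs in url, using for/else."""
    for value in values:
        if value in url:
            break  # a match was found, so the else branch is skipped
    else:
        # Reached only when the loop finished without hitting break
        return False
    return True

values = ["/blog", "nrpp", "pagination"]
print(matches_any("/category/dresses?currentpage=2&nrpp=24", values))
print(matches_any("/category/dresses/purple", values))
```

For this particular check, any(value in url for value in values) gives the same result.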

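With a regular expression the code becomes more compact (the pattern below matches /blog in the path, or a currentPage or nrpp query parameter):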

import csv
import re

rx = re.compile(r"^[^?]*/blog|[?&](currentPage|nrpp)=", re.IGNORECASE)

with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)

    for row in csv.reader(inp):
        if not rx.search(row[0]):
            writer.writerow(row)
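As a quick sanity check, the pattern can be run against the sample URLs from the question without touching any files. The last two URLs in the list are hypothetical and are there only to exercise the /blog branch.

```python
import re

# Pattern copied from the answer above
rx = re.compile(r"^[^?]*/blog|[?&](currentPage|nrpp)=", re.IGNORECASE)

urls = [
    "/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672",
    "/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y",
    "/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2",
    "/category/dresses-5699972/tan/N-ihuZbyyZc1y?currentPage=10&Nrpp=24&No=216",
    "/blog/summer-trends",        # hypothetical: /blog in the path
    "/category/shoes?ref=/blog",  # hypothetical: /blog only in the query string
]

# Keep only the URLs the pattern does not match
kept = [u for u in urls if not rx.search(u)]
print(kept)
```

Because the first alternative is anchored with ^[^?]*, a /blog that appears only after the "?" does not cause the row to be dropped.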

An alternative version, closer to your original code:

import csv

values =  [
   "/blog",
   "nrpp",
   "pagination"
]

# Open input and output files
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)

    # Iterate through the rows in the file
    for row in csv.reader(inp):
        url = row[0].lower()

        # Iterate through the values, and see if one matches
        matches = False
        for value in values:
            if value in url:
                matches = True
                break

        # If none match, write to output csv
        if not matches:
            writer.writerow(row)

Remove URL strings containing a specific substring from a .csv
