Select observations where one variable does not co-occur in another variable within a group (SQL)

Asked: 2018-01-05 17:43:38

Tags: sql google-bigquery

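I'm working with the patents-public-data.patents.publications_201710 table in Google's BigQuery. My goal is to identify patents that are potentially company-owned. Since inventors must be individuals rather than companies, I should in theory be able to do this by identifying records where the assignee is not found in the list of inventors. The tricky part is that the same publication_number often has multiple inventors and even multiple assignees.

The table initially looks like this: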

Row publication_number  assignee                            inventor
1   US-7011573-B2       Mcarthur Richard C                  MCARTHUR RICHARD C.
                        Holmes Dennis G                     HOLMES DENNIS G.
                        Mcarthur Ronald J                   MCARTHUR RONALD J.
2   US-8746747-B2       IPS Corporation—Weld-On Division    MCPHERSON TERRY R

I have tried the following (unsuccessful) query:

#standard sql
SELECT
  p.publication_number,
  assignee,
  inventor
FROM
  `patents-public-data.patents.publications_201710` AS p,
  p.assignee assignee,
  p.inventor inventor
WHERE
  assignee not in (inventor) #want to run this within each p.publication_number-assignee group somehow
  AND p.publication_number IN ('US-8746747-B2',
    'US-7011573-B2')

This query produces the following output:

Row publication_number  assignee                            inventor     
1   US-7011573-B2       Mcarthur Richard C                  MCARTHUR RICHARD C.  
2   US-7011573-B2       Mcarthur Richard C                  HOLMES DENNIS G.     
3   US-7011573-B2       Mcarthur Richard C                  MCARTHUR RONALD J.   
4   US-7011573-B2       Holmes Dennis G                     MCARTHUR RICHARD C.  
5   US-7011573-B2       Holmes Dennis G                     HOLMES DENNIS G.
6   US-7011573-B2       Holmes Dennis G                     MCARTHUR RONALD J.   
7   US-7011573-B2       Mcarthur Ronald J                   MCARTHUR RICHARD C.  
8   US-7011573-B2       Mcarthur Ronald J                   HOLMES DENNIS G.     
9   US-7011573-B2       Mcarthur Ronald J                   MCARTHUR RONALD J.   
10  US-8746747-B2       IPS Corporation—Weld-On Division    MCPHERSON TERRY R   

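First, I want to check whether the assignee is contained in (not merely equal to) the inventor variable, because the assignee omits the "." after the middle initial. Strings in the assignee variable are usually equal to the inventor, but not always. Some examples of exceptions:

  • Assignee: "Daryl A. KRUPA"; Inventor: "KRUPA Daryl A."
  • Assignee: "KRUPADANAM Gazula Levi DAVID"; Inventor: "DAVID KRUPADANAM, Gazula Levi"

I know I will ultimately have to live with some false positives (records that are actually assigned to individuals rather than companies), but if my query can resolve some of them, I would prefer that.

Second, assuming the first issue is solved, I can't simply check whether the assignee is contained in the inventor variable, because that would inappropriately retain rows 2-4 and 6-8.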

My desired output is just:

Row publication_number  assignee                            inventor     
1   US-8746747-B2       IPS Corporation—Weld-On Division    MCPHERSON TERRY R   

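I found some related posts here and here, but they either don't fully solve my problem or don't use SQL.

What is an appropriate standard SQL query for my desired output?

2 answers:

Answer 0 (score: 2)

Below is for BigQuery Standard SQL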

   
#standardSQL
CREATE TEMP FUNCTION normalizeString(phrase STRING) AS ((
  SELECT AS STRUCT phrase original, STRING_AGG(LOWER(word), ' ' ORDER BY word) normalized 
  FROM UNNEST(SPLIT(REGEXP_REPLACE(phrase, r'[,.]', ''), ' ')) word
));
CREATE TEMP FUNCTION removeDups(arr1 ARRAY<STRING>, arr2 ARRAY<STRING>) AS (ARRAY(
  SELECT a.original 
  FROM UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr1) a ORDER BY b.normalized)) a, 
  UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr2) a ORDER BY b.normalized)) i 
  GROUP BY 1 
  HAVING COUNTIF(a.normalized = i.normalized) = 0
));
SELECT publication_number,
  removeDups(assignee, inventor) assignee,
  removeDups(inventor, assignee) inventor
FROM `patents-public-data.patents.publications_201710`
WHERE publication_number IN ('US-8746747-B2', 'US-7011573-B2')
AND (ARRAY_LENGTH(removeDups(assignee, inventor)) > 0
OR ARRAY_LENGTH(removeDups(inventor, assignee)) > 0)   

You can test / play with the above using dummy data, as in the example below:
#standardSQL
CREATE TEMP FUNCTION normalizeString(phrase STRING) AS ((
  SELECT AS STRUCT phrase original, STRING_AGG(LOWER(word), ' ' ORDER BY word) normalized 
  FROM UNNEST(SPLIT(REGEXP_REPLACE(phrase, r'[,.]', ''), ' ')) word
));
CREATE TEMP FUNCTION removeDups(arr1 ARRAY<STRING>, arr2 ARRAY<STRING>) AS (ARRAY(
  SELECT a.original 
  FROM UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr1) a ORDER BY b.normalized)) a, 
  UNNEST(ARRAY(SELECT normalizeString(a) b FROM UNNEST(arr2) a ORDER BY b.normalized)) i 
  GROUP BY 1 
  HAVING COUNTIF(a.normalized = i.normalized) = 0
));
WITH `patents-public-data.patents.publications_201710` AS (
  SELECT 'US-8746747-B2' publication_number, ['IPS Corporation—Weld-On Division'] assignee, ['MCPHERSON TERRY R'] inventor UNION ALL 
  SELECT 'US-7011573-B2', ['Mcarthur Richard C', 'Holmes Dennis G', 'Mcarthur Ronald J'], ['MCARTHUR RICHARD C.', 'HOLMES DENNIS G.', 'MCARTHUR RONALD J.'] UNION ALL 
  SELECT 'TestA', ['Daryl A. KRUPA'], ['KRUPA Daryl A.'] UNION ALL 
  SELECT 'TestB', ['KRUPADANAM Gazula Levi DAVID'], ['DAVID KRUPADANAM, Gazula Levi'] 
)
SELECT publication_number,
  removeDups(assignee, inventor) assignee,
  removeDups(inventor, assignee) inventor
FROM `patents-public-data.patents.publications_201710`
WHERE publication_number IN ('US-8746747-B2', 'US-7011573-B2', 'TestA', 'TestB')
AND (ARRAY_LENGTH(removeDups(assignee, inventor)) > 0
OR ARRAY_LENGTH(removeDups(inventor, assignee)) > 0)  

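Note: you can control the "name normalization" logic in the normalizeString function. In the example I provided, I just remove dots and commas, but you might want to enhance this.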
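As a rough sketch of one possible enhancement (not part of the answer above): punctuation can be replaced with a space instead of being deleted, so that a name written without a space after the comma still splits into separate words, and the empty tokens this produces can be filtered out. The function name normalizeName and the inline sample names are assumptions made for illustration only.

```sql
#standardSQL
-- Hypothetical variant of normalizeString that returns only the normalized key.
-- Dots and commas become spaces, words are lowercased, empty tokens are dropped,
-- and the remaining words are sorted so that word order no longer matters.
CREATE TEMP FUNCTION normalizeName(phrase STRING) AS ((
  SELECT STRING_AGG(word, ' ' ORDER BY word)
  FROM UNNEST(SPLIT(REGEXP_REPLACE(LOWER(phrase), r'[,.]', ' '), ' ')) AS word
  WHERE word != ''
));
-- Sample names only; each pair should map to the same normalized key.
SELECT name, normalizeName(name) AS normalized
FROM UNNEST([
  'Daryl A. KRUPA',
  'KRUPA Daryl A.',
  'KRUPADANAM Gazula Levi DAVID',
  'DAVID KRUPADANAM,Gazula Levi'
]) AS name
ORDER BY normalized, name;
```

Sorting the words makes the key insensitive to name order, at the cost of treating two different people whose names contain the same words as identical, so some false matches remain possible.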

Answer 1 (score: 1)

The following gets the patent/assignee information where the assignee is not an inventor:

SELECT p.publication_number, a.assignee
FROM `patents-public-data.patents.publications_201710` p JOIN
     p.assignee a
     ON p.assignee = a.assignee  LEFT JOIN -- guessing what the join keys are
     p.inventor i
     ON a.assignee = i.inventor  -- guessing what the join keys are
WHERE p.publication_number IN ('US-8746747-B2', 'US-7011573-B2') AND
      i.inventor IS NULL;

It is unclear which fields should be used for the joins.
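If assignee and inventor are ARRAY<STRING> columns, as the question's query and the dummy data in Answer 0 suggest, the same anti-join idea could be sketched with UNNEST and NOT EXISTS instead of joins on guessed keys. This is only an illustration under that assumption: the publications CTE is a hypothetical stand-in for the real table, and the comparison merely strips dots and ignores case.

```sql
#standardSQL
-- Hypothetical stand-in for the real table, using the rows from the question.
WITH publications AS (
  SELECT 'US-8746747-B2' AS publication_number,
         ['IPS Corporation—Weld-On Division'] AS assignee,
         ['MCPHERSON TERRY R'] AS inventor UNION ALL
  SELECT 'US-7011573-B2',
         ['Mcarthur Richard C', 'Holmes Dennis G', 'Mcarthur Ronald J'],
         ['MCARTHUR RICHARD C.', 'HOLMES DENNIS G.', 'MCARTHUR RONALD J.']
)
SELECT p.publication_number, a AS assignee
FROM publications AS p, UNNEST(p.assignee) AS a
-- Keep an assignee only if no inventor of the same publication matches it
-- after removing dots and lowercasing both sides.
WHERE NOT EXISTS (
  SELECT 1
  FROM UNNEST(p.inventor) AS i
  WHERE LOWER(REPLACE(i, '.', '')) = LOWER(REPLACE(a, '.', ''))
);
```

Because the subquery is evaluated per publication row, each assignee is compared only against the inventors of its own publication_number. Names written in a different word order would still need a stronger normalization, such as the normalizeString function in Answer 0.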