首页 文章

Bigquery - 使用SemiJoins过滤重复字段

提问于
浏览
1

我正在尝试根据重复字段中的项是否位于另一个表的列中来从一个表中选择记录 . 在我的代码中明确列出我正在测试的项目时,我已经能够这样做了,但是当从另一个表中选择时却没有 . 让我演示使用trigrams数据集:

假设我想选择在某些年份出现的所有记录 . 但我不仅仅想要那些年份的数据 - 我想要与这些记录相关的所有数据 . 如果我只是想要几年的数据,我可以做这样的事情(这是有效的):

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
    SOME(cell.value in ('1800', '1801')) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid

但是,我没有将'1800'和'1801'编码到我的查询中,而是有一个表 years ,其中包含我感兴趣的一组年份 . 我希望这可以工作:

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count,
    SOME(cell.value in (SELECT year_as_str FROM [mydataset.years])) WITHIN RECORD AS valid
FROM [publicdata:samples.trigrams]
HAVING valid

这不起作用,因为bigquery要求半连接是 WHEREHAVING 子句的一部分 .

所以我尝试重新排列(回到第一个查询):

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802')) WITHIN RECORD

这会导致错误 Encountered " "WITHIN" "WITHIN "" ... Was expecting <EOF>

所以现在没有 WITHIN RECORD

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
HAVING SOME(cell.value in ('1801', '1802'))

这会导致错误 SELECT clause has mix of aggregations '...' and fields '...' without GROUP BY clause

但我不是在聚合!所以现在我将过滤器移动到 WHERE

SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
FROM [publicdata:samples.trigrams]
WHERE SOME(cell.value in ('1801', '1802'))

这告诉我 Invalid function name: SOME . 什么?!

有没有办法通过BigQuery获取我正在寻找的行为?

2 回答

  • 0

    下面解决了你的例子,我希望你能将它扩展到你的实际用例(如果你想要解决方案)

    SELECT 
        ngram, cell.value, cell.volume_count, 
        cell.volume_fraction, cell.page_count, cell.match_count
    FROM [publicdata:samples.trigrams] AS trigrams
    JOIN (
      SELECT ngram AS qualified
      FROM (
        FLATTEN((SELECT ngram, cell.value AS value
          FROM (FLATTEN([publicdata:samples.trigrams], cell.value))), value)
      ) AS t
      JOIN [mydataset.years] AS y
      ON y.year_as_str = t.value
      GROUP BY 1
    ) AS valid
    ON valid.qualified = trigrams.ngram
    

    请注意 [publicdata:samples.trigrams] 字段 cell.valueREPEATED STRING 的事实 - 这就是为什么你看到"extra" FLATTEN的东西

  • 1

    您可以使用OMIT RECORD IF子句 . 这可能需要双重否定,因为您需要省略满足某些条件的记录 . 以下查询应该有效:

    SELECT ngram, first, second, third, fourth, fifth, cell.value, cell.volume_count
    FROM [publicdata:samples.trigrams]
    OMIT RECORD IF EVERY(cell.value NOT IN ('1801', '1802'))
    

相关问题