
Get duplicate rows in a Spark DataFrame based on a column [duplicate]


This question already has an answer here:

I am trying to drop duplicate rows based on the column id. How can I get the rows that were dropped because of a duplicate "id"? This is the code I have been working with.

val datatoBeInserted = data.select("id", "is_enabled", "code", "description", "gamme", "import_local", "marque", "type_marketing", "reference", "struct", "type_tarif", "family_id", "range_id", "article_type_id")
val cleanedData = datatoBeInserted.dropDuplicates("id")

With the query above, cleanedData will contain only rows with non-duplicate "id" values. Now I want to find out which rows were filtered out as duplicates.

1 Answer

  • 1

    You can find the dropped data with the following code:

    val datatoBeInserted = data.select("id", "is_enabled", "code", "description", "gamme", "import_local", "marque", "type_marketing", "reference", "struct", "type_tarif", "family_id", "range_id", "article_type_id")
    
    val cleanedData = datatoBeInserted.dropDuplicates("id")
    
    val droppedData = datatoBeInserted.except(cleanedData)
    

    All the best :)
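One caveat worth noting: except is a set difference, so if two dropped rows are identical across every column, droppedData will contain that row only once. A sketch of an alternative, assuming the same datatoBeInserted DataFrame as above (the names duplicatedIds and allDuplicateRows are illustrative), that lists every row whose "id" occurs more than once, including the copy that dropDuplicates kept:

```scala
import org.apache.spark.sql.functions.{count, col}

// ids that appear on more than one row
val duplicatedIds = datatoBeInserted
  .groupBy("id")
  .agg(count("*").as("cnt"))
  .filter(col("cnt") > 1)
  .select("id")

// every row sharing a duplicated id, kept copy included
val allDuplicateRows = datatoBeInserted.join(duplicatedIds, Seq("id"))
```

This avoids the full-row comparison that except performs, at the cost of an extra aggregation and join.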
