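I have two dataframes in Scala, which I created with SQL queries through the Hive context. Please see df here, shown as an image.
The other dataframe is
Please ignore the repeated headers in the second df. I want to compare the Skill column in the two dataframes and get the matching Role, Skill2 and Emerging values into df1, i.e. demand_df.
I tried this in pandas and was able to achieve it with the following snippet: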
import pandas as pd
df1 = pd.DataFrame([["INDIA","XXX","developer","UNKNOWN",121],["INDIA","XXXX","software engineer","UNKNOWN",121],["POLAND","XX","english","KNOWN",122]], columns=['country','level','Skill','r2d2','tax'])
df2 = pd.DataFrame([["english","NaN","teacher","NaN","NaN"],[20000,"Unknown","NaN","NaN","NaN"],["microsoft","Known","Software Engineer","Microsoft","Enterprise"]], columns=['Skill','R2D2','Role','Skill2','Emerging'])
# Left join on Skill, taking only the wanted columns from df2
result = df1.merge(df2[['Skill','Role','Skill2','Emerging']], how='left', left_on='Skill', right_on='Skill')
Please guide me, as I am new to Scala.
1 Answer
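Since you have already created the two dataframes and want to join them on Skill to create a new dataframe containing df1 plus Role, Skill2 and Emerging from df2, you can do this through the SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Register both dataframes as temp tables with the following commands: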
df1.registerTempTable("df1")
df2.registerTempTable("df2")
After that, use a simple Hive query to join them and get the required columns from the dataframes:
val df3 = sqlContext.sql("select a.*, b.* from df1 a left join df2 b on (a.skill = b.skill)")
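The same left join can also be written with the DataFrame API, which mirrors the pandas merge more closely. The following is only a minimal sketch, assuming Spark 2.x or later (where SparkSession is the entry point). The names demandDf and skillsDf are hypothetical stand-ins for the two dataframes in the question, and the rows are copied from the pandas snippet:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .appName("skill-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the pandas snippet in the question.
val demandDf = Seq(
  ("INDIA", "XXX", "developer", "UNKNOWN", 121),
  ("INDIA", "XXXX", "software engineer", "UNKNOWN", 121),
  ("POLAND", "XX", "english", "KNOWN", 122)
).toDF("country", "level", "Skill", "r2d2", "tax")

// 20000 is written as a string here because a Spark column holds a single type.
val skillsDf = Seq(
  ("english", "NaN", "teacher", "NaN", "NaN"),
  ("20000", "Unknown", "NaN", "NaN", "NaN"),
  ("microsoft", "Known", "Software Engineer", "Microsoft", "Enterprise")
).toDF("Skill", "R2D2", "Role", "Skill2", "Emerging")

// Keep only the wanted columns from the second dataframe, then left-join on Skill.
val result = demandDf.join(
  skillsDf.select("Skill", "Role", "Skill2", "Emerging"),
  Seq("Skill"),
  "left"
)

result.show()
```

Passing the join key as Seq("Skill") keeps a single Skill column in the result, whereas select a.*, b.* returns the Skill column from both tables. Rows of demandDf with no match in skillsDf get null in Role, Skill2 and Emerging.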