1. 优化原则:小表驱动大表,即小数据集驱动大数据集. select * from A where id in (select id from B) 等价于: for select id from B for select * from A where A.id = B.id 当B表的数据集必须小于A的数据集时,用in优于exists. select * from A where exists (select 1 from B where B.id = A.id) 等价于: for select
在了解之前要先了解对应语法 in 与 exist. IN: select * from A where A.id in (select B.id from B) in后的括号的表达式结果要求之输出一列字段.与之前的搜索字段匹配,匹配到相同则返回对应行. mysql的执行顺序是先执行子查询,然后执行主查询,用子查询的结果按条匹配主查询. EXIST: select * from A where exists(select * from B where B.id= A.id) exist后的括号里则
给出两个表,A和B,A和B表的数据量, 当A小于B时,用exists select * from A where exists (select * from B where A.id=B.id) exists的实现,相当于外表循环,每次循环对内表进行查询? for i in A for j in B if j.id == i.id then .... 相反,如果A大于B的时候,则用in select * from A where id in (select id from B) 这种在逻辑上类似
1.小.大表 join 在小表和大表进行join时,将小表放在前边,效率会高.hive会将小表进行缓存. 2.mapjoin 使用mapjoin将小表放入内存,在map端和大表逐一匹配.从而省去reduce. 样例: select /*+MAPJOIN(b)*/ a.a1,a.a2,b.b2 from tablea a JOIN tableb b ON a.a1=b.b1 在0.7版本号后.也能够用配置来自己主动优化 set hive.auto.convert.join=true;
问题SQL: select p.person_id as personId, p.person_name as personName, p.native_place as nativePlace, ci.company_name as companyName, pp.seal_number as sealNumber, GROUP_CONCAT(pp.major) as major, pp.register_name as registerName from qyt_person p left
需求: 小表数据量20w条左右,大表数据量在4kw条左右,需要根据大表筛选出150w条左右的数据并关联更新小表中5k左右的数据. 性能问题: 对筛选条件中涉及的字段加index后,如下常规的update语句仍耗时半小时左右. UPDATE WMOCDCREPORT.DM_WM_TRADINGALL A SET ( A.RELATIONSHIPNO, A.PACKAGE ) = (SELECT B.RELATIONSHIPNO, CASE ' ' ' ') THEN 'BC' ') THEN 'P