SQL Summary (mind-map notes)
A SQL summary plus common big-data SQL interview questions, covering filtering, aggregation, decomposition, sorting, joins, derivation, problems, metrics, and more.
Filtering
Conditional filtering
where: returns the rows that satisfy the condition (=, <>, like, etc.)
case when does not support set filtering via subqueries (case when id in (select ...) is not supported); the usual pattern is: case when <condition> then 1 else 0 end
Set filtering
left join
left semi join: equivalent to in (select ...); it mainly matters because old Hive versions did not support the in (subquery) syntax
in / not in (select ...): the execution plan is actually a join
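A minimal sketch of the equivalence (table and column names are hypothetical):

```sql
-- Rows of orders whose user_id appears in vip_users:
SELECT o.* FROM orders o
WHERE o.user_id IN (SELECT user_id FROM vip_users);

-- Equivalent LEFT SEMI JOIN (works on old Hive versions;
-- only left-table columns may appear in the SELECT list):
SELECT o.* FROM orders o
LEFT SEMI JOIN vip_users v ON o.user_id = v.user_id;
```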
Aggregation
Basic aggregation
Plain aggregation
select key, avg/count/sum(...) from xx group by key. Note: watch the effect of NULLs; use COALESCE to neutralize them.
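A minimal sketch of the NULL effect (hypothetical table t(key, val)):

```sql
-- AVG() ignores NULLs: avg of (10, NULL, 20) is 15, not 10.
-- To count NULLs as zero, substitute them with COALESCE:
SELECT key,
       AVG(val)              AS avg_ignoring_null,
       AVG(COALESCE(val, 0)) AS avg_null_as_zero
FROM t
GROUP BY key;
```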
Nested aggregation
select count(distinct ...), e.g. select count(distinct case when gender = 'man' then user_id end) as man_cnt
Advanced aggregation
GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
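A minimal GROUPING SETS sketch (hypothetical table visits(month, day)); one pass produces several grouping combinations, equivalent to a UNION ALL of separate GROUP BYs:

```sql
SELECT month, day, COUNT(1) AS cnt,
       GROUPING__ID              -- identifies which grouping set produced the row
FROM visits
GROUP BY month, day
GROUPING SETS ((month), (day), (month, day));
```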
Collection aggregation
collect_list(col) + group by key: returns the values gathered per key as a list (for ordered results, combine with DISTRIBUTE BY key + SORT BY col). collect_set(col) + group by key: returns the deduplicated set gathered per key.
Window aggregation
String aggregation
concat(col1, col2): returns "col1col2". concat_ws(separator, collection or columns): returns the values joined by the separator.
Decomposition
Collection decomposition
lateral view + posexplode(array): returns one row per array element together with its index. lateral view + explode(array): returns one row per array element. Note: lateral view (a "side view") forms a Cartesian product between each original row and its exploded elements. Use cases: column-to-row conversion, merging parallel arrays.
Example:

```sql
-- Input:
-- +---+-----+
-- |id |value|
-- +---+-----+
-- |1  |1011 |
-- |2  |101  |
-- |3  |1111 |
-- |4  |1    |
-- +---+-----+
SELECT id, value, single_value, single_value_index
FROM tempTable
LATERAL VIEW posexplode(split(value, '')) t AS single_value_index, single_value;
```

```text
+---+-----+------------+------------------+
|id |value|single_value|single_value_index|
+---+-----+------------+------------------+
|1  |1011 |1           |0                 |
|1  |1011 |0           |1                 |
|1  |1011 |1           |2                 |
|1  |1011 |1           |3                 |
|1  |1011 |            |4                 |
|2  |101  |1           |0                 |
|2  |101  |0           |1                 |
|2  |101  |1           |2                 |
|2  |101  |            |3                 |
|3  |1111 |1           |0                 |
|3  |1111 |1           |1                 |
|3  |1111 |1           |2                 |
|3  |1111 |1           |3                 |
|3  |1111 |            |4                 |
|4  |1    |1           |0                 |
|4  |1    |            |1                 |
+---+-----+------------+------------------+
```

(splitting on the empty pattern yields a trailing empty element, hence the blank rows)
String decomposition
split(string, separator): returns an array
Sorting
order by: global sort. Under set hive.mapred.mode=strict, an ORDER BY query must carry a LIMIT or it fails.
distribute by: controls how map output is divided among the reducers; rows are routed to reducers by the given field. (Note: distribute by is somewhat like group by: distribute by decides which reducer handles each row, while sort by decides how rows are ordered within each reducer.)
cluster by: combines the functionality of distribute by with that of sort by on the same column. The sort order is fixed ascending; ASC or DESC cannot be specified.
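A minimal sketch of the equivalence (hypothetical table t):

```sql
-- These two statements are equivalent; CLUSTER BY takes no ASC/DESC:
SELECT * FROM t CLUSTER BY id;
SELECT * FROM t DISTRIBUTE BY id SORT BY id;
```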
Joining
Row-wise combination
union / union all (union deduplicates the combined rows, union all keeps duplicates)
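A minimal sketch of the difference (FROM-less SELECT assumes newer Hive or SparkSQL syntax):

```sql
SELECT 1 AS x UNION     SELECT 1 AS x;  -- one row: duplicates removed
SELECT 1 AS x UNION ALL SELECT 1 AS x;  -- two rows: duplicates kept
```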
Column-wise joins
Full join
Semi join
Derivation
Derived rows
lateral view explode
Cartesian product
join on: many rows sharing a key on one side joined to many rows sharing the key on the other side multiplies into a Cartesian blow-up
Derived columns
Row level
case when. Note: converts a continuous value into discrete buckets; derives a new column by condition.
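A minimal continuous-to-discrete sketch (hypothetical table users(age)):

```sql
SELECT age,
       CASE WHEN age < 18 THEN 'minor'
            WHEN age < 60 THEN 'adult'
            ELSE 'senior' END AS age_band  -- continuous age mapped to discrete bands
FROM users;
```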
Window level
A plain aggregate function aggregates over a group, while a window (analytic) function aggregates over a window. A plain aggregate therefore returns one value per GROUP BY group, whereas a window function can return one value for every row in its window. Put simply, the query result gains an extra column, holding either an aggregate or a ranking value. Window functions fall into two broad classes, aggregating and ordering:
- return an aggregate value: SUM, AVG, MIN, MAX (support the WINDOW clause)
- return ratios derived from aggregates: CUME_DIST, PERCENT_RANK (no WINDOW clause)
- return rank numbers in an ordered sequence: NTILE, ROW_NUMBER, RANK, DENSE_RANK (no WINDOW clause)
- return values at positions of an ordered sequence: LAG, LEAD, FIRST_VALUE, LAST_VALUE (no WINDOW clause)
Aggregating
SUM,AVG,MIN,MAX
Used for per-group totals and running (cumulative) statistics. Hive version: apache-hive-0.13.1.

Data preparation:

```sql
CREATE EXTERNAL TABLE lxw1234 (
  cookieid string,
  createtime string,  -- day
  pv INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile LOCATION '/tmp/lxw11/';
```

```text
hive> select * from lxw1234;
OK
cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
```

SUM (note: the result depends on ORDER BY, ascending by default):

```sql
SELECT cookieid, createtime, pv,
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1,                                                   -- default: partition start to current row
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,  -- partition start to current row, same as pv1
  SUM(pv) OVER(PARTITION BY cookieid) AS pv3,                                                                       -- all rows of the partition
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,          -- current row + 3 preceding
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,          -- current row + 3 preceding + 1 following
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6   -- current row + all following
FROM lxw1234;
```

```text
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------
cookie1  2015-04-10 1  1   1   26  1   6   26
cookie1  2015-04-11 5  6   6   26  6   13  25
cookie1  2015-04-12 7  13  13  26  13  16  20
cookie1  2015-04-13 3  16  16  26  16  18  13
cookie1  2015-04-14 2  18  18  26  17  21  10
cookie1  2015-04-15 4  22  22  26  16  20  8
cookie1  2015-04-16 4  26  26  26  13  13  4
```

pv1: running pv total from the partition start to the current row; e.g. pv1 on the 11th = pv of the 10th + pv of the 11th, the 12th = 10th + 11th + 12th.
pv2: same as pv1.
pv3: total pv of the whole partition (cookie1).
pv4: current row plus the 3 preceding rows; e.g. the 11th = 10th + 11th, the 14th = 11th + 12th + 13th + 14th.
pv5: current row + 3 preceding + 1 following; e.g. the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21.
pv6: current row + all following rows; e.g. the 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13, the 14th = 14th + 15th + 16th = 2+4+4 = 10.

If ROWS BETWEEN is omitted, the frame defaults to partition start through current row; if ORDER BY is omitted, all values in the partition are aggregated. The key is understanding ROWS BETWEEN, also called the WINDOW clause: PRECEDING is backward, FOLLOWING is forward, CURRENT ROW is the current row, and UNBOUNDED marks the boundary (UNBOUNDED PRECEDING is the partition start, UNBOUNDED FOLLOWING the partition end).

AVG, MIN and MAX are used exactly like SUM: the same six window frames with only the function name swapped. Their results on the data above:

```text
-- AVG
cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
-----------------------------------------------
cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144
cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667
cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0
cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25
cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335
cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0
cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0

-- MIN
cookie1 2015-04-10 1 1 1 1 1 1 1
cookie1 2015-04-11 5 1 1 1 1 1 2
cookie1 2015-04-12 7 1 1 1 1 1 2
cookie1 2015-04-13 3 1 1 1 1 1 2
cookie1 2015-04-14 2 1 1 1 2 2 2
cookie1 2015-04-15 4 1 1 1 2 2 4
cookie1 2015-04-16 4 1 1 1 2 2 4

-- MAX
cookie1 2015-04-10 1 1 1 7 1 5 7
cookie1 2015-04-11 5 5 5 7 5 7 7
cookie1 2015-04-12 7 7 7 7 7 7 7
cookie1 2015-04-13 3 7 7 7 7 7 4
cookie1 2015-04-14 2 7 7 7 7 7 4
cookie1 2015-04-15 4 7 7 7 7 7 4
cookie1 2015-04-16 4 7 7 7 4 4 4
```
CUME_DIST,PERCENT_RANK
Note: the sequence functions do not support the WINDOW clause. Hive version: apache-hive-0.13.1.

Data preparation:

```text
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
```

```sql
CREATE EXTERNAL TABLE lxw1234 (
  dept STRING,
  userid string,
  sal INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile LOCATION '/tmp/lxw11/';
```

CUME_DIST: (number of rows with a value <= the current value) / (total rows in the partition). For example, the share of people earning no more than the current salary:

```sql
SELECT dept, userid, sal,
  CUME_DIST() OVER(ORDER BY sal) AS rn1,
  CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM lxw1234;
```

```text
dept userid sal  rn1 rn2
-------------------------------------------
d1   user1  1000 0.2 0.3333333333333333
d1   user2  2000 0.4 0.6666666666666666
d1   user3  3000 0.6 1.0
d2   user4  4000 0.8 0.5
d2   user5  5000 1.0 1.0
```

rn1: no PARTITION BY, so all 5 rows form one group. Row 1: one row <= 1000, so 1/5 = 0.2. Row 3: three rows <= 3000, so 3/5 = 0.6.
rn2: partitioned by dept; dept=d1 has 3 rows. Row 2: two rows <= 2000, so 2/3 = 0.6666666666666666.

PERCENT_RANK: (RANK of the current row within the partition - 1) / (rows in the partition - 1). Practical use cases are rare; perhaps some special algorithms need it.

```sql
SELECT dept, userid, sal,
  PERCENT_RANK() OVER(ORDER BY sal) AS rn1,                   -- over all rows
  RANK() OVER(ORDER BY sal) AS rn11,                          -- RANK over all rows
  SUM(1) OVER(PARTITION BY NULL) AS rn12,                     -- total row count
  PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2  -- per department
FROM lxw1234;
```

```text
dept userid sal  rn1  rn11 rn12 rn2
---------------------------------------------------
d1   user1  1000 0.0  1    5    0.0
d1   user2  2000 0.25 2    5    0.5
d1   user3  3000 0.5  3    5    1.0
d2   user4  4000 0.75 4    5    0.0
d2   user5  5000 1.0  5    5    1.0
```

rn1 = (rn11 - 1) / (rn12 - 1): row 1 is (1-1)/(5-1) = 0, row 2 is (2-1)/(5-1) = 0.25, row 4 is (4-1)/(5-1) = 0.75.
rn2: partitioned by dept; dept=d1 has 3 rows: row 1 is (1-1)/(3-1) = 0, row 3 is (3-1)/(3-1) = 1.
Ordering
NTILE,ROW_NUMBER,RANK,DENSE_RANK
Note: the sequence functions do not support the WINDOW clause.

Data preparation:

```text
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7
```

```sql
CREATE EXTERNAL TABLE lxw1234 (
  cookieid string,
  createtime string,  -- day
  pv INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile LOCATION '/tmp/lxw11/';
```

NTILE(n) splits the ordered rows of a partition into n slices and returns the slice number of the current row. NTILE does not support ROWS BETWEEN; e.g. NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is invalid. If the rows do not split evenly, the earlier slices are padded first.

```sql
SELECT cookieid, createtime, pv,
  NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,  -- split each partition into 2 slices
  NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,  -- split each partition into 3 slices
  NTILE(4) OVER(ORDER BY createtime) AS rn3                         -- split all rows into 4 slices
FROM lxw1234
ORDER BY cookieid, createtime;
```

```text
cookieid day        pv rn1 rn2 rn3
-------------------------------------------------
cookie1 2015-04-10 1 1 1 1
cookie1 2015-04-11 5 1 1 1
cookie1 2015-04-12 7 1 1 2
cookie1 2015-04-13 3 1 2 2
cookie1 2015-04-14 2 2 2 3
cookie1 2015-04-15 4 2 3 3
cookie1 2015-04-16 4 2 3 4
cookie2 2015-04-10 2 1 1 1
cookie2 2015-04-11 3 1 1 1
cookie2 2015-04-12 5 1 1 2
cookie2 2015-04-13 6 1 2 2
cookie2 2015-04-14 3 2 2 3
cookie2 2015-04-15 9 2 3 4
cookie2 2015-04-16 7 2 3 4
```

For example, for each cookie, find the top third of days by pv:

```sql
SELECT cookieid, createtime, pv,
  NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM lxw1234;
-- the rows with rn = 1 are the wanted result
```

```text
cookieid day        pv rn
----------------------------------
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 1
cookie1 2015-04-15 4 1
cookie1 2015-04-16 4 2
cookie1 2015-04-13 3 2
cookie1 2015-04-14 2 3
cookie1 2015-04-10 1 3
cookie2 2015-04-15 9 1
cookie2 2015-04-16 7 1
cookie2 2015-04-13 6 1
cookie2 2015-04-12 5 2
cookie2 2015-04-14 3 2
cookie2 2015-04-11 3 3
cookie2 2015-04-10 2 3
```

ROW_NUMBER() generates the sequence number of each row within its partition, starting at 1 in the given order; for example, the rank of each day's pv within a cookie, descending. ROW_NUMBER() has many applications, such as picking the top-ranked record of a group or the first referer of a session.

```sql
SELECT cookieid, createtime, pv,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM lxw1234;
```

```text
cookieid day        pv rn
-------------------------------------------
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 2
cookie1 2015-04-15 4 3
cookie1 2015-04-16 4 4
cookie1 2015-04-13 3 5
cookie1 2015-04-14 2 6
cookie1 2015-04-10 1 7
cookie2 2015-04-15 9 1
cookie2 2015-04-16 7 2
cookie2 2015-04-13 6 3
cookie2 2015-04-12 5 4
cookie2 2015-04-14 3 5
cookie2 2015-04-11 3 6
cookie2 2015-04-10 2 7
```

RANK() generates the rank of each row within its partition, leaving gaps after ties; DENSE_RANK() generates the rank without gaps.

```sql
SELECT cookieid, createtime, pv,
  RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn1,
  DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn2,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM lxw1234
WHERE cookieid = 'cookie1';
```

```text
cookieid day        pv rn1 rn2 rn3
--------------------------------------------------
cookie1 2015-04-12 7 1 1 1
cookie1 2015-04-11 5 2 2 2
cookie1 2015-04-15 4 3 3 3
cookie1 2015-04-16 4 3 3 4
cookie1 2015-04-13 3 5 4 5
cookie1 2015-04-14 2 6 5 6
cookie1 2015-04-10 1 7 6 7
```

rn1: the 15th and 16th tie at rank 3; the 13th is ranked 5.
rn2: the 15th and 16th tie at rank 3; the 13th is ranked 4.
rn3: ROW_NUMBER always yields a unique sequence; when values tie, the relative order among the tied rows is essentially arbitrary.
LAG,LEAD,FIRST_VALUE,LAST_VALUE
Note: these functions do not support the WINDOW clause. Hive version: apache-hive-0.13.1.

Data preparation:

```text
cookie1,2015-04-10 10:00:02,url2
cookie1,2015-04-10 10:00:00,url1
cookie1,2015-04-10 10:03:04,1url3
cookie1,2015-04-10 10:50:05,url6
cookie1,2015-04-10 11:00:00,url7
cookie1,2015-04-10 10:10:00,url4
cookie1,2015-04-10 10:50:01,url5
cookie2,2015-04-10 10:00:02,url22
cookie2,2015-04-10 10:00:00,url11
cookie2,2015-04-10 10:03:04,1url33
cookie2,2015-04-10 10:50:05,url66
cookie2,2015-04-10 11:00:00,url77
cookie2,2015-04-10 10:10:00,url44
cookie2,2015-04-10 10:50:01,url55
```

```sql
CREATE EXTERNAL TABLE lxw1234 (
  cookieid string,
  createtime string,  -- page visit time
  url STRING          -- visited page
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile LOCATION '/tmp/lxw11/';
```

LAG(col, n, DEFAULT) returns the value of col n rows before the current row within the window. The first argument is the column, the second is n (optional, default 1), the third is the default used when the row n rows up does not exist or is NULL (NULL if omitted).

```sql
SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAG(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
  LAG(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM lxw1234;
```

```text
cookieid createtime          url    rn last_1_time         last_2_time
----------------------------------------------------------------------
cookie1 2015-04-10 10:00:00 url1   1 1970-01-01 00:00:00 NULL
cookie1 2015-04-10 10:00:02 url2   2 2015-04-10 10:00:00 NULL
cookie1 2015-04-10 10:03:04 1url3  3 2015-04-10 10:00:02 2015-04-10 10:00:00
cookie1 2015-04-10 10:10:00 url4   4 2015-04-10 10:03:04 2015-04-10 10:00:02
cookie1 2015-04-10 10:50:01 url5   5 2015-04-10 10:10:00 2015-04-10 10:03:04
cookie1 2015-04-10 10:50:05 url6   6 2015-04-10 10:50:01 2015-04-10 10:10:00
cookie1 2015-04-10 11:00:00 url7   7 2015-04-10 10:50:05 2015-04-10 10:50:01
cookie2 2015-04-10 10:00:00 url11  1 1970-01-01 00:00:00 NULL
cookie2 2015-04-10 10:00:02 url22  2 2015-04-10 10:00:00 NULL
cookie2 2015-04-10 10:03:04 1url33 3 2015-04-10 10:00:02 2015-04-10 10:00:00
cookie2 2015-04-10 10:10:00 url44  4 2015-04-10 10:03:04 2015-04-10 10:00:02
cookie2 2015-04-10 10:50:01 url55  5 2015-04-10 10:10:00 2015-04-10 10:03:04
cookie2 2015-04-10 10:50:05 url66  6 2015-04-10 10:50:01 2015-04-10 10:10:00
cookie2 2015-04-10 11:00:00 url77  7 2015-04-10 10:50:05 2015-04-10 10:50:01
```

last_1_time: the value one row up, with default '1970-01-01 00:00:00'. cookie1 row 1 has no previous row, so it takes the default 1970-01-01 00:00:00; row 3 takes row 2's value, 2015-04-10 10:00:02; row 6 takes row 5's value, 2015-04-10 10:50:01.
last_2_time: the value two rows up, no default specified. cookie1 rows 1 and 2 are NULL; row 4 takes row 2's value, 2015-04-10 10:00:02; row 7 takes row 5's value, 2015-04-10 10:50:01.

LEAD is the mirror of LAG: LEAD(col, n, DEFAULT) returns the value n rows after the current row, with the same argument semantics.

```sql
SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LEAD(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
  LEAD(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM lxw1234;
```

```text
cookieid createtime          url    rn next_1_time         next_2_time
----------------------------------------------------------------------
cookie1 2015-04-10 10:00:00 url1   1 2015-04-10 10:00:02 2015-04-10 10:03:04
cookie1 2015-04-10 10:00:02 url2   2 2015-04-10 10:03:04 2015-04-10 10:10:00
cookie1 2015-04-10 10:03:04 1url3  3 2015-04-10 10:10:00 2015-04-10 10:50:01
cookie1 2015-04-10 10:10:00 url4   4 2015-04-10 10:50:01 2015-04-10 10:50:05
cookie1 2015-04-10 10:50:01 url5   5 2015-04-10 10:50:05 2015-04-10 11:00:00
cookie1 2015-04-10 10:50:05 url6   6 2015-04-10 11:00:00 NULL
cookie1 2015-04-10 11:00:00 url7   7 1970-01-01 00:00:00 NULL
cookie2 2015-04-10 10:00:00 url11  1 2015-04-10 10:00:02 2015-04-10 10:03:04
cookie2 2015-04-10 10:00:02 url22  2 2015-04-10 10:03:04 2015-04-10 10:10:00
cookie2 2015-04-10 10:03:04 1url33 3 2015-04-10 10:10:00 2015-04-10 10:50:01
cookie2 2015-04-10 10:10:00 url44  4 2015-04-10 10:50:01 2015-04-10 10:50:05
cookie2 2015-04-10 10:50:01 url55  5 2015-04-10 10:50:05 2015-04-10 11:00:00
cookie2 2015-04-10 10:50:05 url66  6 2015-04-10 11:00:00 NULL
cookie2 2015-04-10 11:00:00 url77  7 1970-01-01 00:00:00 NULL
```

The logic is identical to LAG, only looking forward instead of backward.

FIRST_VALUE: after ordering within the partition, the first value up to the current row.

```sql
SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM lxw1234;
```

```text
cookieid createtime          url    rn first1
---------------------------------------------------------
cookie1 2015-04-10 10:00:00 url1   1 url1
cookie1 2015-04-10 10:00:02 url2   2 url1
cookie1 2015-04-10 10:03:04 1url3  3 url1
cookie1 2015-04-10 10:10:00 url4   4 url1
cookie1 2015-04-10 10:50:01 url5   5 url1
cookie1 2015-04-10 10:50:05 url6   6 url1
cookie1 2015-04-10 11:00:00 url7   7 url1
cookie2 2015-04-10 10:00:00 url11  1 url11
cookie2 2015-04-10 10:00:02 url22  2 url11
cookie2 2015-04-10 10:03:04 1url33 3 url11
cookie2 2015-04-10 10:10:00 url44  4 url11
cookie2 2015-04-10 10:50:01 url55  5 url11
cookie2 2015-04-10 10:50:05 url66  6 url11
cookie2 2015-04-10 11:00:00 url77  7 url11
```

LAST_VALUE: after ordering within the partition, the last value up to the current row (with the default frame this is the current row itself).

```sql
SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM lxw1234;
```

```text
cookieid createtime          url    rn last1
-----------------------------------------------------------------
cookie1 2015-04-10 10:00:00 url1   1 url1
cookie1 2015-04-10 10:00:02 url2   2 url2
cookie1 2015-04-10 10:03:04 1url3  3 1url3
cookie1 2015-04-10 10:10:00 url4   4 url4
cookie1 2015-04-10 10:50:01 url5   5 url5
cookie1 2015-04-10 10:50:05 url6   6 url6
cookie1 2015-04-10 11:00:00 url7   7 url7
cookie2 2015-04-10 10:00:00 url11  1 url11
cookie2 2015-04-10 10:00:02 url22  2 url22
cookie2 2015-04-10 10:03:04 1url33 3 1url33
cookie2 2015-04-10 10:10:00 url44  4 url44
cookie2 2015-04-10 10:50:01 url55  5 url55
cookie2 2015-04-10 10:50:05 url66  6 url66
cookie2 2015-04-10 11:00:00 url77  7 url77
```

If ORDER BY is omitted, rows are ordered by their offset in the underlying file, which yields wrong results:

```sql
SELECT cookieid, createtime, url,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM lxw1234;
```

```text
cookieid createtime          url    first2
----------------------------------------------
cookie1 2015-04-10 10:00:02 url2   url2
cookie1 2015-04-10 10:00:00 url1   url2
cookie1 2015-04-10 10:03:04 1url3  url2
cookie1 2015-04-10 10:50:05 url6   url2
cookie1 2015-04-10 11:00:00 url7   url2
cookie1 2015-04-10 10:10:00 url4   url2
cookie1 2015-04-10 10:50:01 url5   url2
cookie2 2015-04-10 10:00:02 url22  url22
cookie2 2015-04-10 10:00:00 url11  url22
cookie2 2015-04-10 10:03:04 1url33 url22
cookie2 2015-04-10 10:50:05 url66  url22
cookie2 2015-04-10 11:00:00 url77  url22
cookie2 2015-04-10 10:10:00 url44  url22
cookie2 2015-04-10 10:50:01 url55  url22
```

```sql
SELECT cookieid, createtime, url,
  LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM lxw1234;
```

```text
cookieid createtime          url    last2
----------------------------------------------
cookie1 2015-04-10 10:00:02 url2   url5
cookie1 2015-04-10 10:00:00 url1   url5
cookie1 2015-04-10 10:03:04 1url3  url5
cookie1 2015-04-10 10:50:05 url6   url5
cookie1 2015-04-10 11:00:00 url7   url5
cookie1 2015-04-10 10:10:00 url4   url5
cookie1 2015-04-10 10:50:01 url5   url5
cookie2 2015-04-10 10:00:02 url22  url55
cookie2 2015-04-10 10:00:00 url11  url55
cookie2 2015-04-10 10:03:04 1url33 url55
cookie2 2015-04-10 10:50:05 url66  url55
cookie2 2015-04-10 11:00:00 url77  url55
cookie2 2015-04-10 10:10:00 url44  url55
cookie2 2015-04-10 10:50:01 url55  url55
```

To get the last value of the whole ordered partition, use a workaround: FIRST_VALUE over the reversed order.

```sql
SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM lxw1234
ORDER BY cookieid, createtime;
```

```text
cookieid createtime          url    rn last1  last2
-------------------------------------------------------------
cookie1 2015-04-10 10:00:00 url1   1 url1   url7
cookie1 2015-04-10 10:00:02 url2   2 url2   url7
cookie1 2015-04-10 10:03:04 1url3  3 1url3  url7
cookie1 2015-04-10 10:10:00 url4   4 url4   url7
cookie1 2015-04-10 10:50:01 url5   5 url5   url7
cookie1 2015-04-10 10:50:05 url6   6 url6   url7
cookie1 2015-04-10 11:00:00 url7   7 url7   url7
cookie2 2015-04-10 10:00:00 url11  1 url11  url77
cookie2 2015-04-10 10:00:02 url22  2 url22  url77
cookie2 2015-04-10 10:03:04 1url33 3 1url33 url77
cookie2 2015-04-10 10:10:00 url44  4 url44  url77
cookie2 2015-04-10 10:50:01 url55  5 url55  url77
cookie2 2015-04-10 10:50:05 url66  6 url66  url77
cookie2 2015-04-10 11:00:00 url77  7 url77  url77
```

Tip: with analytic functions, pay special attention to the ORDER BY clause; used incorrectly, the statistics will not be what you expect.
Problems
Sorting: fast top-N
sort by can run with multiple reducers, each performing a local sort, which is all that sort by ... limit N needs. The execution plan shows that sort by limit N launches two jobs: the first sorts locally inside each reducer and takes the top N there; assuming M reducers, the second job globally sorts the M * N locally sorted rows and takes the final top N. This greatly speeds up SELECT TOP N.
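A minimal sketch of the pattern described above (hypothetical table logs(score)):

```sql
-- Each reducer sorts its share locally and emits its own top 10;
-- a second job merges the M * 10 candidates into the global top 10.
SELECT * FROM logs SORT BY score DESC LIMIT 10;
```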
Sorting: quickly finding skewed keys
```sql
set mapred.reduce.tasks = 5;
SELECT idA, count(*) AS num
FROM tableA
GROUP BY idA
DISTRIBUTE BY idA
SORT BY num DESC
LIMIT 10;
```
Sorting: global order without funneling all the work through one reducer

```sql
SELECT *
FROM (
  SELECT name, Company
  FROM haha
  DISTRIBUTE BY Company SORT BY Company
) t
ORDER BY Company;
```

This is only a partial optimization: the final ORDER BY still runs on a single reducer, but that reducer now merges partitions that are already sorted.
Aggregation: count(distinct) without a single reducer

Single field:

```sql
SELECT COUNT(1) AS sex_uni_cnt
FROM (SELECT sex FROM employee GROUP BY sex) a;
```

Multiple fields (COALESCE removes the effect of NULLs, which would otherwise drop rows from the distinct count):

```sql
SELECT COUNT(1) AS uni_cnt
FROM (
  SELECT COALESCE(sex, '0') AS sex, COALESCE(id, '0') AS id
  FROM employee
  GROUP BY COALESCE(sex, '0'), COALESCE(id, '0')
) a;
```
Aggregation: ordered event sequences from logs
```sql
WITH table1 AS (
  SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
  SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
  SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
  SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
  SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
-- Collect it
SELECT subquery.ID,
       collect_list(subquery.order_num) AS events_list
FROM (
  SELECT table1.ID, table1.event, table1.order_num
  FROM table1
  DISTRIBUTE BY table1.ID
  SORT BY table1.ID, table1.order_num
) subquery
GROUP BY subquery.ID;
```

Another option is to sort the collected array itself with sort_array.
Aggregation: rows to columns, two ways

Requirement 1: multiple rows to multiple columns; the source is table 1.

Table 1:
```text
name subject score
jack english 70
jack math    80
jack chinese 90
tim  english 10
tim  math    20
tim  chinese 30
```

Result:
```text
name english math chinese
jack 70      80   90
tim  10      20   30
```

group by + max + case when: this cleverly exploits the ordering between the numeric score and the 0 default.

```sql
SELECT name,
  MAX(CASE subject WHEN 'english' THEN score ELSE 0 END) AS english,
  MAX(CASE subject WHEN 'math' THEN score ELSE 0 END) AS math,
  MAX(CASE subject WHEN 'chinese' THEN score ELSE 0 END) AS chinese
FROM school_final_test
GROUP BY name;
```

Question: does multi-row-to-multi-column still work when the pivoted value is a string rather than a number?

```sql
SELECT MAX(str)
FROM (SELECT 'str' AS str UNION ALL SELECT 'sts' AS str UNION ALL SELECT NULL AS str) t1;
-- result: sts
```

Clearly yes: with strings, use NULL as the default value and MAX still works.

Requirement 2: multiple rows to a single column; the source is table 1.

Result:
```text
name scores
jack english:70,math:80,chinese:90
tim  english:10,math:20,chinese:30
```

group by + collect_list + concat_ws. Mind the types the functions require (hence the cast):

```sql
SELECT name,
  concat_ws(',', collect_list(concat_ws(':', subject, CAST(score AS string)))) AS scores
FROM school_final_test
GROUP BY name;
```
Aggregation: ordered collect_list
Same query as under "Aggregation: ordered event sequences from logs" above: DISTRIBUTE BY + SORT BY in a subquery before collect_list guarantees the order. Another option is to sort the collected array with sort_array.
Decomposition: columns to rows, two ways

Requirement 3: multiple columns to multiple rows; the source is table 2.

Table 2:
```text
name english math chinese
jack 70      80   90
tim  10      20   30
```

Result: the table 1 layout (name, subject, score).

Extract each needed column separately and combine the pieces. Note: the difference between union and union all is deduplication.

```sql
SELECT name, 'english' AS subject, english AS score FROM school_final_test1
UNION ALL
SELECT name, 'math' AS subject, math AS score FROM school_final_test1
UNION ALL
SELECT name, 'chinese' AS subject, chinese AS score FROM school_final_test1;
```

Requirement 4: a single column to multiple rows.

Source:
```text
name scores
jack english:70,math:80,chinese:90
tim  english:10,math:20,chinese:30
```

Result:
```text
name scores
jack english:70
jack math:80
jack chinese:90
tim  english:10
tim  math:20
tim  chinese:30
```

split + explode. Lateral View is the connective for table-generating user-defined functions (UDTFs), and several can be chained. If the exploded column contains NULLs that must still show up, add OUTER after LATERAL VIEW, analogous to a left outer join. Here table1 is the alias of the generated table; it must be written even if unused (and when unused, make sure no column names collide).

```sql
SELECT name, table1.scores AS scores
FROM school_final_test2
LATERAL VIEW explode(split(scores, ',')) table1 AS scores;
```

To convert this result back into the table 1 layout, split the scores field again on ':' and pick the parts by array index (array[index]):

```sql
SELECT name,
       split(scores, ':')[0] AS subject,
       split(scores, ':')[1] AS score
FROM (
  SELECT name, table1.scores AS scores
  FROM school_final_test2
  LATERAL VIEW explode(split(scores, ',')) table1 AS scores
) t1;
```
Decomposition: string to map

Requirement 5: split a string like "key1=value1&key2=value2...keyn=valuen" into a single map-typed field so that any keyx can be looked up to fetch its valuex. The core idea is to pack the string into a map structure with the built-in str_to_map; anything more complex can go into a custom UDF. str_to_map(text, delimiter1, delimiter2) creates a map by parsing text into key-value pairs using two delimiters: the first separates pairs, the second separates key from value. If only one argument is given, the defaults are ',' for delimiter1 and '=' for delimiter2.

```sql
SELECT result['key1']
FROM (
  SELECT str_to_map('key1=value1&key2=value2&keyn=valuen', '&', '=') AS result
) t1;
```
Decomposition: generating 1000 rows without any existing table

The trick is to explode a synthetic array. Note that split(',,', ',') has only 3 elements, so two chained lateral views over it yield 3 * 3 = 9 rows, not 1000; the array needs enough elements, e.g. split(space(999), ' ') has exactly 1000:

```sql
SELECT genid  -- 0 .. 999, one row each
FROM (SELECT split(space(999), ' ') AS arr) t
LATERAL VIEW posexplode(arr) pe AS genid, str1;
```
Decomposition: index positions of a keyword within a string

```sql
SELECT id, value,
       concat_ws(',', collect_list(single_value_index)) AS indices
FROM (
  SELECT id, value, single_value,
         CAST(single_value_index + 1 AS string) AS single_value_index  -- 1-based positions
  FROM tempTable
  LATERAL VIEW posexplode(split(value, '')) t AS single_value_index, single_value
  -- to keep only the positions of a specific keyword character,
  -- add: WHERE single_value = '<keyword>'
) exploded
GROUP BY id, value;
```
Decomposition: zipping two same-size arrays

```sql
SELECT *
FROM (SELECT ',,' AS str) t
LATERAL VIEW posexplode(split(str, ',')) t1 AS genid1, str1
LATERAL VIEW posexplode(split(str, ',')) t2 AS genid2, str2
WHERE genid1 = genid2;  -- keep only the element pairs at the same index
```
Filtering: predicate pushdown ambiguity

Problem: a pitfall, the predicate pushdown logic is ambiguous.

Setup:

```sql
CREATE TABLE mdw_tmp_dev.test_atable AS
SELECT 1 AS id, "lijie1" AS name, 100 AS grade, 10 AS dept UNION ALL
SELECT 2 AS id, "lijie2" AS name, 90 AS grade, 20 AS dept UNION ALL
SELECT 3 AS id, "lijie3" AS name, 80 AS grade, 10 AS dept UNION ALL
SELECT 4 AS id, "lijie4" AS name, 60 AS grade, 10 AS dept UNION ALL
SELECT 5 AS id, "lijie5" AS name, 50 AS grade, 20 AS dept;

CREATE TABLE mdw_tmp_dev.test_btable AS
SELECT 10 AS id, "IT1" AS name UNION ALL
SELECT 20 AS id, "IT2" AS name;
```

The predicate is pushed down here:

```sql
SELECT A.id, A.name, A.grade, A.dept, B.id, B.name
FROM mdw_tmp_dev.test_atable A
LEFT OUTER JOIN mdw_tmp_dev.test_btable B ON A.dept = B.id
WHERE A.grade >= 80 AND B.name IS NULL;
```

Result: no rows.

The predicate is not pushed down here:

```sql
SELECT A.id, A.name, A.grade, A.dept, B.id, B.name
FROM mdw_tmp_dev.test_atable A
LEFT OUTER JOIN mdw_tmp_dev.test_btable B
  ON A.dept = B.id AND A.grade >= 80  -- A.grade >= 80 does not filter A's rows
WHERE B.name IS NULL;
```

Result:

```text
+-------+---------+----------+---------+-------+---------+
| a.id  | a.name  | a.grade  | a.dept  | b.id  | b.name  |
+-------+---------+----------+---------+-------+---------+
| 4     | lijie4  | 60       | 10      | NULL  | NULL    |
| 5     | lijie5  | 50       | 20      | NULL  | NULL    |
+-------+---------+----------+---------+-------+---------+
```

Rules of thumb:
- LEFT JOIN, filter in ON: on MR and Spark, a filter on the left table is not pushed down; a filter on the right table is pushed down.
- LEFT JOIN, filter in WHERE joined with AND, on the join key: pushed down on both MR and Spark.
- LEFT JOIN, filter in WHERE joined with AND, not on the join key: left-table filters are always pushed down; right-table filters are not pushed down on MR but are pushed down on Spark.
- LEFT JOIN, filter in WHERE joined with OR: not pushed down on either MR or Spark.
- RIGHT JOIN: mirror the above.
- (INNER) JOIN, filter in WHERE without an OR operator: pushed down on MR and Spark.
- (INNER) JOIN, filter in ON: pushed down on MR and Spark.
- (INNER) JOIN, filter in WHERE with an OR operator, filter field equal to the join key: pushed down on Spark, not on MR.
- (INNER) JOIN, filter in WHERE with an OR operator, filter field different from the join key: not pushed down on either Spark or MR.
- FULL OUTER JOIN: no filter expression on any table can be pushed down.

Sources: https://juejin.im/post/6844903976366768142 (author: CCCCCold丶kl), https://zhuanlan.zhihu.com/p/78266517
Is there a difference between ON and WHERE in SQL?
When a database joins two or more tables to return records, it first builds an intermediate temporary table. Taking LEFT JOIN as the example, ON and WHERE differ as follows: the ON condition is applied while the temporary table is generated, and whether or not it holds, rows from the left table are still returned; the WHERE condition filters the temporary table after it has been built, at which point the "must return left-table rows" semantics of LEFT JOIN no longer applies, and every row failing the condition is filtered out.
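A minimal sketch of the difference (hypothetical tables a(id) and b(id)):

```sql
-- ON: b rows failing b.id > 1 become NULL, but every a row survives.
SELECT a.id, b.id FROM a LEFT JOIN b ON a.id = b.id AND b.id > 1;

-- WHERE: filtering happens after the join; a rows whose matched
-- b.id is NULL or fails the predicate are dropped too.
SELECT a.id, b.id FROM a LEFT JOIN b ON a.id = b.id WHERE b.id > 1;
```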
Joins: day-level cumulative UV (fast rolling 3-day UV)
Model to optimize. Table fields: [login date, user ID, daily PV, daily UV, historical cumulative total].
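A hedged sketch of one way to get a rolling 3-day UV in a single pass (table login_log(dt, uid) is hypothetical; the non-equi join condition needs Hive 2.2+, or a cross join plus WHERE on older versions):

```sql
-- Each (dt, uid) login contributes to its own day and the two days after;
-- counting distinct uids per target day yields the rolling 3-day UV.
SELECT d.target_dt,
       COUNT(DISTINCT l.uid) AS uv_3d
FROM (SELECT DISTINCT dt AS target_dt FROM login_log) d
JOIN login_log l
  ON l.dt >= date_sub(d.target_dt, 2) AND l.dt <= d.target_dt
GROUP BY d.target_dt;
```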
Joins: merging an incremental feed into the full snapshot
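The node gives no SQL; a hedged sketch of the common union + row_number pattern, keeping the newest record per key (table and field names are hypothetical):

```sql
INSERT OVERWRITE TABLE dw.full_snapshot
SELECT id, attr, update_time
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
  FROM (
    SELECT id, attr, update_time FROM dw.full_snapshot   -- yesterday's full data
    UNION ALL
    SELECT id, attr, update_time FROM ods.daily_increment -- today's increment
  ) t
) ranked
WHERE rn = 1;  -- newest version of each key wins
```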
Join + derivation: maintaining a zipper (history chain) table

```sql
INSERT OVERWRITE TABLE dw.account_hist
SELECT * FROM (
  -- close out open-ended history records whose key appears in today's updates
  SELECT a.id,
         a.balance,
         a.status,
         a.start_date,
         CASE WHEN a.end_date = '9999-99-99' AND b.id IS NOT NULL
              THEN '${YESTERDAY}'
              ELSE a.end_date END AS end_date
  FROM dw.account_hist a
  LEFT OUTER JOIN ods.account_update b ON b.id = a.id
  UNION ALL
  -- append today's updates as new open-ended records
  SELECT a.id,
         a.balance,
         a.status,
         '${TODAY}' AS start_date,
         '9999-99-99' AS end_date
  FROM ods.account_update a
) t;
```
Derivation: cumulative values, with and without window functions

Problem: compute running totals.

```text
hive> select * from record;
OK
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
B 2015-02 10
B 2015-02 5
A 2015-03 16
A 2015-03 22
B 2015-03 23
B 2015-03 10
B 2015-03 11
```

Desired result (user, month, visits this month, cumulative visits to date, max monthly visits to date):

```text
A 2015-01 33 33 33
A 2015-02 10 43 33
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 15 45 30
B 2015-03 44 89 44
```

With window functions (record_2 is record pre-aggregated to one row per user and month):

```sql
SELECT userid, month, count,
  SUM(count) OVER(PARTITION BY userid ORDER BY month) AS sum_count,
  MAX(count) OVER(PARTITION BY userid ORDER BY month) AS max_count
FROM record_2;
```

Without window functions (the principle is a fan-out self-join, then filtering):

```sql
SELECT t1.userid, t1.month, t1.count,
       SUM(t2.count) AS sum_count,
       MAX(t2.count) AS max_count
FROM record_2 t1
INNER JOIN record_2 t2 ON t1.userid = t2.userid
WHERE t1.month >= t2.month
GROUP BY t1.userid, t1.month, t1.count
ORDER BY t1.userid, t1.month;
```
Derivation: consecutive-login variants with gaps

Problem: with window functions, compute per user the total amount over consecutive trading days, the streak length, the streak's start and end dates, and the gap in days.

```sql
CREATE TABLE deal_tb (id int, daystr varchar(10), amount decimal(10,2));

INSERT INTO deal_tb VALUES
(1, "2019-02-08", 6214.23), (1, "2019-02-08", 6247.32), (1, "2019-02-09", 85.63),
(1, "2019-02-09", 967.36), (1, "2019-02-10", 85.69), (1, "2019-02-12", 769.85),
(1, "2019-02-13", 943.86), (1, "2019-02-14", 538.42), (1, "2019-02-15", 369.76),
(1, "2019-02-16", 369.76), (1, "2019-02-18", 795.15), (1, "2019-02-19", 715.65),
(1, "2019-02-21", 537.71),
(2, "2019-02-08", 6214.23), (2, "2019-02-08", 6247.32), (2, "2019-02-09", 85.63),
(2, "2019-02-09", 967.36), (2, "2019-02-10", 85.69), (2, "2019-02-12", 769.85),
(2, "2019-02-13", 943.86), (2, "2019-02-14", 943.18), (2, "2019-02-15", 369.76),
(2, "2019-02-18", 795.15), (2, "2019-02-19", 715.65), (2, "2019-02-21", 537.71),
(3, "2019-02-08", 6214.23), (3, "2019-02-08", 6247.32), (3, "2019-02-09", 85.63),
(3, "2019-02-09", 967.36), (3, "2019-02-10", 85.69), (3, "2019-02-12", 769.85),
(3, "2019-02-13", 943.86), (3, "2019-02-14", 276.81), (3, "2019-02-15", 369.76),
(3, "2019-02-16", 369.76), (3, "2019-02-18", 795.15), (3, "2019-02-19", 715.65),
(3, "2019-02-21", 537.71);
```

Expected result:

```text
id sum_amount start_date end_date   continuous_day interval_day
1  13600.23   2019-02-08 2019-02-10 3              0
1  2991.65    2019-02-12 2019-02-16 5              1
1  -100.00    2019-02-18 2019-02-19 2              1
1  -100.00    2019-02-21 2019-02-21 1              1
2  13600.23   2019-02-08 2019-02-10 3              0
2  3026.65    2019-02-12 2019-02-15 4              1
2  -100.00    2019-02-18 2019-02-19 2              2
2  -100.00    2019-02-21 2019-02-21 1              1
3  13600.23   2019-02-08 2019-02-10 3              0
3  2730.04    2019-02-12 2019-02-16 5              1
3  -100.00    2019-02-18 2019-02-19 2              1
3  -100.00    2019-02-21 2019-02-21 1              1
```

```sql
SELECT id,
       CASE WHEN count(1) >= 3 THEN sum(amount) ELSE -100 END AS sum_amount,  -- total for streaks of 3+ days; -100 otherwise
       MIN(daystr) AS start_date,   -- streak start
       MAX(daystr) AS end_date,     -- streak end
       COUNT(1) AS continuous_day,  -- streak length in days
       datediff(from_day, LAG(from_day, 1, from_day) OVER(PARTITION BY id ORDER BY from_day)) AS interval_day  -- days without trading since the previous streak
FROM (
  -- subtract the per-user row number from the date: rows with the same
  -- result belong to the same consecutive streak
  SELECT id, daystr, amount,
         date_sub(daystr, ROW_NUMBER() OVER(PARTITION BY id ORDER BY daystr)) AS from_day
  FROM (
    -- aggregate per user and day so each user-day is a single row
    SELECT id, daystr, SUM(amount) AS amount
    FROM deal_tb
    GROUP BY id, daystr
  ) a
) b
GROUP BY id, from_day;
```
Derivation: multiple ways to compute user retention (count/rate), active users [multi-day dedup] (count/rate), silent users (count/rate), churned users, and resurrected users (count/rate)

Method 1 (self-join on uid, MySQL-style date_format):

```sql
SELECT day1,
  COUNT(DISTINCT a.uid) AS dau,  -- active users that day
  COUNT(DISTINCT CASE WHEN day2 - day1 = 1 THEN a.uid END) AS d1_retained,
  COUNT(DISTINCT CASE WHEN day2 - day1 = 3 THEN a.uid END) AS d3_retained,
  COUNT(DISTINCT CASE WHEN day2 - day1 = 7 THEN a.uid END) AS d7_retained,
  CONCAT(COUNT(DISTINCT CASE WHEN day2 - day1 = 1 THEN a.uid END) / COUNT(DISTINCT a.uid) * 100, '%') AS d1_retention_rate,
  CONCAT(COUNT(DISTINCT CASE WHEN day2 - day1 = 3 THEN a.uid END) / COUNT(DISTINCT a.uid) * 100, '%') AS d3_retention_rate,
  CONCAT(COUNT(DISTINCT CASE WHEN day2 - day1 = 7 THEN a.uid END) / COUNT(DISTINCT a.uid) * 100, '%') AS d7_retention_rate
FROM (SELECT uid, date_format(dayno, '%Y%m%d') AS day1 FROM aui) a
LEFT JOIN (SELECT uid, date_format(dayno, '%Y%m%d') AS day2 FROM aui) b
  ON a.uid = b.uid
GROUP BY day1;
```

Method 2: next-day, 3-, 7-, 14-, 30- and 90-day retention of newly added login devices:

```sql
SELECT log_day                                AS dt,
       COUNT(user_id_d0)                      AS new_devices,
       COUNT(user_id_d1)  / COUNT(user_id_d0) AS d1_retention,
       COUNT(user_id_d3)  / COUNT(user_id_d0) AS d3_retention,
       COUNT(user_id_d7)  / COUNT(user_id_d0) AS d7_retention,
       COUNT(user_id_d14) / COUNT(user_id_d0) AS d14_retention,
       COUNT(user_id_d30) / COUNT(user_id_d0) AS d30_retention,
       COUNT(user_id_d90) / COUNT(user_id_d0) AS d90_retention
FROM (
  SELECT DISTINCT log_day,
         a.user_id_d0,
         b.device_id AS user_id_d1,
         c.device_id AS user_id_d3,
         d.device_id AS user_id_d7,
         e.device_id AS user_id_d14,
         f.device_id AS user_id_d30,
         g.device_id AS user_id_d90
  FROM (
    SELECT DISTINCT DATE(event_time) AS log_day, device_id AS user_id_d0
    FROM role_login_back
    GROUP BY device_id
    ORDER BY log_day
  ) a
  LEFT JOIN role_login_back b ON DATEDIFF(DATE(b.event_time), a.log_day) = 1  AND a.user_id_d0 = b.device_id
  LEFT JOIN role_login_back c ON DATEDIFF(DATE(c.event_time), a.log_day) = 2  AND a.user_id_d0 = c.device_id
  LEFT JOIN role_login_back d ON DATEDIFF(DATE(d.event_time), a.log_day) = 6  AND a.user_id_d0 = d.device_id
  LEFT JOIN role_login_back e ON DATEDIFF(DATE(e.event_time), a.log_day) = 13 AND a.user_id_d0 = e.device_id
  LEFT JOIN role_login_back f ON DATEDIFF(DATE(f.event_time), a.log_day) = 29 AND a.user_id_d0 = f.device_id
  LEFT JOIN role_login_back g ON DATEDIFF(DATE(g.event_time), a.log_day) = 89 AND a.user_id_d0 = g.device_id
) AS temp
GROUP BY log_day;
```

Method 3: combine with a UDF.
Derivation: a click is valid when the gap to the next one exceeds 5 minutes; find the valid clicks

Fields: pass_uid, event_time.

```sql
SELECT pass_uid,
       CASE WHEN duration IS NULL THEN 1    -- the user's last click: no follow-up, counts as valid
            WHEN duration > 300000 THEN 1   -- more than 5 minutes (in ms) until the next click
            ELSE 0 END AS if_jifei
FROM (
  SELECT pass_uid, event_time,
         LEAD(event_time, 1) OVER(PARTITION BY pass_uid ORDER BY event_time ASC) - event_time AS duration
  FROM t
) tt;
-- LEAD looks forward (at the following row); LAG looks backward.
```
Derivation: removing aggregation skew with randomized keys

```sql
-- Plain version: the hot key 'key001' funnels into one reducer
SELECT key, COUNT(1) AS cnt
FROM tb_name
GROUP BY key;

-- Improved version: salt the hot key so its rows spread across reducers,
-- then sum the partial counts per original key
SELECT t.key, SUM(cnt) AS cnt
FROM (
  SELECT key,
         IF(key = 'key001', FLOOR(RAND() * 10), 0) AS salt,
         COUNT(1) AS cnt
  FROM tb_name
  GROUP BY key, IF(key = 'key001', FLOOR(RAND() * 10), 0)
) t
GROUP BY t.key;
```
Derivation: handling data skew with a Cartesian-style expansion
join + union all to handle data skew: process the skewed keys and the normal keys separately, then stitch the results together (see the sketch below).
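A hedged sketch of the split-and-union pattern (the hot key 'key001' and table names are hypothetical):

```sql
-- Join the normal keys as usual; handle the single hot key separately
-- (its right-side slice is small enough to map-join), then union the results.
SELECT a.key, a.val, b.info
FROM a JOIN b ON a.key = b.key
WHERE a.key <> 'key001'
UNION ALL
SELECT /*+ MAPJOIN(b) */ a.key, a.val, b.info
FROM a JOIN b ON a.key = b.key
WHERE a.key = 'key001';
```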
Join / union / window: computing mutual follows
Table table_relation with columns from_user, to_user.
Method 1:

```sql
SELECT a.from_user, a.to_user,
       IF(b.from_user IS NOT NULL, 1, 0) AS is_friend
FROM table_relation a
LEFT JOIN table_relation b
  ON a.from_user = b.to_user AND a.to_user = b.from_user;
```
Method 2:

```sql
SELECT from_user, to_user, COUNT(1)
FROM (
  SELECT from_user, to_user FROM table_relation
  UNION ALL
  SELECT to_user, from_user FROM table_relation
) t
GROUP BY from_user, to_user
HAVING COUNT(1) > 1;
```
Method 3:

```sql
SELECT a.from_user, a.to_user,
       IF(SUM(1) OVER (PARTITION BY feature) > 1, 1, 0) AS is_friend
FROM (
  SELECT from_user, to_user,
         IF(from_user > to_user,
            CONCAT(to_user, from_user),
            CONCAT(from_user, to_user)) AS feature  -- order-insensitive pair key
  FROM table_relation
) a;
```
How do you resolve PV/UV data skew, and what are the corresponding keys, values, and partitions? (A sketch follows.)
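The node poses the question without an answer; a hedged two-stage sketch for skewed UV (table log(dt, uid) is hypothetical). The intermediate key becomes (dt, hashed uid bucket), the values are the uids, and the second stage merges buckets per dt; summing bucket-level distinct counts is safe because the hash buckets partition the uids disjointly:

```sql
SELECT dt, SUM(part_uv) AS uv
FROM (
  -- Stage 1: distinct uids inside each salt bucket,
  -- spreading one hot date over many reducers
  SELECT dt,
         pmod(hash(uid), 10) AS bucket,
         COUNT(DISTINCT uid) AS part_uv
  FROM log
  GROUP BY dt, pmod(hash(uid), 10)
) t
-- Stage 2: add up the disjoint bucket counts per date
GROUP BY dt;
```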
Peak concurrent viewers of a live stream
```sql
SELECT MAX(max_index)
FROM (
  SELECT SUM(index) OVER(ORDER BY `timestamp`) AS max_index  -- running sum from the first row to the current row
  FROM (
    -- spread login and logout times into +1 / -1 events
    SELECT order_id, unix_timestamp(login_time) AS `timestamp`, 1 AS index
    FROM connection_detail
    WHERE dt = '20190101' AND is_td_finish = 1
    UNION ALL
    SELECT order_id, unix_timestamp(logout_time) AS `timestamp`, -1 AS index
    FROM connection_detail
    WHERE dt = '20190101'
  ) a
) b;
```

Source: https://blog.csdn.net/adrian_wang/article/details/89840671 (author: XinyuWg)
Array search (funnel over an event log)

Events: A (register), B (login), C (browse). Table name: T; columns: user_id, event (A, B, C). Sample:

```text
user_id event
1111    B
1111    A
1111    A
1111    C
1112    A
1113    D
```

Compute: the number of users who visited page A; who visited both A and B; who visited A, B and C. (A sketch follows.)
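A hedged sketch using per-user conditional aggregation (one of several possible approaches):

```sql
SELECT COUNT(IF(has_a = 1, 1, NULL))                             AS visited_a,
       COUNT(IF(has_a = 1 AND has_b = 1, 1, NULL))               AS visited_a_b,
       COUNT(IF(has_a = 1 AND has_b = 1 AND has_c = 1, 1, NULL)) AS visited_a_b_c
FROM (
  -- collapse the event log to one row per user with visit flags
  SELECT user_id,
         MAX(IF(event = 'A', 1, 0)) AS has_a,
         MAX(IF(event = 'B', 1, 0)) AS has_b,
         MAX(IF(event = 'C', 1, 0)) AS has_c
  FROM T
  GROUP BY user_id
) u;
```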
Random-key salting to resolve join skew
Salt the left table's join key with rand 0-9; expand the right table tenfold with union all, one copy per salt value (see the sketch below).
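A hedged sketch of the salting trick (tables big_t and small_t are hypothetical; exploding an array 0-9 stands in for the ten UNION ALL branches). The left side appends a random 0-9 suffix to its key; the right side is duplicated once per suffix, so every salted key still finds its match:

```sql
SELECT a.key, a.val, b.info
FROM (
  SELECT key, val,
         CONCAT(key, '_', CAST(FLOOR(RAND() * 10) AS int)) AS salted_key
  FROM big_t
) a
JOIN (
  SELECT CONCAT(key, '_', n) AS salted_key, info
  FROM small_t
  LATERAL VIEW explode(array(0,1,2,3,4,5,6,7,8,9)) t AS n  -- ten copies, one per salt
) b
ON a.salted_key = b.salted_key;
```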
Parsing a JSON array

```sql
SELECT id,
       get_json_object(input_path_json_col, '$.path') AS input_path
FROM (
  SELECT id, input_path_json_col
  FROM (
    SELECT id,
           -- strip the outer [ ], turn the },{ separators into }||{,
           -- then split on || so each array element becomes one JSON string
           split(
             regexp_replace(
               regexp_extract(input_path, '^\\[(.+)\\]$', 1),
               '\\}\\,\\{',
               '\\}\\|\\|\\{'),
             '\\|\\|') AS input_path_json
    FROM ods.o_t_job_spark_bigdata_test
  ) tmp
  LATERAL VIEW explode(tmp.input_path_json) spark_job AS input_path_json_col
) exp;
```
Metrics
AARRR model + the people/goods/place framework + comparative analysis (year-over-year, month-over-month, competitors)
A: Acquisition
New registered users (count/rate)
Activation rate
Registration conversion rate
New-user page dwell time
Attribution, e.g. from call-out reach or SMS reach
Q: With many channels, how do you attribute? A: Weight by time, crediting the link the user came through.
A: Activation (engagement)
DAU/MAU (active vs. silent) (by minute, hour, day, week, month)
Average daily usage time
App session duration
App launch count
Consecutive logins, non-consecutive logins
Interaction time on each page
R: Retention
Retention rate
Churn rate
Win-back (recall) rate
Churn attribution
R: Revenue
Lifetime value (LTV)
Loan attribution
Profit per customer
Average order value
Customer acquisition cost (CAC)
GMV
Marketing campaign ROI [return on investment, i.e. profit / total investment; ROI can also mean region of interest in other contexts]
R: Referral
Invitation rate
Virality coefficient (K-factor), etc.
Member-get-member growth