dub老师给滴题目准备

2022-11-30 约 2165 字预计阅读 5 分钟

头条大佬 dubhe 老师给的题目准备，好好刷一下。

SQL 类型

成绩排名

学生成绩表 student,course,score，记录每个学生每门课程（共语文、数学、英语 3 门课）的成绩，请用 SQL 找出每门课程成绩排名前 5 的学生。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


select
    a.course,
    a.student
from 
(
    select 
        scoure,
        student,
        row_number() over (partition by course order by socre desc) rn
    from tab1
) as a 
where a.rn <= 5;

平均分和总分最高

table1:
student_id,class

table2:
student_id,course,score

求各个班级，每一门课程的平均分
求各个班级，总分最高的学生

1
2
3
4
5
6
7
8


select
    a.class,
    b.course,
    avg(b.score) as score_avg
from table1 as a 
left join table2 as b 
    on a.student_id = b.student_id
group by 1,2

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


select
    a.class,
    a.student_id,
    a.score_sum
from 
(
    select
        a.class,
        a.student_id,
        a.score_sum,
        row_number() over (partition by a.class order by a.score_sum desc) as rn
    from 
    (
        select
            a.class,
            a.student_id,
            sum(b.score) as score_sum
        from table1 as a 
        left join table2 as b 
            on a.student_id = b.student_id
        group by 1,2
    ) as a 
) as a 
where a.rn = 1

ip 访问次数前 5

给定 hive 表 t1，包含三列，user_id、ip、time：表示 user_id 在时间 time 访问过一次 ip。请利用 hive-sql 计算每个 ip 访问次数排名前 5 的 user_id。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


select
    a.ip,
    a.user_id,
    a.cnt
from 
(
    select
        a.ip,
        a.user_id,
        a.cnt,
        row_number() over (partition by a.ip order by a.cnt desc) as rn 
    from 
    (
        select
            ip,
            user_id,
            count(time) as cnt
        from t1
        group by 1,2
    ) as a 
) as a 
where a.rn <= 5

数据清洗

客户端每隔几分钟都会上报一条记录，形式如 uid、time，我们做数据清洗的时候要求：如果 10 分钟内上报多条的话，就只保留最早的一条。请问怎么用 SQL 实现。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


// 将一个时间维表(分钟维度存储按照 10 分钟进行 N 等分)
with time_tmp as 
(
    select
        time_format, // 格式 12:10
        ntile(24*60/10) over (order by time) as time_rn
    from dim_time 
) 

select 
    a.uid,
    a.time
from 
(
    select
        a.uid,
        a.time,
        row_number() over (partition by b.time_rn order by a.time) rn 
    from table1 as a 
    left join time_tmp as b 
        on concat(hour(a.time), ':', minute(a.time)) = b.time_format
) as a 
where a.rn = 1

充值额最大的账号

假设有一个用户充值日志表 A，字段如下

字段类型含义
dist_id int 区组id
account string 账号
money int 充值金额
create_time string 订单时间

求解：统计某一天各区组下充值额最大的账号以及金额。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


select
    a.dist_id,
    a.account,
    a.money
from 
(
    select
        a.dist_id,
        a.account,
        a.money,
        row_number() over (partition by a.dist_id order by a.money desc) rn
    from tab1 as a 
    where date(a.create_time) = '2022-10-01'
) as a 
where a.rn = 1

观看视频

根据要求写出对应的 sql 语句

表A：用户浏览视频日志 user_behavior:date,user_id,video_id,start_time,end_time
表B：视频信息 video_info:video_id,video_duration
表C：用户信息 user_info:user_id,gender

问题：

某一天（如20200310），观看不同视频个数最多的前 5 名 user_id
观看超过 50 个不用视频的女性用户中，完整观看率最高的 10 个 user_id

1
2
3
4
5
6
7


select
    a.user_id,
    count(distinct a.video_id) as cnt
from tab_a as a 
where a.date = '20200310'
order by 2 desc 
limit 5

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


select
    a.user_id,
    a.video_finish_cnt / a.video_cnt as finish_rate
from 
(
    select
        a.user_id,
        count(distinct a.video_id) over (partition by a.user_id) as video_cnt,
        count(distinct case when unix_timestamp(a.end_time) - unix_timestamp(a.start_time) >= b.video_duration then a.video_id end) over (partition by a.user_id) as video_finish_cnt,
        row_number() over (partition by a.user_id order by a.start_time) rn
    from tab_a as a 
    left join tab_b as b 
        on a.video_id = b.video_id
    left join tab_c as c 
        on a.user_id = c.user_id
    where a.date = '20200310' and c.gender = '女'
) as a 
where a.video_cnt >= 50 and a.rn = 1
order by 2 desc 
limit 10

页面 session 时长计算

用户在浏览页面时，前端以 1 次/s 的频率上报后端，假如连续 30s 后端未收到上报，可以认为一次访问 session 已经结束，使用 hivesql/sparksql 求每次 session 的信息（只有一条记录时任务是 1s）。

示例：

表名为：page_upload_info，单次上报数据结构如下：

user_id->用户id，create_ts->上报时间戳（单位：秒），假如有下数据（数据未排序）：

user_id->1,create_ts->10000
user_id->1,create_ts->10010
user_id->1,create_ts->10003
user_id->1,create_ts->10070
user_id->2,create_ts->10000
user_id->2,create_ts->10003

解答：应返回 3 条结果，分别是：

user_id->1,session->10s,start_ts->10000,end_ts->10010
user_id->1,session->1s,start_ts->10070,end_ts->10070
user_id->2,session->3s,start_ts->10000,end_ts->10003

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36


select
    a.user_id,
    case when a.end_ts = a.create_ts then 1 else a.end_ts - a.create_ts end as session,
    a.start_ts,
    a.end_ts
from 
(
    select
        a.user_id,
        a.f_group,
        min(a.create_ts) as start_ts,
        max(a.create_ts) as end_ts
    from 
    (
        select
            a.user_id,
            a.create_ts,
            sum(f_flag) over (partition a.user_id order by a.create_ts desc) f_group
        from 
        (
            select
                a.user_id,
                a.create_ts,
                case when a.diff_create_ts >= 30 or a.diff_create_ts is null then 1 else 0 end as f_flag
            from 
            (
                select
                    a.user_id,
                    a.create_ts,
                    lead(a.create_ts, 1, null) over (partition by a.user_id order by a.create_ts) - a.create_ts as diff_create_ts
                from page_upload_info as a
            ) a 
        ) a
    ) a
    group by 1,2
) a 

留存，和连续活跃

1、用户活跃模型表：user_daily
以 imp_date、user_id 为主键，一个用户 1 天只出现一次，出现即表示当日登陆字段：

字段：
imp_date 时间，user_id 用户id，is_new 是否新用户，0-老用户，1-新用户

2、红包领取日志：money_flow

字段：
imp_date 时间，report_time 红包领取时间戳，用户id，add_money 领取金额

题目1：最近 1 个月每日未领红包用户的次日留存率和 7 日留存
题目2：最近 1 个月，每日 DAU 中，3 天连续活跃用户的占比

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


select
    a.imp_date,
    count(case when b.user_id is not null then a.user_id end)/count(a.user_id) as f_remain_1d_rate,
    count(case when c.user_id is not null then a.user_id end)/count(a.user_id) as f_remain_7d_rate
from 
(
    select
        a.imp_date,
        a.user_id
    from user_daily as a
    left join money_flow as b 
        on a.imp_date = b.imp_date and a.user_id = b.user_id
    where date_diff(current_date, a.imp_date) <= 30 and date_diff(current_date, b.imp_date) <= 30
          and b.user_id is null 
) a
left join user_daily b 
    on date_diff(b.imp_date, a.imp_date) = 1 and a.user_id = b.user_id
left join user_daily c 
    on date_diff(c.imp_date, a.imp_date) = 7 and a.user_id = c.user_id
group by 1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


select
    a.imp_date,
    count(case when b.user_id is not null and c.user_id is not null then a.user_id end)/count(a.user_id)  as f_3d_continue_rate
from user_daily as a 
left join user_daily as b 
    on date_diff(a.imp_date, b.imp_date) = 1 and a.user_id = b.user_id
left join user_daily as c
    on date_diff(a.imp_date, c.imp_date) = 2 and a.user_id = c.user_id
where date_diff(current_date, a.imp_date) <= 30
group by 1

算法题

编程类

统计kv

有一份数据有 n 行，每一行是一个由逗号、冒号间隔的字符串（例如 “a:10,b,20,c:30”），逗号拆分后每个都是字符串类型的 “key:value”。现需要统计每个 key 的 value 求和，可尝试用 python 或 hive/sql 或 spark 实现。

(1) 若 python 实现，则输入为一个list：lis1 = ['a:10,b:20,c:30', 'a:100,dog:200', 'hi:40,b:10', 'dog:20']
(2) 若 hive/sql 实现，则输入可设为一个表 A，且仅一列数据。

info
a:10,b:20,c:30
a:100,dog:200
hi:40,b:10
dog:20
……

(3) 若 saprk 实现，则输入可设为一个 rdd/dateframe

最终求得：a: 110, b: 30, c: 30, dog: 220, hi: 40。

答：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


lis1 = ['a:10,b:20,c:30', 'a:100,dog:200', 'hi:40,b:10', 'dog:20']

ret_dict = {}

for item in lis1:
    item_list = item.split(',')
    for i in item_list:
        k = i.split(':')[0]
        v = int(i.split(':')[1])
        if k in ret_dict:
            ret_dict[k] += v
        else:
            ret_dict[k] = v

print(ret_dict)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


select
    concat_ws(',', collect_list(concat(k,':',v))) as ret
from 
(
    select
        k,
        sum(v) as v 
    from 
    (
        select 
            split(value1, ':')[0] as k,
            cast(split(value1, ':')[1] as int) as v
        from tmp_2022120501 t lateral view explode(split(t.my_str, ',')) t1 as value1
    ) as a 
    group by 1
) as a 

目录

dub老师给滴题目准备

SQL 类型

成绩排名

平均分和总分最高

ip 访问次数前 5

数据清洗

充值额最大的账号

观看视频

页面 session 时长计算

留存，和连续活跃

算法题

x的平方根

字符串相加

平方数之和

最长回文子串

查询两个字符串的最长公共子串

单词拆分

编程类

统计kv