Flume + Hive Comprehensive Case Study

TIP

Requirement: use Flume to store log data in the matching HDFS directory for each day, then use SQL to compute each day's data metrics.

The JSON data generated each day looks like this:

json

{
    "id": 86,
    "title": "数据结构和算法极速上手-java版",
    "coverImg": "https://file.aaa.net/video/2023/cover/86.png",
    "oldAmount": 99.00,
    "type": "video_info"
}
{
    "id":2,
    "name": "张三",
    "headImg": "https://file.ttt.net/user/2023/cover/2.png",
    "type": "user_info"
}
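The records above are pretty-printed for readability. A line-oriented exec source and a text-format Hive table both treat each line as one record, so the sketch below assumes every JSON object is written on a single line. The function name write_sample_log and the output file are hypothetical and exist only to produce test data.

```shell
#!/bin/bash
# Write the two sample records, one compact JSON object per line.
# write_sample_log and its output path are placeholders for this sketch.
write_sample_log() {
    local out_file="$1"
    cat > "${out_file}" <<'EOF'
{"id":86,"title":"数据结构和算法极速上手-java版","coverImg":"https://file.aaa.net/video/2023/cover/86.png","oldAmount":99.00,"type":"video_info"}
{"id":2,"name":"张三","headImg":"https://file.ttt.net/user/2023/cover/2.png","type":"user_info"}
EOF
}
```

Calling write_sample_log /tmp/moreType.log would give a small file that can be used to try out the rest of the pipeline.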

Requirements Analysis

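Use Flume to collect the data and store it by day and by type, under the HDFS directory /moreType/<date>/<type>.
The Flume components are: an exec source; a channel, which can be file-based or memory-based; and an HDFS sink. Using %Y%m%d in the HDFS sink's path resolves to the date, so each day's log data lands in its own HDFS directory.
Create an external table over the daily log data so that multiple compute engines can use it. Because offline jobs are usually computed per day, add a date partition column to the table.
The Flume collection flow itself is not demonstrated again here.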
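For orientation only, an agent matching the description above might be configured roughly as follows. This is a sketch under several assumptions that are not in the original: the agent name a1, the log path, the channel directories, the NameNode address, the roll settings, and the header name logType are all placeholders. Using a regex_extractor interceptor to copy the JSON type value into a header is one possible way to obtain the per-type directory, and it is not necessarily how the original setup does it.

```shell
#!/bin/bash
# Write a sample Flume agent configuration to the given file.
# All paths, host names and the header name logType are placeholders.
write_flume_conf() {
    local out_file="$1"
    cat > "${out_file}" <<'EOF'
# Agent a1: exec source -> file channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail the application log, one event per line
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/log/moreType.log
a1.sources.r1.channels = c1

# Copy the JSON "type" value into an event header named logType
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = "type" ?: ?"([a-zA-Z_]+)"
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = logType

# File channel, which keeps buffered events on disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/moreType/checkpoint
a1.channels.c1.dataDirs = /data/flume/moreType/data

# HDFS sink: %Y%m%d is the date, %{logType} is the header set above
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/moreType/%Y%m%d/%{logType}
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Use the agent's local time for %Y%m%d instead of a timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
EOF
}
```

After generating the file, the agent could be started with flume-ng agent --name a1 --conf conf --conf-file followed by the file path. The date escapes only resolve if events carry a timestamp header or useLocalTimeStamp is enabled, which is why the last property is set.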

1. Create a single column that stores the entire JSON record

Create a table with just one column, which holds one JSON record per row.

sql

create external table ex_par_more_type (
    log string
) partitioned by (dt string, d_type string) 
 row format delimited 
 fields terminated by '\t'
 location '/moreType';

The second step is to add the partitions. This has to be done once every day:

sql

alter table ex_par_more_type add partition(dt='20231016',d_type='video_info') location '/moreType/20231016/video_info';
alter table ex_par_more_type add partition(dt='20231016',d_type='user_info') location '/moreType/20231016/user_info';

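2. Create views

Create a view over the external partitioned table created earlier. When queried, the view parses the fields out of the JSON data, so from then on we only need to query the view to get the fields we need, without writing any parsing code.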

sql

create view video_info_view as select 
get_json_object(log, '$.id') as id,
get_json_object(log, '$.title') as title,
get_json_object(log, '$.coverImg') as coverImg,
get_json_object(log, '$.oldAmount') as oldAmount,
dt
from ex_par_more_type where d_type='video_info';

Note that different data types may have different JSON structures, so each type needs its own view with the matching fields.

sql

create view user_info_view as select 
get_json_object(log, '$.id') as id,
get_json_object(log, '$.name') as name,
get_json_object(log, '$.headImg') as headImg,
dt
from ex_par_more_type where d_type='user_info';

A date can be specified when querying:

sql

select * from user_info_view where dt = '20231016';
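The requirement mentions daily metrics. One of the simplest is the number of records per day, which can be read straight from a view. The helper below is a hypothetical sketch (the name daily_count_sql is not from the original). It only prints the HiveQL text, so it can be inspected before being passed to hive -e.

```shell
#!/bin/bash
# Build a HiveQL statement that counts the rows of one view for one day.
# daily_count_sql is a hypothetical helper; it prints text and runs nothing.
daily_count_sql() {
    local view_name="$1"
    local dt="$2"
    printf "select dt, count(*) as cnt from %s where dt = '%s' group by dt;\n" \
        "${view_name}" "${dt}"
}
```

On a machine with the Hive CLI, the statement could be run with hive -e "$(daily_count_sql user_info_view 20231016)". Filtering on dt lets Hive read only that day's partition instead of scanning the whole table.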

TIP

get_json_object is a built-in Hive function that extracts the value of a given key from a JSON string. Its syntax is:

sql

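get_json_object(json_string, '$.key') -- json_string is the JSON string; key is the key to extract

$.key is a JSONPath expression (Hive supports a restricted subset of JSONPath) that selects the value stored under key in the JSON object.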
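To try the function without any table, it can be applied to a literal string. The helper below is a hypothetical sketch (json_field_sql is not from the original) that prints such a statement. It assumes the JSON text contains no single quotes, since the text is embedded in a single-quoted SQL literal.

```shell
#!/bin/bash
# Print a HiveQL statement that extracts one top-level key from a JSON literal.
# json_field_sql is a hypothetical helper; the JSON must not contain single quotes.
json_field_sql() {
    local json="$1"
    local key="$2"
    # \$ keeps the dollar sign of the JSONPath literal inside double quotes
    printf "select get_json_object('%s', '\$.%s');\n" "${json}" "${key}"
}
```

For example, json_field_sql '{"id":86,"type":"video_info"}' type prints a statement which, when run with hive -e, should return video_info.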

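3. Write a script to add partitions on a schedule

Flume collects new data every day and uploads it to HDFS, so the add-partition step has to be run every day. Write a script, addPartition.sh, to replace the manual work: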

shell

#!/bin/bash
# Scheduled for 1 a.m. every day: adds the partitions for the current date.
# An optional first argument (YYYYMMDD) overrides the date.
if [ -z "$1" ]
then
    dt=$(date +%Y%m%d)
else
    dt=$1
fi
# Add one partition per data type for the chosen date
hive -e "
alter table ex_par_more_type add if not exists partition(dt='${dt}',d_type='video_info') location '/moreType/${dt}/video_info';
alter table ex_par_more_type add if not exists partition(dt='${dt}',d_type='user_info') location '/moreType/${dt}/user_info';
"

TIP

If the specified partition already exists, adding it again raises an error. Use if not exists to avoid this.
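Because the script accepts an optional date argument and uses if not exists, it can be rerun safely for days that were missed. The sketch below shows one way to backfill a range of dates. The function names date_range and backfill_partitions are hypothetical, the date arithmetic assumes GNU date (the -d option), and the script path is the one used in the crontab entry of this document.

```shell
#!/bin/bash
# Print every date from start to end (inclusive) in YYYYMMDD form.
# Assumes GNU date; TZ=UTC avoids daylight-saving edge cases.
date_range() {
    local current="$1"
    local end="$2"
    while [ "${current}" -le "${end}" ]; do
        echo "${current}"
        current=$(TZ=UTC date -d "${current} + 1 day" +%Y%m%d)
    done
}

# Call the partition script once for each day in the range.
backfill_partitions() {
    local d
    for d in $(date_range "$1" "$2"); do
        /bin/bash /data/soft/hivedata/addPartition.sh "${d}"
    done
}
```

For example, backfill_partitions 20231016 20231018 would invoke addPartition.sh three times, once for each day.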

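crontab scheduled task

Add the following entry to /etc/crontab (the system crontab format, which includes a user field, here root) so that the script runs at 01:00 every day and both its output and its errors are appended to the log file:

00 01 * * * root /bin/bash /data/soft/hivedata/addPartition.sh >> /data/soft/hivedata/addPartition.log 2>&1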
