Using Python together with Hive

January 4, 2023, 08:58:58

Main task: use a Python script to parse a log file and store each parsed field in the matching column of a Hive table.

(1) Create a database of your own to hold all the tables you create:

    hive> create database lina;

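(2) Switch to that database and create two tables, record and log. Start with record, which holds the raw log lines:

    hive> use lina;

    hive> create table record(da string);

    hive> load data inpath '/source_data/<log file>' into table record;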

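    Each log line is now stored in the da column. The next step is to call a Python script that parses the string in da. Before that, the script has to be registered as a resource of the Hive session with add file; the registration lasts only for that session, so the script must be added again every time Hive is started: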

    hive> add file /data0/cdh/WeiboLog/resolveLog.py;
    The following output indicates success: Added resource: /data0/cdh/WeiboLog/resolveLog.py
 

    hive> create table log(server string, time string, url string, appkey string, uid string, ip string, pool string);

    hive> from record
        > insert overwrite table log
        > select transform(da) using 'resolveLog.py' as server, time, url, appkey, uid, ip, pool;

    This loads the parsed data into the log table, which can then be queried with ordinary SQL.
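For reference, a TRANSFORM script generally has the following shape: Hive writes each input row to the script's standard input (columns separated by tabs) and reads tab-separated columns back from its standard output, one row per line. The original resolveLog.py is not shown in this post, so the sketch below is only an assumption. It supposes a hypothetical log layout in which the seven fields are separated by whitespace, and the names parse_line and FIELDS are made up for illustration.

```python
#!/usr/bin/env python
"""Hypothetical sketch of a Hive TRANSFORM script (not the original resolveLog.py)."""
import sys

# Column order expected by the "as server, time, url, appkey, uid, ip, pool" clause.
FIELDS = ("server", "time", "url", "appkey", "uid", "ip", "pool")


def parse_line(line):
    """Split one raw log line into seven fields, or return None if it does not fit.

    ASSUMPTION: fields are whitespace-separated and appear in FIELDS order.
    The real log format is not given in the post, so adapt this to the real logs.
    """
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return parts


def main(stdin=sys.stdin, stdout=sys.stdout):
    for raw in stdin:
        fields = parse_line(raw.rstrip("\n"))
        if fields is None:
            continue  # skip lines that do not match the assumed layout
        # Hive reads TRANSFORM output as tab-separated columns, one row per line.
        stdout.write("\t".join(fields) + "\n")


if __name__ == "__main__":
    main()
```

Skipping malformed lines silently is a design choice of this sketch; writing them to standard error instead would make parsing problems easier to spot in the task logs.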

    Getting this to work was not easy. Many thanks to my mentor!

 

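Problem encountered:

When creating the record table, the text delimiters must be set explicitly, otherwise the results come out wrong. For example, a MapReduce job counted 20505509 lines in the log file, while counting the rows of the record table created without a line delimiter gave 21771333. That is a large discrepancy, so this is something to watch for when creating tables in the future.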
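The size of the mismatch is easy to quantify, and an independent line count taken outside Hive helps confirm which number is right. The sketch below assumes a local copy of the log file is available; count_lines is a hypothetical helper, and the two totals are simply the figures quoted above.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading binary chunks to keep memory use flat."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Figures quoted in the text above.
mapreduce_count = 20505509  # lines counted by the MapReduce job
hive_count = 21771333       # row count of the table created without a line delimiter

extra_rows = hive_count - mapreduce_count
extra_ratio = extra_rows / float(mapreduce_count)
print("extra rows: %d (%.2f%% over the MapReduce count)" % (extra_rows, 100 * extra_ratio))
```

Calling count_lines on a local copy of the file gives a third figure to compare against both the MapReduce count and the Hive count.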

Recreate the table as records, as follows:

hive> create table records(line string)
    > row format delimited
    > lines terminated by '\n' stored as textfile;

hive> load data inpath '/source_data/openapi_v4-2012-07-18_00000' into table records;

Create the logs table:

hive> create table logs(server string, time string, url string, appkey string, uid string, ip string, pool string)
    > row format delimited
    > fields terminated by '\001'
    > lines terminated by '\n' stored as textfile;

Load the data from records into logs:

hive> add file /data0/cdh/WeiboLog/resolveLog.py;
Added resource: /data0/cdh/WeiboLog/resolveLog.py
hive> from records
    > insert overwrite table logs
    > select transform(line)
    > using 'resolveLog.py'
    > as server, time, url, appkey, uid, ip, pool;

With this setup, running hive> select count(*) from logs; returns the correct result!
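A related sanity check is to confirm that every line the script emits carries exactly seven tab-separated columns, matching the as clause. This is a minimal sketch, assuming the script's output for a sample of the log has been captured as a list of strings; check_rows is a hypothetical helper, not part of Hive, and the sample rows are invented.

```python
EXPECTED_COLUMNS = 7  # server, time, url, appkey, uid, ip, pool


def check_rows(lines, expected=EXPECTED_COLUMNS):
    """Return (line_number, column_count) for every line with the wrong column count."""
    problems = []
    for number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        if len(columns) != expected:
            problems.append((number, len(columns)))
    return problems


# Invented sample output: one well-formed row and one deliberately short row.
sample = [
    "web01\t12:00:01\t/2/statuses\t123\t456\t10.0.0.1\tpoolA\n",
    "web02\t12:00:02\t/2/users\n",
]
print(check_rows(sample))
```

An empty list means every sampled row has the expected shape; any entries point at the line numbers worth inspecting in the raw log.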

  • Author: xiewenbo
  • Original link: https://blog.csdn.net/xiewenbo/article/details/12705693