Author: duktig
Blog: https://duktig.cn (first article)
Excellent still work hard. May you enjoy what you give and enjoy what you get.
See github knowledge base for more articles: https://github.com/duktig666/knowledge
background
After learning Hadoop, do you feel that writing a MapReduce program is very complex, and it requires a lot of development cost to conduct analysis and statistics. Then let's learn about Hive, another member of the Hadoop ecosystem. Let's learn how to use SQL like language to quickly query and analyze data.
In the previous article, we learned about Hive's overview, DDL statements and DML statements (key points), bucket table and partition table, common functions and compressed storage. This article focuses on Hive's actual combat.
Hive series articles are as follows:
Hive actual combat
Requirement description
Statistics of general indicators of video and audio websites in Silicon Valley, including various TopN indicators:
- Statistics of video viewing number Top10
- Statistical video category heat Top10
- Count the categories of the 20 videos with the highest number of video views and the number of top 20 videos in the category
- Count the number of video views and sort the categories of the videos associated with the Top50
- Count the video popularity Top10 in each category, taking Music as an example
- Top 10 video views per category
- Count the top 10 users who upload the most videos and the top 20 videos they upload
Data preparation
data dictionary
Video table:
field | remarks | Detailed description |
---|---|---|
videoId | Video unique id (String) | 11 bit string |
uploader | Video uploader (String) | User name for uploading video |
age | Video age (int) | Integer days of video on the platform |
category | Video category (array < string >) | Upload video to the specified video category |
length | Video length (int) | Shaping the video length of the digital ID |
views | Viewing times (int) | Number of times the video has been viewed |
rate | Video rating (Double) | Full Score: 5 points |
Ratings | Flow (int) | Video traffic, integer digital |
conments | Comments (int) | Integer number of comments for a video |
relatedId | Related video id (array < string >) | id of relevant video, up to 20 |
User table: |
field | Field type | remarks |
---|---|---|
uploader | string | Uploader user name |
videos | int | Number of uploaded videos |
friends | int | Number of friends |
Create table
Create original data table: gulivideo_ori,gulivideo_user_ori,
Create final table (ORC storage format table with snappy compression): gulivideo_orc,gulivideo_user_orc
Create original data table - gulivideo_ori
create table gulivideo_ori( videoId string comment 'Video unique id', uploader string comment 'Video uploader(user name)', age int comment 'Video age (integer days of video on the platform)', category array<string> comment 'Video category', length int comment 'Video length', views int comment 'Viewing times', rate float comment 'Video rating', ratings int comment 'flow', comments int comment 'Number of comments', relatedId array<string> comment 'Related video id(Up to 20)' ) row format delimited fields terminated by "\t" collection items terminated by "&" stored as textfile;
Create original data table - gulivideo_user_ori
create table gulivideo_user_ori( uploader string comment 'Uploader user name', videos int comment 'Number of uploaded videos ', friends int comment 'Number of friends' ) row format delimited fields terminated by "\t" stored as textfile;
Create final table - gulivideo_orc
create table gulivideo_orc( videoId string comment 'Video unique id', uploader string comment 'Video uploader(user name)', age int comment 'Video age (integer days of video on the platform)', category array<string> comment 'Video category', length int comment 'Video length', views int comment 'Viewing times', rate float comment 'Video rating', ratings int comment 'flow', comments int comment 'Number of comments', relatedId array<string> comment 'Related video id(Up to 20)' ) row format delimited fields terminated by "\t" collection items terminated by "&" stored as orc tblproperties("orc.compress"="SNAPPY");
**Create final table - gulivideo_user_orc **
create table gulivideo_user_orc ( uploader string comment 'Uploader user name', videos int comment 'Number of uploaded videos ', friends int comment 'Number of friends' ) row format delimited fields terminated by "\t" stored as orc tblproperties("orc.compress"="SNAPPY");
insert data
vedio partial data:
LKh7zAJ4nwo TheReceptionist 653 Entertainment 424 13021 4.34 1305 744 DjdA-5oKYFQ&NxTDlnOuybo&c-8VuICzXtU&DH56yrIO5nI&W1Uo5DQTtzc&E-3zXq_r4w0&1TCeoRPg5dE&yAr26YhuYNY&2ZgXx72XmoE&-7ClGo-YgZ0&vmdPOOd6cxI&KRHfMQqSHpk&pIMpORZthYw&1tUDzOp10pk&heqocRij5P0&_XIuvoH6rUg&LGVU5DsezE0&uO2kj6_D8B4&xiDqywcDQRM&uX81lMev6_o 7D0Mf4Kn4Xk periurban 583 Music 201 6508 4.19 687 312 e2k0h6tPvGc&yuO6yjlvXe8&VqpnWBo-R4E&bdDskrr8jRY&y3IDp2n7B48&JngPWhfCb2M&KQaUvH5oiO4&NSzrwv5MCwc&NHB0a0xtLgU&DlRodd4s86s&EzKwOYLh-S0&eUIfRyrqwp8&AK8Wtfwe-1k&Eq4hGkIqBGw&N1lkLaLJHlc&-uIffs-DHkM&zpTorUhCd8Y&AvSK0qPw7EU&WX5KLMqY4bM&VKFqqoeMdjw n1cEq1C8oqQ Pipistrello 525 Comedy 125 1687 4.01 363 141 eprHhmurMHg&i30NkTJOrak&2XtLgZol5wI&3nH5Tccz8EQ&bSPVayE0NhE&sEqCkwPmQ_w&hut3VRL5XRE&bWlPSLUT-6U&dsBTo5LExr0&7PSvpPXppXA&yLup8wjbSIo&lbf4d1pZI9c&uRQYan_-CTQ&gnpvEvuiFoQ&F2_5KOnSsfI&DINu35v3eMU&9uSiyn7t_0o&YfShxdbAJS8&ssdfqTwZXY0&z5wDjq8o60c OHkEzL4Unck ichannel 638 Comedy 299 8043 4.4 518 371 eyUSTmEUQRg&FDIH1GNQXQE&Wtj31off8-I&mDjwzhc8dQ0&N4EYgXReBzM&NyC_0Z6zoUk&4DxyF39Myto&aiYwo5K0VWg&Ml2NaXU6gms&d0VYKbEbXQ8&LQUV_XGzHmA&8OmL_BJRLRw&qeCFW97-fOA&DVNwUKAuB3I&FMuWYExDEJk&rE7TuuXkk4E&bWicrzq2ApQ&jh6EpXnMb18&9JhU2jE02gg&nfBfC8bif1Y -boOvAGNKUc mrpitifulband 639 Music 287 7548 4.48 606 386 fmUwUURgsX0&bR27ACWomug&LlH7WcVptw8&saBmFpuwmKA&lhWk9SXUjWI&aVhSaa6aAOg&W-pvpxlOzZk&0vhVZQEzgcU&dDhCZVQf9po&zIkvMoezI1A&eV2SdBITv8k&cIO6nFDnNs4&Bd7nAtOEA3U&RZo5MisSTWo&geiABCqmQ84&MG1Xv99426g&7wj8-HkZ0XQ&JsdCu9T47iY&OUeN4DhCIFw&sf-Ym_pFP6U hFFH8DaOHQg istothehalfabee 592 Music 286 1759 4.45 539 244 hFFH8DaOHQg&ZIo-7BBDaPo&83SpuBijrBY&7TyH0ipgdtY&ZOdRUn0Q9eI&jqNs_S0n7P8&aWAzYehh0ag&vEtM1q6gm9Q&r89-fFx_tHU&h6Hw5030fKs&4qf7RSNCg40&LUXn57T8H50&ejPUALKGOn8&D6ABDEdhQLA&c8UYucsGdTU&El_Xbktje1k&6PAc6ZaK_WI&GUgJKzEmsYI&_sboDb75X2I&oDIIOV4VKlA LzHjIj3fpR8 Xelanderthomas 686 Comedy 168 4545 4.58 273 167 udr9sLkoZ0s&3IU1GyX_zio&0E7Egr8Y1YI&qr8qZcvTLng&4WwVOWIqE80&Qeeq5OoLGJ0&YYDL1SqX-SY&vWGA5iYgAOU&8FeIj2HLN8k&bKlBTr88VTw&Y_59kWK5W3s&QlJSXVglZ3g&K3h_9O6OwW0&4ALe2z---e0&kdZk1Wk7kSw&hUa7f5XEzGE&aOihMldu_pE&PlPynB10vP0&W9DPlAZUH6Q&vta4RfQ2Z-I SDNkMu8ZT68 w00dy911 630 People&Blogs 186 10181 3.49 494 257 rjnbgpPJUks PkGUU_ggO3k theresident 704 Entertainment 262 11235 3.85 247 280 PkGUU_ggO3k&EYC5bWF0ss8&EUPHdnE83GY&JO1LTIFOkTw&gVSzbvFnVRY&l9NJ04JiZj4&ay3gcr84YeQ&AfBxANiGnnU&RyWz8hwGbY4&BeJ7tGRgiW4&fbq2-jd5Dto&j8fTx5E5rik&qGkCtXLN1W0&mh_MGyx9tgc&bgn6RYut2lE&HS6Nqxh4uf4&m9Gq44o5pcA&K7unV366Qr4&shU2hfHKmU0&p0lq5-8IDqY RX24KLBhwMI lemonette 697 People&Blogs 512 24149 4.22 315 474 t60tW0WevkE&WZgoejVDZlo&Xa_op4MhSkg&MwynZ8qTwXA&sfG2rtAkAcg&j72VLPwzd_c&24Qfs69Al3U&EGWutOjVx4M&KVkseZR5coU&R6OaRcsfnY4&dGM3k_4cNhE&ai-cSq6APLQ&73M0y-iD9WE&3uKOSjE79YA&9BBu5N0iFBg&7f9zwx52xgA&ncEV0tSC7xM&H-J8Kbx9o68&s8xf4QX1UvA&2cKd9ERh5-8
user part data:
barelypolitical 151 5106 bonk65 89 144 camelcars 26 674 cubskickass34 13 126 boydism08 32 50 deckthree 6 753 fiveawesomegirls 182 3 ericielfenix 6 0 erricshade 3 49 blacktreemedia 520 3199 childfoundationcom 1 2 davedays 36 32072 fiveawesomeguys 160 2230 communitychannel 71 4280 ashantimusic 12 0 futvolg0les 9 0 all4tubekids 137 1333 ewupawly 65 143 frankjpmorgan 5 0 bethany9788 35 6 dingpolistico 30 1 cpfreak730 18 26 cmcgeh 14 0 chipmunked101 2 0 barNoNsouthport 21 0
Insert data:
Insert data into ori table
load data local inpath "./video" into table gulivideo_ori; load data local inpath "./user" into table gulivideo_user_ori;
Insert data into orc table
insert into table gulivideo_orc select * from gulivideo_ori; insert into table gulivideo_user_orc select * from gulivideo_user_ori;
Business analysis
1. Count the number of video views Top10
Idea: use order by to make a global sorting according to the views field. At the same time, we set that only the first 10 items are displayed.
Final code:
SELECT videoId, views FROM gulivideo_ori ORDER BY views DESC LIMIT 10;
2. Statistical video category heat Top10
Idea:
- That is, count the number of videos in each category, and display the top 10 categories with the most videos.
- We need to aggregate by category group by, and then count the number of videoids in the group.
- Because the current table structure is: one video corresponds to one or more categories. Therefore, if you want to group by category, you need to convert the category column to row (expand), and then count.
- Finally, the top 10 items are displayed in order of heat.
Code implementation:
Writing method 1:
SELECT t1.category_name , COUNT(t1.videoId) hot FROM ( SELECT videoId, category_name FROM gulivideo_orc lateral VIEW explode(category) gulivideo_orc_tmp AS category_name ) t1 GROUP BY t1.category_name ORDER BY hot DESC LIMIT 10;
Method 2:
SELECT t1.category_name , COUNT(t1.videoId) hot FROM ( SELECT explode(category) category_name FROM gulivideo_orc )t1 GROUP BY t1.category_name ORDER BY hot DESC LIMIT 10;
3. Count the categories of the 20 videos with the highest number of video views and the number of top 20 videos in the category
Idea:
-
First find all the information of the entries of the 20 videos with the highest number of views and arrange them in descending order
# t1 SELECT videoId, views , category FROM gulivideo_orc ORDER BY views DESC LIMIT 20;
-
Split the category in these 20 pieces of information (column to row)
# t2 SELECT explode(category) category_name FROM t1;
-
Finally, query the video category name and how many Top20 videos are under this category
SELECT t2.category_name, COUNT(t2.videoId) video_sum FROM t2 GROUP BY t2.category_name;
Final code:
SELECT t2.category_name, COUNT(*) video_sum FROM ( SELECT explode(category) category_name FROM ( SELECT videoId, views , category FROM gulivideo_orc ORDER BY views DESC LIMIT 20 ) t1 ) t2 GROUP BY t2.category_name
4. Count the number of video views and sort the categories of the videos associated with the Top50
Idea:
-
Find the video (array) associated with the video with the number of video views Top50
# t1 SELECT relatedId, views, FROM gulivideo_orc ORDER BY views DESC LIMIT 50;
-
Convert associated video columns to rows
# t2 SELECT explode(relatedId) related_id FROM t1;
-
JOIN the original table and retrieve the category (array) to which the associated video belongs
# t3 SELECT g.category FROM t2 JOIN gulivideo_orc g ON t2.related_id = g.vedioId;
-
Category field column to row
# t4 SELECT explode(category) category_name FROM t3;
-
Group by category, find the dount, and sort by count
SELECT t4.category_name, COUNT(*) hot FROM t4 GROUP BY t1.category_name ORDER BY hot DESC;
Final code:
SELECT t4.category_name, COUNT(*) hot FROM ( SELECT explode(category) category_name FROM ( SELECT g.category FROM ( SELECT explode(relatedId) related_id FROM ( SELECT relatedId, views, FROM gulivideo_orc ORDER BY views DESC LIMIT 50 )t1 )t2 JOIN gulivideo_orc g ON t2.related_id = g.vedioId )t3 )t4 GROUP BY t4.category_name ORDER BY hot DESC;
Other contents are not listed one by one.