Hive based on big data -- hive actual combat (various problems of statistical film ranking)

Posted by IceD on Tue, 02 Nov 2021 06:06:37 +0100

Author: duktig

Blog: https://duktig.cn (first article)

Excellent still work hard. May you enjoy what you give and enjoy what you get.

See github knowledge base for more articles: https://github.com/duktig666/knowledge

background

After learning Hadoop, do you feel that writing a MapReduce program is very complex, and it requires a lot of development cost to conduct analysis and statistics. Then let's learn about Hive, another member of the Hadoop ecosystem. Let's learn how to use SQL like language to quickly query and analyze data.

In the previous article, we learned about Hive's overview, DDL statements and DML statements (key points), bucket table and partition table, common functions and compressed storage. This article focuses on Hive's actual combat.

Hive series articles are as follows:

Hive actual combat

Requirement description

Statistics of general indicators of video and audio websites in Silicon Valley, including various TopN indicators:

  • Statistics of video viewing number Top10
  • Statistical video category heat Top10
  • Count the categories of the 20 videos with the highest number of video views and the number of top 20 videos in the category
  • Count the number of video views and sort the categories of the videos associated with the Top50
  • Count the video popularity Top10 in each category, taking Music as an example
  • Top 10 video views per category
  • Count the top 10 users who upload the most videos and the top 20 videos they upload

Data preparation

data dictionary

Video table:

fieldremarksDetailed description
videoIdVideo unique id (String)11 bit string
uploaderVideo uploader (String)User name for uploading video
ageVideo age (int)Integer days of video on the platform
categoryVideo category (array < string >)Upload video to the specified video category
lengthVideo length (int)Shaping the video length of the digital ID
viewsViewing times (int)Number of times the video has been viewed
rateVideo rating (Double)Full Score: 5 points
RatingsFlow (int)Video traffic, integer digital
conmentsComments (int)Integer number of comments for a video
relatedIdRelated video id (array < string >)id of relevant video, up to 20
User table:
fieldField typeremarks
uploaderstringUploader user name
videosintNumber of uploaded videos
friendsintNumber of friends

Create table

Create original data table: gulivideo_ori,gulivideo_user_ori,
Create final table (ORC storage format table with snappy compression): gulivideo_orc,gulivideo_user_orc

Create original data table - gulivideo_ori

create table gulivideo_ori( 
    videoId string comment 'Video unique id',  
    uploader string comment 'Video uploader(user name)',  
    age int comment 'Video age (integer days of video on the platform)',  
    category array<string> comment 'Video category',  
    length int comment 'Video length',  
    views int comment 'Viewing times',  
    rate float comment 'Video rating',  
    ratings int comment 'flow',  
    comments int comment 'Number of comments', 
    relatedId array<string> comment 'Related video id(Up to 20)'
) 
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as textfile; 

Create original data table - gulivideo_user_ori

create table gulivideo_user_ori( 
    uploader string comment 'Uploader user name', 
    videos int comment 'Number of uploaded videos ', 
    friends int comment 'Number of friends'
) 
row format delimited  
fields terminated by "\t"  
stored as textfile; 

Create final table - gulivideo_orc

create table gulivideo_orc( 
    videoId string comment 'Video unique id',  
    uploader string comment 'Video uploader(user name)',  
    age int comment 'Video age (integer days of video on the platform)',  
    category array<string> comment 'Video category',  
    length int comment 'Video length',  
    views int comment 'Viewing times',  
    rate float comment 'Video rating',  
    ratings int comment 'flow',  
    comments int comment 'Number of comments', 
    relatedId array<string> comment 'Related video id(Up to 20)'
) 
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc 
tblproperties("orc.compress"="SNAPPY"); 

**Create final table - gulivideo_user_orc **

create table gulivideo_user_orc ( 
    uploader string comment 'Uploader user name', 
    videos int comment 'Number of uploaded videos ', 
    friends int comment 'Number of friends'
) 
row format delimited  
fields terminated by "\t"  
stored as orc 
tblproperties("orc.compress"="SNAPPY"); 

insert data

vedio partial data:

LKh7zAJ4nwo	TheReceptionist	653	Entertainment	424	13021	4.34	1305	744	DjdA-5oKYFQ&NxTDlnOuybo&c-8VuICzXtU&DH56yrIO5nI&W1Uo5DQTtzc&E-3zXq_r4w0&1TCeoRPg5dE&yAr26YhuYNY&2ZgXx72XmoE&-7ClGo-YgZ0&vmdPOOd6cxI&KRHfMQqSHpk&pIMpORZthYw&1tUDzOp10pk&heqocRij5P0&_XIuvoH6rUg&LGVU5DsezE0&uO2kj6_D8B4&xiDqywcDQRM&uX81lMev6_o
7D0Mf4Kn4Xk	periurban	583	Music	201	6508	4.19	687	312	e2k0h6tPvGc&yuO6yjlvXe8&VqpnWBo-R4E&bdDskrr8jRY&y3IDp2n7B48&JngPWhfCb2M&KQaUvH5oiO4&NSzrwv5MCwc&NHB0a0xtLgU&DlRodd4s86s&EzKwOYLh-S0&eUIfRyrqwp8&AK8Wtfwe-1k&Eq4hGkIqBGw&N1lkLaLJHlc&-uIffs-DHkM&zpTorUhCd8Y&AvSK0qPw7EU&WX5KLMqY4bM&VKFqqoeMdjw
n1cEq1C8oqQ	Pipistrello	525	Comedy	125	1687	4.01	363	141	eprHhmurMHg&i30NkTJOrak&2XtLgZol5wI&3nH5Tccz8EQ&bSPVayE0NhE&sEqCkwPmQ_w&hut3VRL5XRE&bWlPSLUT-6U&dsBTo5LExr0&7PSvpPXppXA&yLup8wjbSIo&lbf4d1pZI9c&uRQYan_-CTQ&gnpvEvuiFoQ&F2_5KOnSsfI&DINu35v3eMU&9uSiyn7t_0o&YfShxdbAJS8&ssdfqTwZXY0&z5wDjq8o60c
OHkEzL4Unck	ichannel	638	Comedy	299	8043	4.4	518	371	eyUSTmEUQRg&FDIH1GNQXQE&Wtj31off8-I&mDjwzhc8dQ0&N4EYgXReBzM&NyC_0Z6zoUk&4DxyF39Myto&aiYwo5K0VWg&Ml2NaXU6gms&d0VYKbEbXQ8&LQUV_XGzHmA&8OmL_BJRLRw&qeCFW97-fOA&DVNwUKAuB3I&FMuWYExDEJk&rE7TuuXkk4E&bWicrzq2ApQ&jh6EpXnMb18&9JhU2jE02gg&nfBfC8bif1Y
-boOvAGNKUc	mrpitifulband	639	Music	287	7548	4.48	606	386	fmUwUURgsX0&bR27ACWomug&LlH7WcVptw8&saBmFpuwmKA&lhWk9SXUjWI&aVhSaa6aAOg&W-pvpxlOzZk&0vhVZQEzgcU&dDhCZVQf9po&zIkvMoezI1A&eV2SdBITv8k&cIO6nFDnNs4&Bd7nAtOEA3U&RZo5MisSTWo&geiABCqmQ84&MG1Xv99426g&7wj8-HkZ0XQ&JsdCu9T47iY&OUeN4DhCIFw&sf-Ym_pFP6U
hFFH8DaOHQg	istothehalfabee	592	Music	286	1759	4.45	539	244	hFFH8DaOHQg&ZIo-7BBDaPo&83SpuBijrBY&7TyH0ipgdtY&ZOdRUn0Q9eI&jqNs_S0n7P8&aWAzYehh0ag&vEtM1q6gm9Q&r89-fFx_tHU&h6Hw5030fKs&4qf7RSNCg40&LUXn57T8H50&ejPUALKGOn8&D6ABDEdhQLA&c8UYucsGdTU&El_Xbktje1k&6PAc6ZaK_WI&GUgJKzEmsYI&_sboDb75X2I&oDIIOV4VKlA
LzHjIj3fpR8	Xelanderthomas	686	Comedy	168	4545	4.58	273	167	udr9sLkoZ0s&3IU1GyX_zio&0E7Egr8Y1YI&qr8qZcvTLng&4WwVOWIqE80&Qeeq5OoLGJ0&YYDL1SqX-SY&vWGA5iYgAOU&8FeIj2HLN8k&bKlBTr88VTw&Y_59kWK5W3s&QlJSXVglZ3g&K3h_9O6OwW0&4ALe2z---e0&kdZk1Wk7kSw&hUa7f5XEzGE&aOihMldu_pE&PlPynB10vP0&W9DPlAZUH6Q&vta4RfQ2Z-I
SDNkMu8ZT68	w00dy911	630	People&Blogs	186	10181	3.49	494	257	rjnbgpPJUks
PkGUU_ggO3k	theresident	704	Entertainment	262	11235	3.85	247	280	PkGUU_ggO3k&EYC5bWF0ss8&EUPHdnE83GY&JO1LTIFOkTw&gVSzbvFnVRY&l9NJ04JiZj4&ay3gcr84YeQ&AfBxANiGnnU&RyWz8hwGbY4&BeJ7tGRgiW4&fbq2-jd5Dto&j8fTx5E5rik&qGkCtXLN1W0&mh_MGyx9tgc&bgn6RYut2lE&HS6Nqxh4uf4&m9Gq44o5pcA&K7unV366Qr4&shU2hfHKmU0&p0lq5-8IDqY
RX24KLBhwMI	lemonette	697	People&Blogs	512	24149	4.22	315	474	t60tW0WevkE&WZgoejVDZlo&Xa_op4MhSkg&MwynZ8qTwXA&sfG2rtAkAcg&j72VLPwzd_c&24Qfs69Al3U&EGWutOjVx4M&KVkseZR5coU&R6OaRcsfnY4&dGM3k_4cNhE&ai-cSq6APLQ&73M0y-iD9WE&3uKOSjE79YA&9BBu5N0iFBg&7f9zwx52xgA&ncEV0tSC7xM&H-J8Kbx9o68&s8xf4QX1UvA&2cKd9ERh5-8

user part data:

barelypolitical	151	5106
bonk65	89	144
camelcars	26	674
cubskickass34	13	126
boydism08	32	50
deckthree	6	753
fiveawesomegirls	182	3
ericielfenix	6	0
erricshade	3	49
blacktreemedia	520	3199
childfoundationcom	1	2
davedays	36	32072
fiveawesomeguys	160	2230
communitychannel	71	4280
ashantimusic	12	0
futvolg0les	9	0
all4tubekids	137	1333
ewupawly	65	143
frankjpmorgan	5	0
bethany9788	35	6
dingpolistico	30	1
cpfreak730	18	26
cmcgeh	14	0
chipmunked101	2	0
barNoNsouthport	21	0

Insert data:

Insert data into ori table

load data local inpath "./video" into table gulivideo_ori; 
load data local inpath "./user" into table gulivideo_user_ori; 

Insert data into orc table

insert into table gulivideo_orc select * from gulivideo_ori; 
insert into table gulivideo_user_orc select * from  gulivideo_user_ori; 

Business analysis

1. Count the number of video views Top10

Idea: use order by to make a global sorting according to the views field. At the same time, we set that only the first 10 items are displayed.

Final code:

SELECT  
     videoId, 
     views 
FROM  
     gulivideo_ori 
ORDER BY  
     views DESC 
LIMIT 10; 

2. Statistical video category heat Top10

Idea:

  1. That is, count the number of videos in each category, and display the top 10 categories with the most videos.
  2. We need to aggregate by category group by, and then count the number of videoids in the group.
  3. Because the current table structure is: one video corresponds to one or more categories. Therefore, if you want to group by category, you need to convert the category column to row (expand), and then count.
  4. Finally, the top 10 items are displayed in order of heat.

Code implementation:

Writing method 1:

SELECT  
    t1.category_name ,  
    COUNT(t1.videoId) hot 
FROM  
( 
SELECT  
    videoId,  
    category_name  
FROM  
    gulivideo_orc  
lateral VIEW explode(category) gulivideo_orc_tmp AS category_name 
) t1 
GROUP BY  
    t1.category_name  
ORDER BY 
    hot  
DESC  
LIMIT 10;

Method 2:

SELECT  
    t1.category_name ,  
    COUNT(t1.videoId) hot 
FROM  
( 
SELECT 
    explode(category) category_name
    FROM gulivideo_orc
)t1
GROUP BY  
    t1.category_name  
ORDER BY 
    hot  
DESC  
LIMIT 10;

3. Count the categories of the 20 videos with the highest number of video views and the number of top 20 videos in the category

Idea:

  1. First find all the information of the entries of the 20 videos with the highest number of views and arrange them in descending order

    # t1
    SELECT  
        videoId,  
        views , 
        category  
    FROM  
        gulivideo_orc 
    ORDER BY  
        views  
    DESC  
    LIMIT 20;
    
  2. Split the category in these 20 pieces of information (column to row)

    # t2
    SELECT 
        explode(category) category_name
    FROM t1;
    
  3. Finally, query the video category name and how many Top20 videos are under this category

    SELECT  
    	t2.category_name, 
        COUNT(t2.videoId) video_sum 
    FROM  t2
    GROUP BY t2.category_name;
    

Final code:

SELECT  
	t2.category_name, 
    COUNT(*) video_sum 
FROM  
    ( 
    SELECT 
        explode(category) category_name
    FROM  
        ( 
        SELECT  
            videoId,  
            views , 
            category  
        FROM  
            gulivideo_orc 
        ORDER BY views DESC  
        LIMIT 20  
        ) t1 
	) t2 
GROUP BY t2.category_name 

4. Count the number of video views and sort the categories of the videos associated with the Top50

Idea:

  1. Find the video (array) associated with the video with the number of video views Top50

    # t1
    SELECT  
       relatedId,  
       views, 
    FROM  
       gulivideo_orc 
    ORDER BY views DESC  
    LIMIT 50;
    
  2. Convert associated video columns to rows

    # t2
    SELECT 
        explode(relatedId) related_id
    FROM t1;
    
  3. JOIN the original table and retrieve the category (array) to which the associated video belongs

    # t3
    SELECT 
        g.category
    FROM t2
    JOIN gulivideo_orc g
    ON t2.related_id = g.vedioId;
    
  4. Category field column to row

    # t4
    SELECT 
        explode(category) category_name
    FROM t3;
    
  5. Group by category, find the dount, and sort by count

    SELECT  
        t4.category_name,  
        COUNT(*) hot 
    FROM t4
    GROUP BY t1.category_name  
    ORDER BY hot DESC;
    

Final code:

SELECT  
    t4.category_name,  
    COUNT(*) hot 
FROM 
	(
	SELECT 
    	explode(category) category_name
	FROM
        (
		SELECT 
    		g.category
		FROM 
            (
			SELECT 
    			explode(relatedId) related_id
			FROM
                (
               	SELECT  
   					relatedId,  
   					views, 
				FROM  
   					gulivideo_orc 
				ORDER BY views DESC  
				LIMIT 50
                )t1
            )t2
		JOIN gulivideo_orc g
		ON t2.related_id = g.vedioId
        )t3
    )t4
GROUP BY t4.category_name  
ORDER BY hot DESC;

Other contents are not listed one by one.

Topics: Big Data Hadoop hive