Hadoop 8-day course - day 6, MapReduce details

Posted by mainewoods on Wed, 29 Apr 2020 14:54:07 +0200


About hive

A tool for translating sql statements into mapreduce programs.

Create table statement

CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 FIELDS TERMINATED BY '\001' //Specifies the separator between fields
 STORED AS SEQUENCEFILE;     //Specify the type of file

After the table is created, hive will save the metadata of the table to the metadata database, which is dobby database by default. mysql is recommended.

Import data

Import data after table creation:

create table tab_ip(id int,name string,ip string,country string) 
    row format delimited
    fields terminated by ','
    stored as textfile;
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;

In fact, importing data is to upload data to a folder in hdfs. If you create a well formatted file and put it in the corresponding hdfs folder, you can also get results by querying in hive. If some fields are missing, the values will be assigned to the previous fields in book order.

Create appearance

By creating a facade, you can directly associate a table with a file in an hdfs folder. However, the operation of creating a table will not have any impact on the file: the file will not be moved when the table is created, and the file will not be deleted synchronously when the table is deleted.

CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
     ip STRING,
     country STRING)
 LOCATION '/external/hive';

CTAS is used to create temporary tables to store intermediate results

CREATE TABLE tab_ip_ctas
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;

This operation is used to create some temporary tables to store intermediate results. It will also create a folder with the same name as the table on hdfs. The data under the folder is actually the output of mr program of select statement.

Bulk insert (data append)

create table tab_ip_like like tab_ip;

insert overwrite table tab_ip_like
    select * from tab_ip;

The action appends intermediate result data to the temporary table.

Create partition table

create table tab_ip_part(id int,name string,ip string,country string) 
    partitioned by (part_flag string)
    row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_part
select * from tab_ip_part;

select * from tab_ip_part  where part_flag='part2';

show partitions tab_ip_part;    //Display partition information of the table

The partition field can be in a table field or it can be a field that is not in the table. In fact, under the table folder, create a partition folder for each partition to store the files corresponding to the partition. When querying a partition separately, you only need to specify the value of the partition key in the where clause.

Other hql syntax

//write to hdfs
insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where part_flag='part1';    
insert overwrite directory '/hiveout.txt' select * from tab_ip_part where part_flag='part1';

//Data type of array
create table tab_array(a array<int>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

//map data type
create table tab_map(name string,info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ';'
map keys terminated by ':';

hive custom function

Write a java class to define the implementation logic of the function, and then make a jar package and upload it to the lib directory of hive. Then define a function in hive and manage the java classes in the jar package.

Requirement: implement a hive custom function to return the location of a phone number.


  1. To inherit the concrete logic of implementing the evaluate method from the org.apache.hadoop.hive.ql.exec.UDF class
  2. Add jar package (execute in hive command line): hive> add jar /home/hadoop/hiveareaudf.jar;
  3. To create a temporary function: create temporary function getarea as 'cn.itcast.bigdata.PhoneNbrToArea';

Code of java part:

package cn.itcast.bigdata;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PhoneNbrToArea extends UDF{

	private static HashMap<String, String> areaMap = new HashMap<>();
	static {
		areaMap.put("1388", "beijing");
		areaMap.put("1399", "tianjin");
		areaMap.put("1366", "nanjing");
	//It must be decorated with public to be called by hive
	public String evaluate(String pnb) {
		String result  = areaMap.get(pnb.substring(0,4))==null? (pnb+"    huoxing"):(pnb+"  "+areaMap.get(pnb.substring(0,4)));		
		return result;


Distributed, elastic and scalable large-scale structured data management system is suitable for random and real-time data reading and writing. hbase is based on hdfs file system. hbase does not provide table Association operations and transaction features, but supports large-scale single tables. In a large scale, hbase belongs to nosql database.

The technical details of hbase can refer to the Chinese documents on the official website. Xiaomi team will write their own experience about hbase on a website, you can have a look.

1. Table structure of HBase

Disadvantages of traditional relational database tables

  1. Table structure is not easy to change

Table structure of hbase

When you create a table, you do not specify a column name, but a column family.

When storing data, each column family can store multiple key value pairs, and the number of key value pairs of the same column family of each tuple can be different. The value of each key value pair can record multiple versions. Each key and its multiple version values are combined to form a cell.

When querying, you need to specify the coordinates: table name - > row key - > column family: column name - > version number.

How hbase stores super large tables in a distributed way.

Region: row segmentation of a user table. The number of rows is not necessarily the same, depending on whether the region has reached a specified size. Regions of a table will be placed in multiple region server s.

Different column families in a region are stored as a store, and the data with high access frequency will be stored in a memory, so as to improve the access efficiency.

HMaster: not responsible for storing specific data, mainly responsible for the management of region server and the load balance between region servers.

System table META table: stores META information about the distributed storage of tables, such as the information about the start and end row keys of each region table, and which region server the entire region is on. The META table may also be very large, and it is divided into multiple regions for distributed storage. The system table that records META information of META region of META data is called ROOT table. The ROOT table will not be distributed storage, it will be completely stored in a region server. Which region server the ROOT table is on will store this information on zookeeper. So make sure to query a row of data through three levels of addressing: ROOT table, region table of META and region table of user table.

shell operation of hbase

hbase and hadoop need to be consistent.

hbase installation

  1. We need to put hdfs-site.xml and core-site.xml of hadoop under hbase/conf.

  2. Modify HBase env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_55
//Tell hbase to use external zk
export HBASE_MANAGES_ZK=false
  1. Modify hbase-site.xml
	<!-- Appoint hbase stay HDFS Path stored on -->
	<!-- Appoint hbase It's distributed -->
	<!-- Appoint zk Address, multiple“,"Division -->

Java API of hbase

hbase filter

Topics: Big Data HBase hive Hadoop Java