A complete collection of common commands for big data development

Posted by mwd2005 on Wed, 15 Dec 2021 07:17:59 +0100

Linux(vi/vim)

Pattern

 

Edit mode

 

Instruction mode

 

Compression and decompression

gzip/gunzip compression

(1) Only files can be compressed, not directories

(2) The original file is not kept

gzip compression: gzip hello.txt

gunzip decompression: gunzip hello.txt.gz

zip/unzip compression

You can compress directories and keep source files

zip compression (compress hello.txt and world.txt into hello.zip): zip hello.zip hello.txt world.txt

unzip decompression: unzip hello.zip

unzip to a specified directory: unzip hello.zip -d /opt

tar packaging

tar compress multiple files: tar -zcvf hello.tar.gz hello.txt world.txt

tar compress a directory: tar -zcvf hello.tar.gz /opt/

tar extract to the current directory: tar -zxvf hello.tar.gz

tar extract to a specified directory: tar -zxvf hello.tar.gz -C /opt

RPM

RPM query command: rpm -qa |grep firefox

RPM uninstall command:

rpm -e xxxxxx

rpm -e --nodeps xxxxxx (do not check dependencies)

RPM installation command:

rpm -ivh xxxxxx.rpm

rpm -ivh --nodeps xxxxxx.rpm (--nodeps: do not check dependencies)

 

Shell

Input / output redirection

 

Script editing

 

Hadoop

Start class Command

 

hadoop fs/hdfs dfs command

 

 

yarn command

 

Zookeeper

Start command

 

basic operation

 

Four letter command

 

Kafka

Note: only one host is written in each command here. You can also run the scripts directly, e.g. ./bin/kafka-topics.sh

View all topics on the current server

kafka-topics --zookeeper xxxxxx:2181 --list --exclude-internal 

Explanation:

--exclude-internal: exclude Kafka internal topics

For example: --exclude-internal  --topic "test_.*"

Create topic

kafka-topics --zookeeper xxxxxx:2181 --create \
--replication-factor 1 \
--partitions 1 \
--topic topic_name

Explanation:

--topic: define the topic name

--replication-factor: define the number of replicas

--partitions: define the number of partitions

Delete topic

Note: delete.topic.enable=true must be set in server.properties; otherwise the topic is only marked for deletion

kafka-topics --zookeeper xxxxxx:2181 --delete --topic topic_name

producer

kafka-console-producer --broker-list xxxxxx:9092 --topic topic_name

Optionally add --property parse.key=true (to send keyed messages)

consumer

kafka-console-consumer --bootstrap-server xxxxxx:9092 --topic topic_name

Note: optional parameters

--from-beginning: read all existing data in the topic from the beginning

--whitelist '.*': consume all topics

--property print.key=true: display the key while consuming

--partition 0: consume from the specified partition

--offset: consume from the specified starting offset

View the details of a Topic

kafka-topics --zookeeper xxxxxx:2181 --describe --topic topic_name

Modify the number of partitions

kafka-topics --zookeeper xxxxxx:2181 --alter --topic topic_name --partitions 6

View a consumer group information

kafka-consumer-groups --bootstrap-server  xxxxxx:9092  --describe --group group_name 

Delete consumer group

kafka-consumer-groups --bootstrap-server xxxxxx:9092 --delete --group group_name

Reset offset

kafka-consumer-groups --bootstrap-server xxxxxx:9092 --group group_name \
--reset-offsets --all-topics --to-latest --execute

leader re-election

Re-elect the leader for a specified partition of a specified topic using the PREFERRED (preferred replica) policy

kafka-leader-election --bootstrap-server xxxxxx:9092 \
--topic topic_name --election-type PREFERRED --partition 0

Re-elect leaders for all partitions of all topics using the PREFERRED (preferred replica) policy

kafka-leader-election --bootstrap-server xxxxxx:9092 \
--election-type preferred --all-topic-partitions

Query kafka version information

kafka-configs --bootstrap-server xxxxxx:9092 \
--describe --version

Add, delete and change configuration

 

topic add / modify dynamic configuration

kafka-configs --bootstrap-server xxxxxx:9092 \
--alter --entity-type topics --entity-name topic_name \
--add-config file.delete.delay.ms=222222,retention.ms=999999

topic delete dynamic configuration

kafka-configs --bootstrap-server xxxxxx:9092 \
--alter --entity-type topics --entity-name topic_name \
--delete-config file.delete.delay.ms,retention.ms

Continuous batch pull messages

Consume at most 10 messages per run (without this parameter, consumption is continuous)

kafka-verifiable-consumer --bootstrap-server xxxxxx:9092 \
--group group_name \
--topic topic_name --max-messages 10

Delete message for specified partition

Delete the messages of a specified partition of a topic up to offset 1024

JSON file: offset-json-file.json

{
    "partitions": [
        {
            "topic": "topic_name",
            "partition": 0,
            "offset": 1024
        }
    ],
    "version": 1
}
kafka-delete-records --bootstrap-server xxxxxx:9092 \
--offset-json-file offset-json-file.json

View Broker disk information

Query the disk information of the specified topic

kafka-log-dirs --bootstrap-server xxxxxx:9090 \
--describe --topic-list topic1,topic2

Query specified Broker disk information

kafka-log-dirs --bootstrap-server xxxxxx:9090 \
--describe --topic-list topic1 --broker-list 0

Hive

Startup class

 

Script to gracefully start and stop the Hive metadata services (metastore and hiveserver2)

Start:   hive.sh start
Stop:    hive.sh stop
Restart: hive.sh restart
Status:  hive.sh status

The script is as follows

#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs

mkdir -p $HIVE_LOG_DIR

#Check whether the process is running normally. Parameter 1 is the process name and parameter 2 is the process port
function check_process()
{
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
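    # The process is considered healthy only if the PID found by name contains the PID listening on the given port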
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    cmd=$cmd" sleep 4; hdfs dfsadmin -safemode wait >/dev/null 2>&1"
    [ -z "$metapid" ] && eval $cmd || echo "Metastore service has already started"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 service has already started"
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    [ "$metapid" ] && kill $metapid || echo "Metastore Service not started"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2 Service not started"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore service is running normally" || echo "Metastore service is not running normally"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 service is running normally" || echo "HiveServer2 service is not running normally"
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac

Common interactive commands

 

SQL class (special)

 

Built in function

(1) NVL

Assigns a default value to data that is NULL. Its format is NVL(value, default_value): if value is NULL, the NVL function returns default_value; otherwise it returns value. If both parameters are NULL, it returns NULL.

select nvl(column, 0) from xxx;

(2) Row to column
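
A minimal sketch of the usual Hive row-to-column pattern (CONCAT_WS combined with COLLECT_SET), assuming a hypothetical table person_info(name, constellation, blood_type):

-- collect_set gathers the names of each (constellation, blood_type) group into an array,
-- and concat_ws joins them into a single string, turning multiple rows into one column value
select
    concat(constellation, ',', blood_type) base,
    concat_ws('|', collect_set(name)) names
from person_info
group by constellation, blood_type;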

 

(3) Column to row (one column to multiple rows)

split(str, separator): splits the string by the given separator into an array of strings.

explode(col): splits a complex array or map structure in a Hive column into multiple rows.

「LATERAL VIEW」

Usage:

LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: LATERAL VIEW is used together with UDTFs such as split and explode. It can split one row of data into multiple rows, and the split data can then be aggregated.

LATERAL VIEW first calls the UDTF for each row of the original table; the UDTF splits that row into one or more rows, and LATERAL VIEW then joins the results into a virtual table with the given alias.

"Prepare the test data source" (table movie_info with columns movie and category; the rows below are inferred from the test results)

Meritorious service    record,plot
War wolf 2             warfare,action,disaster

「SQL」

SELECT movie, category_name
FROM movie_info
LATERAL VIEW explode(split(category, ",")) movie_info_tmp AS category_name;

"Test results"

Meritorious service    record
Meritorious service    plot
War wolf 2             warfare
War wolf 2             action
War wolf 2             disaster

Window function

(1) OVER()

Specifies the window of data that the analytic function operates on; the window size can change from row to row.

(2) CURRENT ROW: the current row

n PRECEDING: the previous n rows

n FOLLOWING: the following n rows

(3) UNBOUNDED

UNBOUNDED PRECEDING: no boundary in front, i.e. the window starts at the first row

UNBOUNDED FOLLOWING: no boundary behind, i.e. the window ends at the last row

SQL case: aggregation from starting point to current row

select 
    sum(money) over(partition by user_id order by pay_time rows between UNBOUNDED PRECEDING and current row) 
from or_order;

"SQL case: aggregate the current row and the previous row"

select 
    sum(money) over(partition by user_id order by pay_time rows between 1 PRECEDING and current row) 
from or_order;

"SQL case: aggregate the current row with the previous row and the next row"

select 
    sum(money) over(partition by user_id order by pay_time rows between 1 PRECEDING AND 1 FOLLOWING )
from or_order;

"SQL case: current line and all subsequent lines"

select 
    sum(money) over(partition by user_id order by pay_time rows between current row and UNBOUNDED FOLLOWING  )
from or_order;

(4) LAG(col, n, default_val)

Returns the value of col from the nth preceding row; if that row does not exist, default_val is returned.

(5) LEAD(col, n, default_val)

Returns the value of col from the nth following row; if that row does not exist, default_val is returned.

"SQL case: query user purchase details, last purchase time and next purchase time"

select 
 user_id, pay_time, money,
 lag(pay_time,1,'1970-01-01') over(partition by user_id order by pay_time) prev_time,
 lead(pay_time,1,'1970-01-01') over(partition by user_id order by pay_time) next_time
from or_order;

(6) FIRST_VALUE(col, true/false)

The first value in the current window; when the second parameter is true, NULL values are skipped.

(7) LAST_VALUE(col, true/false)

The last value in the current window; when the second parameter is true, NULL values are skipped.

"SQL case: query the first purchase time and the last purchase time of each month"

select
 FIRST_VALUE(pay_time) 
     over(
         partition by user_id,month(pay_time) order by pay_time 
         rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
         ) first_time,
 
 LAST_VALUE(pay_time) 
     over(partition by user_id,month(pay_time) order by pay_time rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
     ) last_time
from or_order;

(8)NTILE(n)

Distributes the rows of an ordered window into n buckets, numbered starting from 1; for each row, NTILE returns the number of the bucket the row belongs to. (Used to split ordered data into n slices and return the slice number of the current row.)

"SQL case: query the order information in the first 25% of the time"

select * from (
    select user_id, pay_time, money,
    
    ntile(4) over(order by pay_time) sorted
    
    from or_order
) t
where sorted = 1;

The four By clauses

(1)Order By

For global sorting, there is only one Reducer.

(2)Sort By

Sorting within each reducer: each partition's output is ordered.

(3) Distribute By

Similar to partitioning in MapReduce; it is usually used together with Sort By.

(4) Cluster By

When the Distribute By and Sort By fields are the same, Cluster By can be used. Besides the function of Distribute By, Cluster By also performs Sort By; however, the sort is ascending only, and ASC or DESC cannot be specified.

In production, Order By is rarely used because it can easily lead to OOM (all data goes through a single reducer).

In production, Sort By + Distribute By is used more often, as in the sketch below.
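
A minimal sketch of Sort By + Distribute By, reusing the or_order table from the examples above (assumed columns user_id, pay_time, money):

-- distribute rows to reducers by user_id, then sort within each reducer by pay_time
select user_id, pay_time, money
from or_order
distribute by user_id
sort by pay_time;

-- when the Distribute By and Sort By fields are identical, Cluster By is the shorthand (ascending only)
select user_id, pay_time, money
from or_order
cluster by user_id;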

Sorting function

(1)RANK()

Equal values receive the same rank; gaps are left afterwards, so the total count does not change

1
1
3
3
5

(2)DENSE_RANK()

Equal values receive the same rank with no gaps, so the highest rank is reduced

1
1
2
2
3

(3)ROW_NUMBER()

Numbers rows consecutively in order, with no repeats

1
2
3
4
5
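
A minimal sketch comparing the three functions, assuming a hypothetical table score(name, subject, score):

-- rank() leaves gaps after ties, dense_rank() does not, row_number() never repeats
select
    name, subject, score,
    rank()       over(partition by subject order by score desc) rk,
    dense_rank() over(partition by subject order by score desc) drk,
    row_number() over(partition by subject order by score desc) rn
from score;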

Date function

datediff: returns the number of days from the end date minus the start date

datediff(string enddate, string startdate) 

select datediff('2021-11-20','2021-11-22') 

date_add: returns the date obtained by adding days to startdate

date_add(string startdate, int days) 

select date_add('2021-11-20',3) 

date_sub: returns the date obtained by subtracting days from startdate

date_sub (string startdate, int days) 

select date_sub('2021-11-22',3)

Redis

Startup class

key

 

String

 

List

 

Set

 

Hash

 

zset(Sorted set)

 

Flink

start-up

./start-cluster.sh 

run

./bin/flink run [OPTIONS]

./bin/flink run -m yarn-cluster -c com.wang.flink.WordCount /opt/app/WordCount.jar

 

info

./bin/flink info [OPTIONS]

 

list

./bin/flink list [OPTIONS]

 

stop

./bin/flink stop  [OPTIONS] <Job ID>

 

cancel (de-emphasized in favor of stop)

./bin/flink cancel  [OPTIONS] <Job ID>

 

savepoint

./bin/flink savepoint  [OPTIONS] <Job ID>

 

Original author: Wang Gebo

Topics: Big Data