An important basic work in data governance is to analyze the blood relationship of data in the organization. With complete data kinship, we can use it for data traceability, impact analysis of table and field changes, proof of data compliance, inspection of data quality, etc.
The methods of analyzing blood relationship of data are mainly divided into four categories
- Automatic parsing
- System tracking
- machine learning
- Manual collection
Automatic parsing mainly uses tools to parse SQL statements, stored procedures, ETL and other files.
This paper takes Oracle as an example to illustrate how to analyze the blood relationship between SQL and data in stored procedures.
SQL statement generating data blood relationship
- SELECT
- INSERT
- UPDATE
- MERGE
- CREATE VIEW
- CREATE TABLE
- stored procedure
SELECT
You may be surprised that the SELECT statement does not add or modify data. How can data blood relationship be generated?
The secret lies in the select list part of the SELECT statement, where data can be converted.
Take the following SELECT statement as an example:
select sal + commission as totalSal from emp;
We can see that the data of totalSal field comes from emp.sal and emp.commission. Here, the data is converted.
The blood relationship of the data generated in the SELECT is temporary, but the SELECT statement and CREATE VIEW or CREATE TABLE
Once combined, this data blood relationship is really formed. For example:
create view v_sal(mySal) as select sal + commission as totalSal from emp;
The CREATE VIEW statement forms the from emp.sal and emp.commission to V through SELECT_ Sal.mysal's data blood relationship.
Therefore, we can know that analyzing the SELECT statement is the basis of data kinship analysis of SQL statements. Generally speaking, it is intuitive and simple to obtain data blood relationship by analyzing SQL statements, but the key problem is that the efficiency of manual analysis is too low. For many SQL statements in the enterprise, manual analysis is basically impossible.
stored procedure
Stored procedures can contain more complex logical processing, such as conditional judgment, cyclic branching, etc. Therefore, it is often used to complete data extraction, conversion, loading, cleaning and other tasks.
Among them, there are a lot of data blood relationship. In order to better manage the data in the enterprise, it is essential to sort out the data blood relationship in the storage process.
The cursor is a key factor in analyzing the data blood relationship in the stored procedure. The data flow is generally processed around the cursor.
In the following Oracle PL/SQL stored procedure, the cursor cur is defined first_ stclerk.
DECLARE CURSOR cur_stclerk IS SELECT employee_id, department_id, first_name, last_name FROM employees WHERE job_id = 'ST_CLERK';
Then, through this cursor, LOOP with LOOP to insert the data from the employees table into the EMP table_ Temp and emp_detls_temp.
INSERT INTO emp_temp (employee_id, department_id, job_id) VALUES (z_empid, z_depid, 'ST_CLERK'); INSERT INTO emp_detls_temp (employee_id, empname) VALUES (z_empid, z_firstname || ' ' || z_lastname); END LOOP; CLOSE cur_stclerk; COMMIT; END;
Therefore, we can create a table from employees to emp_temp and emp_detls_temp data blood.
Complete PLSQL stored procedure.
DECLARE z_empid employees.employee_id%TYPE; z_depid employees.department_id%TYPE; z_firstname employees.first_name%TYPE; z_lastname employees.last_name%TYPE; CURSOR cur_stclerk IS SELECT employee_id, department_id, first_name, last_name FROM employees WHERE job_id = 'ST_CLERK'; BEGIN OPEN cur_stclerk; LOOP FETCH cur_stclerk INTO z_empid,z_depid,z_firstname, z_lastname; EXIT WHEN cur_stclerk%NOTFOUND; INSERT INTO emp_temp (employee_id, department_id, job_id) VALUES (z_empid, z_depid, 'ST_CLERK'); INSERT INTO emp_detls_temp (employee_id, empname) VALUES (z_empid, z_firstname || ' ' || z_lastname); END LOOP; CLOSE cur_stclerk; COMMIT; END;
Automated data kinship analysis tool
SQLFlow Support the analysis of SQL statements of more than 20 mainstream databases. The supported databases include bigquery, couchbase, DAX, DB2, greenplus, Hana, hive, impala, Informix, mdx, mysql, netezza, ODBC, openedge, Oracle, PostgreSQL, redshift, snowflake, sparksql, sqlserver, Sybase, Teradata, and vertica.
Support analysis of stored procedures and dynamic SQL statements.
Analyze SQL statements through UI
adopt SQLFlow The UI can quickly obtain the data blood relationship of an SQL, and get visual results to help users quickly understand the data blood relationship in an SQL.
Analyze the blood relationship of SQL data through API
Sometimes, we need to store the analyzed data blood as metadata in our own data governance platform and integrate it with other metadata. At this time, we can use it SQLFlow Provided Restful API , use scripts such as shell and python to automate data kinship analysis.
Here, curl is used to show how to use API to access SQLFlow for data kinship analysis.
- Get ready to use the API to connect to SQLFlow userid and secret code
- Use userid and secret code to obtain token. Note that replace the corresponding part of the command with your own userid and secret code.
curl -X POST "https://api.gudusoft.com/gspLive_backend/user/generateToken" -H "Request-Origion:testClientDemo" -H "accept:application/json;charset=utf-8" -H "Content-Type:application/x-www-form-urlencoded;charset=UTF-8" -d "secretKey=YOUR SECRET KEY" -d "userId=YOUR USER ID HERE"
- Submit the SQL statements to be analyzed to SQLFlow for analysis and return the processing results. Note that replace the corresponding part of the command with your own userid and the token returned in the above command.
curl -X POST "https://api.gudusoft.com/gspLive_backend/sqlflow/generation/sqlflow?showRelationType=fdd" -H "Request-Origion:testClientDemo" -H "accept:application/json;charset=utf-8" -H "Content-Type:multipart/form-data" -F "sqlfile=" -F "dbvendor=dbvoracle" -F "ignoreRecordSet=true" -F "simpleOutput=false" -F "sqltext=create view v_sal(mySal) as select sal + commission as totalSal from emp;" -F "userId=YOUR USER ID HERE" -F "token=YOUR TOKEN HERE"
- The returned json contains the following data blood relationship.
"relations": [ { "id": "3", "type": "fdd", "effectType": "create_view", "target": { "id": "11", "column": "mySal", "parentId": "9", "parentName": "v_sal", }, "sources": [ { "id": "3", "column": "sal", "parentId": "2", "parentName": "emp", }, { "id": "4", "column": "commission", "parentId": "2", "parentName": "emp", } ], "processId": "10" } ]
See the official website for more details SQLFlow Restful API.
Summary
This paper introduces how to analyze Oracle SQL statements and stored procedures to obtain the data blood relationship in the organization, so as to better carry out data governance.
It also introduces how to use SQLFlow The tool automates the analysis of data kinship in SQL statements and improves the efficiency and level of data self-care.