Case study: Using MapReduce to implement a join operation

Posted by dvonj on Mon, 11 Nov 2019 09:40:01 +0100

Happy Mid-Autumn Festival, everyone. I haven't published a new article in a while; today I'd like to share how to implement a join with MapReduce.

In offline computing we often work with more than a single file: data from two or more files has to be associated, much like a join in SQL.
Today I'll show how to implement a join in MapReduce.

Requirement

There are two tables: a product information table and an order table. The order table stores only the product ID, so a join is needed to retrieve an order together with the details of its product.

Implementation

A well-known MapReduce property is that all key-value pairs sharing the same key are delivered to the same reduce() call (absent a custom partitioner).
Using this property we can implement a join quite easily; see the example below.

Product Table

ID    brand  model
p0001 Apple  iphone11 pro max
p0002 Huawei p30
p0003 Xiaomi mate10

Order Table

id    name address                    productID num
00001 kris Futian District, Shenzhen  p0001     1
00002 pony Nanshan District, Shenzhen p0001     2
00003 jack Bantian District, Shenzhen p0001     3

When the data volume is huge and both tables are stored as files in HDFS, a MapReduce program is needed to carry out the equivalent of this SQL query:

select a.id, a.name, a.address, a.num, b.brand, b.model from t_orders a join t_products b on a.productID = b.ID

MapReduce implementation ideas

The join condition (productID) is used as the map output key, so that rows from both tables satisfying the join condition, each tagged with the name of the file it came from, are sent to the same reduce task, where the data is joined.
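
Concretely, every map output record is keyed by productID and carries a flag marking which file it came from, for example:

p0001 -> (flag=0, Apple, iphone11 pro max)    from the products file
p0001 -> (flag=1, 00001, kris, 1)             from the orders file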

Implementation 1: reduce-side join

Define a Bean

public class RJoinInfo implements Writable{
    private String customerName="";
    private String customerAddr="";
    private String orderID="";
    private int orderNum;
    private String productID="";
    private String productBrand="";
    private String productModel="";
    // flag: 0 = product record, 1 = order record
    private int flag;

    // setters/getters omitted; the Writable methods are sketched below
}
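
Since RJoinInfo implements Writable, it also needs write() and readFields() methods; here is a minimal sketch of what they could look like. The one hard requirement is that both methods handle the fields in exactly the same order. Because RJoinInfo also serves as the job's final output key, a toString() that joins the fields with tabs is worth adding too.

@Override
public void write(DataOutput out) throws IOException {
    // Serialization order must match readFields() below
    out.writeUTF(customerName);
    out.writeUTF(customerAddr);
    out.writeUTF(orderID);
    out.writeInt(orderNum);
    out.writeUTF(productID);
    out.writeUTF(productBrand);
    out.writeUTF(productModel);
    out.writeInt(flag);
}

@Override
public void readFields(DataInput in) throws IOException {
    customerName = in.readUTF();
    customerAddr = in.readUTF();
    orderID = in.readUTF();
    orderNum = in.readInt();
    productID = in.readUTF();
    productBrand = in.readUTF();
    productModel = in.readUTF();
    flag = in.readInt();
}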

Write Mapper

public class RJoinMapper extends Mapper<LongWritable,Text,Text,RJoinInfo> {
    private static Logger logger = LogManager.getLogger(RJoinMapper.class);
    private RJoinInfo rJoinInfo = new RJoinInfo();
    private Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The input could come from other sources (a database, etc.);
        // here it is a file, so the cast to FileSplit is safe
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        // Get the file name to tell the two tables apart
        String name = fileSplit.getPath().getName();
        logger.info("splitPathName:" + name);

        String line = value.toString();
        String[] split = line.split("\t");

        String productID = "";

        if (name.contains("product")) {
            productID = split[0];
            String productBrand = split[1];
            String productModel = split[2];

            rJoinInfo.setProductID(productID);
            rJoinInfo.setProductBrand(productBrand);
            rJoinInfo.setProductModel(productModel);
            rJoinInfo.setFlag(0);
        } else if (name.contains("orders")) {
            String orderID = split[0];
            String customerName = split[1];
            String customerAddr = split[2];
            productID = split[3];
            String orderNum = split[4];

            rJoinInfo.setProductID(productID);
            rJoinInfo.setCustomerName(customerName);
            rJoinInfo.setCustomerAddr(customerAddr);
            rJoinInfo.setOrderID(orderID);
            rJoinInfo.setOrderNum(Integer.parseInt(orderNum));
            rJoinInfo.setFlag(1);
        }

        k.set(productID);
        context.write(k, rJoinInfo);
    }
}

Code explanation: the file name of the split tells us whether a record comes from the products file or the orders file; the fields are parsed accordingly, and in both cases the record is sent to the reducer with productID as the key.
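
A design note: the mapper reuses the same rJoinInfo and Text instances across map() calls. This is safe because context.write() serializes the objects immediately, and it avoids allocating two fresh objects for every input record.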

Write Reducer

public class RJoinReducer extends Reducer<Text,RJoinInfo,RJoinInfo,NullWritable> {
    private static Logger logger = LogManager.getLogger(RJoinReducer.class);

    @Override
    protected void reduce(Text key, Iterable<RJoinInfo> values, Context context) throws IOException, InterruptedException {
        List<RJoinInfo> orders = new ArrayList<>();

        String productID = key.toString();
        logger.info("productID:" + productID);
        RJoinInfo rJoinInfo = new RJoinInfo();

        // Hadoop reuses the same value object across the iteration,
        // so each record must be copied out before moving on
        for (RJoinInfo value : values) {
            int flag = value.getFlag();
            if (flag == 0) {
                // product record: copy into the standalone bean
                try {
                    BeanUtils.copyProperties(rJoinInfo, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    logger.error(e.getMessage());
                }
            } else {
                // order record: copy into the list
                RJoinInfo orderInfo = new RJoinInfo();
                try {
                    BeanUtils.copyProperties(orderInfo, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    logger.error(e.getMessage());
                }
                orders.add(orderInfo);
            }
        }

        for (RJoinInfo order : orders) {
            rJoinInfo.setOrderNum(order.getOrderNum());
            rJoinInfo.setOrderID(order.getOrderID());
            rJoinInfo.setCustomerName(order.getCustomerName());
            rJoinInfo.setCustomerAddr(order.getCustomerAddr());

            // Only the key is written; the value can be NullWritable
            context.write(rJoinInfo, NullWritable.get());
        }
    }
}

Code explanation: because the records are grouped by productID, each reduce call receives one product object and several order objects for that product, as illustrated below. We iterate over the values, telling products from orders by the flag: the product is copied into a standalone bean and every order is copied into a list (the copies matter, because Hadoop reuses the same value object across the iteration). Once the values are classified, we traverse the order list, merge each order with the product information, and write the result out.
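
For the sample data above, the single reduce group looks like this:

key:    p0001
values: (flag=0, Apple, iphone11 pro max)
        (flag=1, 00001, kris, Futian District, Shenzhen, 1)
        (flag=1, 00002, pony, Nanshan District, Shenzhen, 2)
        (flag=1, 00003, jack, Bantian District, Shenzhen, 3)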

Note: this implementation is not the most efficient; the main goal here is to illustrate the idea of the join.

Write Driver

public class RJoinDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
//        conf.set("mapreduce.framework.name","yarn");
//        conf.set("yarn.resourcemanager.hostname","server1");
//        conf.set("fs.defaultFS","hdfs://server1:9000");
        conf.set("mapreduce.framework.name","local");
        conf.set("fs.defaultFS","file:///");

        Job job = Job.getInstance(conf);

//       When running locally there is no need to set the jar path, since the jar is not copied anywhere
        job.setJarByClass(RJoinDriver.class);
//        job.setJar("/Users/kris/IdeaProjects/bigdatahdfs/target/rjoin.jar");

        job.setMapperClass(RJoinMapper.class);
        job.setReducerClass(RJoinReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(RJoinInfo.class);
        job.setOutputKeyClass(RJoinInfo.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job,new Path("/Users/kris/Downloads/rjoin/input"));
        FileOutputFormat.setOutputPath(job,new Path("/Users/kris/Downloads/rjoin/output"));

        boolean waitForCompletion = job.waitForCompletion(true);
        System.out.println(waitForCompletion);
    }
}
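
Running this against the sample tables, and assuming RJoinInfo.toString() (omitted above) joins the fields with tabs, the output would look something like:

00001   kris   Futian District, Shenzhen    1   p0001   Apple   iphone11 pro max
00002   pony   Nanshan District, Shenzhen   2   p0001   Apple   iphone11 pro max
00003   jack   Bantian District, Shenzhen   3   p0001   Apple   iphone11 pro max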

==The drawback of this implementation is that the whole join is done in the reduce phase: the reduce side carries most of the processing load while the map nodes do little computation, resources are underutilized, and the reduce phase is prone to data skew.==

Implementation 2: map-side join

This approach applies when one of the joined tables is small:
the small table can be distributed to every map node, so each map task can join it locally against the portion of the large table it reads and output the result directly.
This greatly increases the parallelism of the join and speeds up processing.

Write Mapper

On the Mapper side we either load the data up front or use the DistributedCache to copy the file to every node that runs a map task.

Here we use the second option, caching the small table in a map inside the Mapper class:
static class RjoinMapper extends Mapper<LongWritable,Text,RJoinInfo,NullWritable>{

    private static Map<String, RJoinInfo> productMap = new HashMap<>();

    // setup() is called once, before map() is called in a loop,
    // so we can load the small table here in advance
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {

        // Print the registered cache-file URIs, handy for verification while testing
        URI[] cacheFiles = context.getCacheFiles();
        System.out.println(Arrays.toString(cacheFiles));

        // Refer to the cached file directly by name; by default it is
        // looked up in the task's working directory
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream("products.txt")))){

            String line;
            while ((line = bufferedReader.readLine())!=null){
                String[] split = line.split("\t");
                String productID = split[0];
                String productBrand = split[1];
                String productModel = split[2];

                RJoinInfo rJoinInfo = new RJoinInfo();
                rJoinInfo.setProductID(productID);
                rJoinInfo.setProductBrand(productBrand);
                rJoinInfo.setProductModel(productModel);
                rJoinInfo.setFlag(0);
                productMap.put(productID, rJoinInfo);
            }
        }

        super.setup(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit)context.getInputSplit();

        String name = fileSplit.getPath().getName();

        if (name.contains("orders")) {
            String line = value.toString();

            String[] split = line.split("\t");
            String orderID = split[0];
            String customerName = split[1];
            String customerAddr = split[2];
            String productID = split[3];
            String orderNum = split[4];

            // Look up the product in the cached map and enrich it with the
            // order fields; the shared bean is written out immediately,
            // before the next order mutates it again
            RJoinInfo rJoinInfo = productMap.get(productID);
            rJoinInfo.setProductID(productID);
            rJoinInfo.setCustomerName(customerName);
            rJoinInfo.setCustomerAddr(customerAddr);
            rJoinInfo.setOrderID(orderID);
            rJoinInfo.setOrderNum(Integer.parseInt(orderNum));
            rJoinInfo.setFlag(1);

            context.write(rJoinInfo, NullWritable.get());
        }
    }
}

Code explanation: here we override setup(), which runs before map() is called, so the small table can be loaded in advance. In the code above we open products.txt simply by its file name; how that file gets copied onto each map task's node is determined by the driver below.
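
As a side note, rather than hardcoding the name, the file name can also be derived from the cache-file URIs registered on the job; a small sketch of that variant inside setup():

// Derive the cached file's name from the registered cache-file URIs
// instead of hardcoding "products.txt"
URI[] cacheFiles = context.getCacheFiles();
String cachedName = new Path(cacheFiles[0].getPath()).getName();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(cachedName)))) {
    // ... same parsing loop as above ...
}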

Write Driver

public class RJoinDemoInMapDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name","local");
        conf.set("fs.defaultFS","file:///");

        Job job = Job.getInstance(conf);
        job.setJarByClass(RJoinDemoInMapDriver.class);

        job.setMapperClass(RjoinMapper.class);
        job.setOutputKeyClass(RJoinInfo.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job,new Path("/Users/kris/Downloads/rjoin/input"));
        FileOutputFormat.setOutputPath(job,new Path("/Users/kris/Downloads/rjoin/output2"));

        // Cache a file to the working directory of every node that runs a map task.
        // Related methods:
        //   job.addFileToClassPath()    - caches a plain file on the task node's classpath
        //   job.addArchiveToClassPath() - caches a jar on the task node's classpath
        //   job.addCacheArchive()       - caches an archive into the task node's working directory
        //   job.addCacheFile()          - caches a plain file into the task node's working directory
        job.addCacheFile(new URI("/Users/kris/Downloads/rjoin/products.txt"));

        // Set the number of reduce tasks to 0: this is a map-only job
        job.setNumReduceTasks(0);

        boolean waitForCompletion = job.waitForCompletion(true);
        System.out.println(waitForCompletion);
    }
}

Code explanation: in the driver above we pass a local URI to job.addCacheFile(); at run time MapReduce copies this file into the working directory of every map task.
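
One more useful detail: the URI passed to addCacheFile() can carry a fragment, which becomes the name of the symlink created in the task's working directory:

// With the "#products" fragment the cached file appears in the task's
// working directory under the name "products", whatever its original name
job.addCacheFile(new URI("/Users/kris/Downloads/rjoin/products.txt#products"));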

Okay, that was a lot of code for one post; the main point was to share ideas on how to implement joins with MapReduce. Next time I'll walk through the idea and code for computing mutual friends.
