Happy Mid-Autumn Festival, everyone. It's been a while since my last article. In offline computing we often need to work with more than a single file, associating data across two or more files, much like a join in SQL. Today I'd like to share how to implement a join in MapReduce.
Requirements
There are two tables: a product table and an orders table. The orders table stores only the product ID, so a join is needed to look up an order together with its product information.
Implementation
A well-known property of MapReduce is that all key/value pairs with the same key are delivered to the same reduce() call (with the default partitioner). Using this property we can easily implement a join; see the example below.
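This grouping behavior can be sketched outside Hadoop with a plain-Java simulation of the shuffle (the class and method names here are made up for illustration, not part of the article's code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleGroupingDemo {
    // Group (key, value) pairs by key, the way the shuffle phase delivers
    // all values for one key to a single reduce() call.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Map output tagged with the table each record came from
        List<String[]> mapOutput = Arrays.asList(
                new String[]{"p0001", "product:Apple"},
                new String[]{"p0001", "order:00001"},
                new String[]{"p0002", "product:Huawei"},
                new String[]{"p0001", "order:00002"});
        // Prints each key together with all of its grouped values;
        // every record keyed by p0001 lands in one group
        System.out.println(groupByKey(mapOutput));
    }
}
```

All records sharing a productID end up in one group, which is exactly what the reduce-side join below relies on.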
Product Table
ID | brand | model |
---|---|---|
p0001 | Apple | iphone11 pro max |
p0002 | Huawei | p30 |
p0003 | Xiaomi | mate10 |
Orders Table
id | name | address | productID | num |
---|---|---|---|---|
00001 | kris | Futian District, Shenzhen | p0001 | 1 |
00002 | pony | Nanshan District, Shenzhen | p0001 | 2 |
00003 | jack | Bantian District, Shenzhen | p0001 | 3 |
If the data volume is huge and both tables are stored as files in HDFS, a MapReduce program is needed to implement the equivalent of this SQL query:
select a.id,a.name,a.address,a.num from t_orders a join t_products b on a.productID=b.ID
MapReduce implementation ideas
Use the join condition (productID) as the key of the map output, so that records from both tables that satisfy the join condition, tagged with the file they come from, are sent to the same reduce task, where they are concatenated.
Implementation 1: reduce-side join
Define a Bean
```java
public class RJoinInfo implements Writable {
    private String customerName = "";
    private String customerAddr = "";
    private String orderID = "";
    private int orderNum;
    private String productID = "";
    private String productBrand = "";
    private String productModel = "";
    // 0 means product record, 1 means order record
    private int flag;

    // Writable requires explicit serialization; fields must be
    // written and read back in the same order.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(customerName);
        out.writeUTF(customerAddr);
        out.writeUTF(orderID);
        out.writeInt(orderNum);
        out.writeUTF(productID);
        out.writeUTF(productBrand);
        out.writeUTF(productModel);
        out.writeInt(flag);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        customerName = in.readUTF();
        customerAddr = in.readUTF();
        orderID = in.readUTF();
        orderNum = in.readInt();
        productID = in.readUTF();
        productBrand = in.readUTF();
        productModel = in.readUTF();
        flag = in.readInt();
    }

    // setters/getters (and toString() for text output) omitted
}
```
Write Mapper
```java
public class RJoinMapper extends Mapper<LongWritable, Text, Text, RJoinInfo> {
    private static Logger logger = LogManager.getLogger(RJoinMapper.class);
    private RJoinInfo rJoinInfo = new RJoinInfo();
    private Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Input can come from many sources (files, databases, ...);
        // here it is a file, so the split can safely be cast to FileSplit
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        // Get the file name
        String name = fileSplit.getPath().getName();
        logger.info("splitPathName:" + name);
        String line = value.toString();
        String[] split = line.split("\t");
        String productID = "";
        if (name.contains("product")) {
            productID = split[0];
            String productBrand = split[1];
            String productModel = split[2];
            rJoinInfo.setProductID(productID);
            rJoinInfo.setProductBrand(productBrand);
            rJoinInfo.setProductModel(productModel);
            rJoinInfo.setFlag(0);
        } else if (name.contains("orders")) {
            String orderID = split[0];
            String customerName = split[1];
            String customerAddr = split[2];
            productID = split[3];
            String orderNum = split[4];
            rJoinInfo.setProductID(productID);
            rJoinInfo.setCustomerName(customerName);
            rJoinInfo.setCustomerAddr(customerAddr);
            rJoinInfo.setOrderID(orderID);
            rJoinInfo.setOrderNum(Integer.parseInt(orderNum));
            rJoinInfo.setFlag(1);
        }
        k.set(productID);
        context.write(k, rJoinInfo);
    }
}
```
Code explanation: based on the file name of the split, the mapper determines whether a record comes from products or orders, parses the corresponding fields, and finally emits the record to reduce with productID as the key.
Write Reducer
```java
public class RJoinReducer extends Reducer<Text, RJoinInfo, RJoinInfo, NullWritable> {
    private static Logger logger = LogManager.getLogger(RJoinReducer.class);

    @Override
    protected void reduce(Text key, Iterable<RJoinInfo> values, Context context) throws IOException, InterruptedException {
        List<RJoinInfo> orders = new ArrayList<>();
        String productID = key.toString();
        logger.info("productID:" + productID);
        RJoinInfo rJoinInfo = new RJoinInfo();
        for (RJoinInfo value : values) {
            int flag = value.getFlag();
            if (flag == 0) {
                // product record
                try {
                    BeanUtils.copyProperties(rJoinInfo, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    logger.error(e.getMessage());
                }
            } else {
                // order record
                RJoinInfo orderInfo = new RJoinInfo();
                try {
                    BeanUtils.copyProperties(orderInfo, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    logger.error(e.getMessage());
                }
                orders.add(orderInfo);
            }
        }
        for (RJoinInfo order : orders) {
            rJoinInfo.setOrderNum(order.getOrderNum());
            rJoinInfo.setOrderID(order.getOrderID());
            rJoinInfo.setCustomerName(order.getCustomerName());
            rJoinInfo.setCustomerAddr(order.getCustomerAddr());
            // Output as the key only; NullWritable serves as the value
            context.write(rJoinInfo, NullWritable.get());
        }
    }
}
```
Code explanation: records are grouped by productID, so each reduce call receives one group containing one product object and multiple order objects. We iterate over the values, telling products from orders by the flag: the product object is saved, and each order object is copied into a collection. After the loop, we traverse the order collection, merge each order with the product information, and output the result.
Note: this is not the most efficient implementation; the point here is to illustrate the idea of the join.
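The pairing logic the reducer performs can be illustrated with a small plain-Java sketch, stripped of Hadoop types (the `joinOneKey` method and its string-array records are illustrative, not part of the article's classes):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceJoinSketch {
    // Given the values for one productID: one product record (flag "0")
    // and any number of order records (flag "1"), emit one joined line per order.
    static List<String> joinOneKey(List<String[]> values) {
        String product = null;                    // the single flag-0 record
        List<String> orders = new ArrayList<>();  // all flag-1 records
        for (String[] v : values) {
            if ("0".equals(v[0])) {
                product = v[1];
            } else {
                orders.add(v[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String order : orders) {
            // Concatenate each order with the shared product info
            out.add(order + "\t" + product);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> values = Arrays.asList(
                new String[]{"0", "Apple\tiphone11 pro max"},
                new String[]{"1", "00001\tkris\t1"},
                new String[]{"1", "00002\tpony\t2"});
        // Prints one joined line per order
        joinOneKey(values).forEach(System.out::println);
    }
}
```

Note that all orders for one product must be buffered in memory before they can be paired, which is one source of the reduce-side pressure discussed below.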
Write Driver
public class RJoinDriver { public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException { Configuration conf = new Configuration(); // conf.set("mapreduce.framework.name","yarn"); // conf.set("yarn.resourcemanager.hostname","server1"); // conf.set("fs.defaultFS","hdfs://server1:9000"); conf.set("mapreduce.framework.name","local"); conf.set("fs.defaultFS","file:///"); Job job = Job.getInstance(conf); // If it's running locally, you don't have to set the path to the jar package, because you don't have to copy the jar elsewhere job.setJarByClass(RJoinDriver.class); // job.setJar("/Users/kris/IdeaProjects/bigdatahdfs/target/rjoin.jar"); job.setMapperClass(RJoinMapper.class); job.setReducerClass(RJoinReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(RJoinInfo.class); job.setOutputKeyClass(RJoinInfo.class); job.setOutputValueClass(NullWritable.class); FileInputFormat.setInputPaths(job,new Path("/Users/kris/Downloads/rjoin/input")); FileOutputFormat.setOutputPath(job,new Path("/Users/kris/Downloads/rjoin/output")); boolean waitForCompletion = job.waitForCompletion(true); System.out.println(waitForCompletion); } }
==The drawback of this implementation is that the join is done entirely in the reduce phase: the reduce side bears most of the processing load while the map nodes stay lightly loaded, resources are under-utilized, and data skew easily occurs in the reduce phase.==
Implementation 2: map-side join
This applies when one of the tables to be joined is small: the small table can be distributed to all map nodes, so each map node can join it locally against the portion of the large table it reads and output the result directly. This greatly improves the parallelism of the join and speeds up processing.
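At its core, a map-side join is just an in-memory hash lookup: load the small table into a map once, then stream the large table through it. A minimal plain-Java sketch (class, method, and data values are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapJoinSketch {
    // Hash join: the small table lives entirely in memory; each record of
    // the large table is joined by a single map lookup, so no shuffle or
    // reduce phase is needed.
    static List<String> join(Map<String, String> products, List<String> orders) {
        List<String> out = new ArrayList<>();
        for (String order : orders) {
            String[] fields = order.split("\t");
            String productInfo = products.get(fields[3]); // productID is the 4th column
            out.add(order + "\t" + productInfo);
        }
        return out;
    }

    public static void main(String[] args) {
        // Small table, as setup() would load it from the cached file
        Map<String, String> products = new HashMap<>();
        products.put("p0001", "Apple\tiphone11 pro max");
        // Large table streamed record by record, as map() would see it
        List<String> orders = Arrays.asList(
                "00001\tkris\tFutian District, Shenzhen\tp0001\t1");
        join(products, orders).forEach(System.out::println);
    }
}
```

The MapReduce version below does the same thing: setup() builds the in-memory map from the cached file, and map() performs the lookup for each order record.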
Write Mapper
On the mapper side we can either load the data ourselves in one step or use the DistributedCache to copy the file to every node running a map task. Here we use the second approach and load the small table inside the Mapper class.
```java
static class RjoinMapper extends Mapper<LongWritable, Text, RJoinInfo, NullWritable> {
    private static Map<String, RJoinInfo> productMap = new HashMap<>();

    // setup() is called once, before map() is called in a loop,
    // so we can read the cached file here in advance
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // These lines print the local absolute path of the cache file, for verification
        URI[] cacheFiles = context.getCacheFiles();
        System.out.println(Arrays.toString(new URI[]{cacheFiles[0]}));
        // The cached file can be opened directly by name: it is placed
        // in the task's working directory by default
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream("products.txt")))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                String[] split = line.split("\t");
                String productID = split[0];
                String productBrand = split[1];
                String productModel = split[2];
                RJoinInfo rJoinInfo = new RJoinInfo();
                rJoinInfo.setProductID(productID);
                rJoinInfo.setProductBrand(productBrand);
                rJoinInfo.setProductModel(productModel);
                rJoinInfo.setFlag(0);
                productMap.put(productID, rJoinInfo);
            }
        }
        super.setup(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String name = fileSplit.getPath().getName();
        if (name.contains("orders")) {
            String line = value.toString();
            String[] split = line.split("\t");
            String orderID = split[0];
            String customerName = split[1];
            String customerAddr = split[2];
            String productID = split[3];
            String orderNum = split[4];
            RJoinInfo rJoinInfo = productMap.get(productID);
            rJoinInfo.setProductID(productID);
            rJoinInfo.setCustomerName(customerName);
            rJoinInfo.setCustomerAddr(customerAddr);
            rJoinInfo.setOrderID(orderID);
            rJoinInfo.setOrderNum(Integer.parseInt(orderNum));
            rJoinInfo.setFlag(1);
            context.write(rJoinInfo, NullWritable.get());
        }
    }
}
```
Code explanation: we override the setup() method, which runs once before map() is called in a loop, so we can load the small table there in advance. In the code above we open products.txt simply by name; how this file gets copied to the map task's node is determined by the driver below.
Write Driver
```java
public class RJoinDemoInMapDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");
        Job job = Job.getInstance(conf);
        job.setJarByClass(RJoinDemoInMapDriver.class);
        job.setMapperClass(RjoinMapper.class);
        job.setOutputKeyClass(RJoinInfo.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/Users/kris/Downloads/rjoin/input"));
        FileOutputFormat.setOutputPath(job, new Path("/Users/kris/Downloads/rjoin/output2"));
        // Cache a file to the working directory of every node that runs a map task.
        // Related methods:
        // job.addFileToClassPath()    caches a plain file on the classpath of the task node
        // job.addArchiveToClassPath() caches a jar on the classpath of the task node
        // job.addCacheArchive()       caches an archive file to the working directory of the task node
        // job.addCacheFile()          caches a plain file to the working directory of the task node
        job.addCacheFile(new URI("/Users/kris/Downloads/rjoin/products.txt"));
        // No reduce phase is needed: set the number of reduce tasks to 0
        job.setNumReduceTasks(0);
        boolean waitForCompletion = job.waitForCompletion(true);
        System.out.println(waitForCompletion);
    }
}
```
Code explanation: in the driver we pass a local URI to job.addCacheFile(); at run time MapReduce copies this file into the working directory of each map task.
Okay, that was a lot of code for one issue, but the main point was to share the ideas behind implementing joins with MapReduce. Next time I'll cover the idea and code for computing mutual friends.
Search the WeChat official account "Xinxin XiCenter" for more resources.