Java - access and download of network resources based on URL and IO stream

Posted by whisher06 on Fri, 04 Feb 2022 13:02:13 +0100

     The IO stream is the basic unit for file IO in Java SE. On the local machine, the File class and the IO streams together support basic operations such as creating, deleting, copying and renaming files. The JDK source code shows that the File class also implements the Serializable interface, but serializing a File object only transfers its abstract path name, not the content of the file. Transferring file content over the network (upload / download) is done by reading and writing IO streams that are bound to a network connection instead of a local file.
     For example, the java.net package and the URLConnection class can be combined with IO streams to implement a simple download function for network resources.
     The following first introduces the definition of File and how access paths are represented in a Java program; secondly, it briefly describes the relationship between network resources and files, and between a URL and a file path; thirdly, it shows how to download a network resource to the local machine through URLConnection; finally, it uses HttpURLConnection to simulate a browser searching for movie resources.

1 Analysis of the File class documentation comment

   The documentation comment in the JDK source code describes the File class as an abstract (logical) representation of (physical) file and directory path names.
   		User interfaces and operating systems use system-dependent pathname strings to name files and directories.
   		The File class presents an abstract, system-independent view of hierarchical path names. An abstract path name has two components:
   			[1] An optional system-dependent prefix string, such as a disk-drive specifier, "/" for the UNIX root directory, or "\\\\" for a Microsoft Windows UNC path name
   			[2] A sequence of zero or more string names
   			For example:
   			F:\Java-dependencies\apache-tomcat-9.0.43\apache-tomcat-9.0.43\conf\server.xml
   		--> The first name in an abstract path name may be a directory name (Java-dependencies) or, for Microsoft Windows UNC path names, a host name. Each subsequent name denotes a directory, and the last name may denote either a directory or a file (server.xml).
   		By analogy, a URL such as
   			https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json
   		has the same hierarchical shape: a host name (geo.datav.aliyun.com), then one name per level (areas_v3, bound), and finally a file name (100000_full.json). Note that java.io.File itself does not interpret URLs; the comparison is only about the structure of the path string.

In actual code, a File object can be created with the new keyword even if no such path or file exists on the disk. This is a direct consequence of the abstraction and system independence of File at the logical level.
At the physical level, however, moving, copying, deleting or renaming a file requires that the corresponding path really exists on the disk, because underneath the JVM the user interface and the operating system use system-dependent pathname strings to name files and directories.
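The difference between the logical and the physical level can be checked with a minimal sketch. The class name AbstractPathDemo and the path used here are hypothetical, chosen only because such a path is very unlikely to exist on any machine.

```java
import java.io.File;

class AbstractPathDemo {
    // Builds a File for a path that is assumed not to exist on disk.
    static File missingFile() {
        // Hypothetical directory and file names, used for illustration only.
        return new File("no-such-directory-for-demo", "no-such-file.xml");
    }

    public static void main(String[] args) {
        File file = missingFile();
        // Creating the object and reading its names never touches the disk.
        System.out.println("name   = " + file.getName());
        System.out.println("parent = " + file.getParent());
        // Only operations that need the physical file consult the file system.
        System.out.println("exists = " + file.exists());
        System.out.println("delete = " + file.delete());
    }
}
```

The name and parent are answered from the path string alone, while exists() and delete() report false because there is nothing on the disk to act on.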
File paths can also be divided into relative and absolute paths, which is not studied here. Just be clear that the string used to locate a resource in a Java program can take either of two forms, a local file path (handled by java.io.File) or a URL (handled by java.net.URL):

Local file path: F:\Java-dependencies\apache-tomcat-9.0.43\apache-tomcat-9.0.43\conf\server.xml
 URL: https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json

2 Relationship between network resources and files

According to the analysis in section 1, a path can be written abstractly in the form https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json. In essence, then, a network resource is an ordinary file stored on a server, such as a pdf, xml, png or mp4 file.
Local files come in different types, which can be distinguished by encoding (GBK, UTF-8, ...) and by suffix (*.pdf, *.jpg, *.xml, *.png, *.mp3, ...). Network resources come in different types as well. How are the types of network resources distinguished?

In HTTP (Hypertext Transfer Protocol), the type of a network resource is identified by the Content-Type response header, which tells the client the content type of the resource that the server is returning. In a servlet, it can be set with the setContentType(String) method of ServletResponse, and it determines how the browser reads and decodes the resource (file). For example, a Content-Type of text/html indicates that the resource being accessed is an HTML file.

The value of the Content-Type header is called a MIME type (i.e. media type). More MIME type values can be found at https://www.runoob.com/http/http-content-type.html.
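A Content-Type value often carries a charset parameter next to the MIME type, for example text/html; charset=UTF-8. The following is a minimal sketch of how a client might split such a value. The class ContentTypeDemo and its two helper methods are hypothetical and only use plain string handling; they are not part of the JDK or of the servlet API.

```java
import java.util.Locale;

class ContentTypeDemo {
    // Returns the media type part, e.g. "text/html" from "text/html; charset=UTF-8".
    static String mediaType(String contentType) {
        int semicolon = contentType.indexOf(';');
        String type = semicolon == -1 ? contentType : contentType.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the charset parameter, or null when the header value has none.
    static String charset(String contentType) {
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=UTF-8"; // sample header value
        System.out.println("media type = " + mediaType(header));
        System.out.println("charset    = " + charset(header));
    }
}
```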

3 Network resources and the java.net.URL class

The above shows that a network resource is essentially an ordinary file that lives on a server. The type of this file is described by a MIME media type, which HTTP carries in the Content-Type header as part of the standard. So, how is such a network resource accessed?

3.1 path representation of network resources

The path of the network resource is represented by a URL.
A URL (Uniform Resource Locator) locates a resource. Baidu explains it as follows:

	On the WWW (World Wide Web), each network resource has a unified and unique address on the Internet. This address is called a URL (Uniform Resource Locator); it is the uniform resource location mark of the World Wide Web, i.e. the network address.

Concretely, a URL is the string that we usually see in the address bar of a browser window.

A URL consists of four main parts: protocol, host, port and path. The general syntax is:

protocol://hostname[:port]/path[;parameters][?query][#fragment]
 For example:
	https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json
 Then:
	protocol: https
	hostname: geo.datav.aliyun.com (a domain name server resolves this host name to the IP address of the server)
	port: not specified, so the default port of the protocol is used (443 for https)
	path: /areas_v3/bound/100000_full.json
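The breakdown above can be reproduced with java.net.URI, which exposes one getter per part. This is only a sketch: the class UrlPartsDemo is hypothetical, and the second URL (example.com with a port, query and fragment) is an invented sample used to show the optional parts.

```java
import java.net.URI;

class UrlPartsDemo {
    // Lists the parts of a URL string in the order of the general syntax.
    static String describe(String text) {
        URI uri = URI.create(text);
        return "protocol=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", port=" + uri.getPort()
                + ", path=" + uri.getPath()
                + ", query=" + uri.getQuery()
                + ", fragment=" + uri.getFragment();
    }

    public static void main(String[] args) {
        // The URL used throughout this article: no port, query or fragment.
        System.out.println(describe("https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json"));
        // Invented sample URL that uses every optional part.
        System.out.println(describe("http://example.com:8080/docs/index.html?lang=en#top"));
    }
}
```

A missing port is reported as -1, and a missing query or fragment as null.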

An online domain name resolution tool can also be used to look up the IP address behind a host name.

3.2 java.net.URL class

In the Java programming language, a URL is abstracted as the java.net.URL class. The relevant parts of its documentation comment read as follows:

	The URL class represents a Uniform Resource Locator, a pointer to a "resource" on the World Wide Web.
	A resource can be something as simple as a file or a directory (abstracted in Java as the File class), or
 it can be a reference to a more complicated object, such as a query to a database or to a search engine.
	The port number of a URL is optional. If it is not specified, the default port of the protocol is used instead, for example 80 for http and 443 for https.
	The URL class can split a URL string into the parts listed in 3.1 and open a connection to the resource. The URL class does not do this work itself; it delegates to an internally maintained URLStreamHandler (an abstract class).
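The rule for the optional port can be sketched as follows. The class DefaultPortDemo and the helper effectivePort are hypothetical names, and example.com is only a placeholder host; no connection is opened, the URL is merely parsed.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class DefaultPortDemo {
    // Returns the port a client would connect to for the given URL string.
    static int effectivePort(String text) {
        try {
            URL url = URI.create(text).toURL();
            // getPort() is -1 when the URL does not name a port explicitly;
            // getDefaultPort() then gives the default port of the protocol.
            return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(effectivePort("http://example.com/"));
        System.out.println(effectivePort("https://example.com/"));
        System.out.println(effectivePort("http://example.com:8080/"));
    }
}
```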

The following code obtains basic information about the URL https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json.

package com.xwd.demo;

import java.net.MalformedURLException;
import java.net.URL;

public class URLDemo{
    //methods
    public static void main(String[] args) {
        try {
            //Parsing a URL does not open a network connection
            URL url=new URL("https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json");
            System.out.println(url.toString());
            //Get communication protocol
            String protocol = url.getProtocol();
            System.out.println("protocol="+protocol);
            //Get host name
            String host = url.getHost();
            System.out.println("host="+host);
            //Get port number: getPort() returns -1 when the URL does not specify one
            int port = url.getPort();
            int defaultPort = url.getDefaultPort();
            System.out.println("port:"+port+",defaultPort:"+defaultPort);
            //Get the query string (null if absent)
            String query = url.getQuery();
            System.out.println("query="+query);
            //Get user information (null if absent)
            String userInfo = url.getUserInfo();
            System.out.println("userInfo="+userInfo);
            //Get the anchor, i.e. the part after # (null if absent)
            String ref = url.getRef();
            System.out.println("ref="+ref);
            //Get the authority (user information, host and port)
            String authority = url.getAuthority();
            System.out.println("authority="+authority);
            //Get file name (path plus query string, if any)
            String file = url.getFile();
            System.out.println("file="+file);
            //Get path
            String path = url.getPath();
            System.out.println("path="+path);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    }
}

4 java.net.URLConnection and simple reading and writing of network resources

4.1 Introduction to URLConnection

Based on the above, the file name and path of a network resource can already be obtained from the URL that points to it. So how are network resources read and written, for example to download a network resource to the local machine through its URL?
As with file IO, IO on a network resource first needs a connection object (URLConnection) between the client and the server that hosts the resource; all IO operations are then carried out through this connection object. The basic principle is shown in the figure below.

Java provides java.net.URLConnection to represent the connection channel between the client and a network resource. Its documentation comment can be summarized as follows:

	The abstract class URLConnection is the superclass of all classes that represent a communications link between the application and a URL.
	An object of this class is obtained by calling the openConnection() method of a URL, and it can be used both to read from and to write to the resource that the URL points to.
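Because a URL can also point to a local file (the file: protocol), the same URLConnection API can be tried without any network access. The following is a sketch under that assumption; the class UrlConnectionReadDemo and its helpers readAll and roundTrip are hypothetical names, and the content is assumed to be UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UrlConnectionReadDemo {
    // Reads the whole resource behind the URL into a string (UTF-8 is assumed).
    static String readAll(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        try (InputStream in = connection.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[1024];
            int len;
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Writes a temporary file and reads it back through its file: URL.
    static String roundTrip(String content) {
        try {
            Path temp = Files.createTempFile("url-demo", ".txt");
            try {
                Files.write(temp, content.getBytes(StandardCharsets.UTF_8));
                return readAll(temp.toUri().toURL());
            } finally {
                Files.deleteIfExists(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello from a file: URL"));
    }
}
```

Passing an http or https URL to readAll would go through the same calls; only the protocol handler behind openConnection() differs.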

4.2 Simple download of network resources with URLConnection

The example code of downloading network resources is as follows:

package com.xwd.demo;

import java.io.*;
import java.net.URL;
import java.net.URLConnection;

/**
 * @ClassName IODemo
 * @Description: com.xwd.demo
 * @Author: xiwd
 * @Date: 2022/2/4 - 02 - 04 - 16:27
 * @version: 1.0
 */
public class IODemo {
    //methods
    public static void main(String[] args) {
        byte[] buffer=new byte[1024];
        int len;
        try {
            //Provide URL - network resource locator
            URL url = new URL("https://geo.datav.aliyun.com/areas_v3/bound/100000_full.json");
            //Get the name of the network resource from the last segment of the path
            //(getPath() is used because getFile() would also include a query string)
            String path = url.getPath();
            String filename = path.substring(path.lastIndexOf('/')+1);
            //Get the connection object between the client and the URL
            URLConnection connection = url.openConnection();
            //try-with-resources closes both streams automatically, even if the copy fails
            //The input stream reads the network resource; the output stream specifies
            //the local storage location (a file in the current working directory)
            try (InputStream inputStream = connection.getInputStream();
                 OutputStream outputStream = new FileOutputStream(filename)) {
                //Download the network resource
                while ((len = inputStream.read(buffer)) != -1) {
                    outputStream.write(buffer,0,len);
                }
            }
            System.out.println("SUCCESS");
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("FAILED");
        }
    }
}
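The copy loop in the download example can also be delegated to java.nio.file.Files.copy, which reads an InputStream to its end and writes it to a target path. The following is a sketch of that alternative, not a replacement for the code above; the class FilesCopyDownloadDemo and its methods are hypothetical names, and the self-check uses a temporary local file behind a file: URL so that it runs without network access.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class FilesCopyDownloadDemo {
    // Copies the resource behind the URL to target and returns the byte count.
    static long download(URL url, Path target) throws IOException {
        try (InputStream in = url.openConnection().getInputStream()) {
            // REPLACE_EXISTING overwrites a previous download of the same name.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Self-check without the network: the source is a temp file behind a file: URL.
    static long copyViaFileUrl(String content) {
        try {
            Path source = Files.createTempFile("download-src", ".txt");
            Path target = Files.createTempFile("download-dst", ".txt");
            try {
                Files.write(source, content.getBytes(StandardCharsets.UTF_8));
                return download(source.toUri().toURL(), target);
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes copied = " + copyViaFileUrl("abc"));
    }
}
```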

4.3 HttpURLConnection

HttpURLConnection is an abstract class and a subclass of the abstract class URLConnection. An object of this class can be used to send GET and POST requests to a specified website. Each HttpURLConnection instance is used for a single request, but the underlying network connection to the HTTP server may be transparently shared by several instances. After a request, calling close() on the InputStream or OutputStream of an HttpURLConnection may release the network resources associated with that instance, but it has no effect on any shared persistent connection. Calling disconnect() may close the underlying socket if a persistent connection is otherwise idle at that time.
On top of URLConnection, it provides the following convenient methods:

int getResponseCode(); // Gets the status code of the server's response.
String getResponseMessage(); // Gets the response message of the server.
String getRequestMethod(); // Gets the request method.
void setRequestMethod(String method); // Sets the request method.
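To try these methods without depending on an external website, one option is to start a throwaway HTTP server inside the same program. The sketch below assumes the JDK's built-in com.sun.net.httpserver.HttpServer is available (it ships with standard JDK distributions); the class ResponseInfoDemo, the /hello path and the response body are invented for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class ResponseInfoDemo {
    // Starts a local server, sends one GET request, and summarizes the reply.
    static String requestOnce() {
        HttpServer server = null;
        try {
            // Port 0 lets the operating system pick a free port.
            server = HttpServer.create(
                    new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0), 0);
            server.createContext("/hello", exchange -> {
                byte[] body = "hi".getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();

            int port = server.getAddress().getPort();
            URL url = URI.create("http://127.0.0.1:" + port + "/hello").toURL();
            // NO_PROXY keeps the request local even if a system proxy is configured.
            HttpURLConnection connection = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            connection.setRequestMethod("GET");
            try {
                // getResponseCode() sends the request if that has not happened yet.
                return connection.getRequestMethod() + " -> "
                        + connection.getResponseCode() + " "
                        + connection.getResponseMessage() + ", "
                        + connection.getContentType();
            } finally {
                connection.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(requestOnce());
    }
}
```

The summary line combines the request method, the status code and message, and the Content-Type header discussed in section 2.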

4.4 Simulating a browser search for movie resources with HttpURLConnection

The example code is as follows:
Note: the response to the GET request is the source code of an HTML page, which is too long to print here. Later, the dom4j JAR package can be used on this HTML source to crawl information from the page.

package com.xwd.demo;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * @ClassName IODemo
 * @Description: com.xwd.demo
 * @Author: xiwd
 * @Date: 2022/2/4 - 02 - 04 - 16:27
 * @version: 1.0
 */
public class IODemo {
    //methods
    public static void main(String[] args) {
        HttpURLConnectionTest();
    }

	//Decoded form of the URL requested below: http://www.sdpxgd.com/search.php?searchword=画江湖 ("Painting Jianghu")
    private static void HttpURLConnectionTest(Object... args) {
        URL url= null;
        HttpURLConnection connection =null;
        InputStream inputStream=null;
        BufferedReader reader=null;

        try{
            //Get URL object
            url = new URL("http://www.sdpxgd.com/search.php?searchword=%E7%94%BB%E6%B1%9F%E6%B9%96");
            //Get HttpURLConnection object
            connection = (HttpURLConnection) url.openConnection();
            //Set request parameters
            connection.setDoOutput(false);//No request body is written to the connection (a GET request has none)
            connection.setDoInput(true);//The response will be read from the connection
            connection.setRequestMethod("GET");//Set the request method
            connection.setUseCaches(true);//Allow the protocol to use a cache if one is available
            connection.setInstanceFollowRedirects(true);//Follow HTTP redirects automatically
            connection.setConnectTimeout(3000);//Set the connect timeout in milliseconds
            //Execute connection
            connection.connect();
            //Get status code
            int responseCode = connection.getResponseCode();
            //Read the response body
            StringBuilder msg = new StringBuilder();
            if (responseCode == HttpURLConnection.HTTP_OK) {
                //Get input stream object
                inputStream = connection.getInputStream();
                //No charset is given here, so the platform default charset is used for decoding
                reader = new BufferedReader(new InputStreamReader(inputStream));
                //Read the body line by line
                String line;
                while ((line = reader.readLine()) != null) {
                    msg.append(line).append("\n");
                }
            }
            //Print the query result
            //System.out.println(msg);// This would print the HTML code of the search results page. It is left commented out because the output is very long

            //Print response header information
            Map<String, List<String>> headerFields = connection.getHeaderFields();
            Set<Map.Entry<String, List<String>>> entries = headerFields.entrySet();
            Iterator<Map.Entry<String, List<String>>> iterator = entries.iterator();
            while (iterator.hasNext()) {
                System.out.println(iterator.next().toString());
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (reader!=null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            //Disconnect
            if (connection!=null) {
                connection.disconnect();
            }
        }

    }
}
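The searchword parameter in the request URL above is percent-encoded because a query string cannot carry Chinese characters directly. The following sketch shows how such a value can be produced with java.net.URLEncoder, assuming the site expects UTF-8. The class SearchWordEncodingDemo is a hypothetical name, and the search word is written with Unicode escapes so that the source file stays pure ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class SearchWordEncodingDemo {
    // Percent-encodes a query parameter value as UTF-8.
    static String encode(String word) {
        try {
            return URLEncoder.encode(word, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Reverses encode(): turns a percent-encoded value back into text.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The three characters of the search word, as Unicode escapes.
        String word = "\u753B\u6C5F\u6E56";
        String encoded = encode(word);
        System.out.println(encoded);
        System.out.println("http://www.sdpxgd.com/search.php?searchword=" + encoded);
    }
}
```

Note that URLEncoder follows the HTML form encoding rules, so a space in the search word becomes a plus sign rather than %20.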
