Parsing CSV files that start with a BOM header

Posted by stuartbates on Wed, 22 Dec 2021 21:44:48 +0100

Background

This follows on from the previous post, "Installing and configuring Sftp and accessing it through java". The files we upload are standard CSV files generated by our program. Party B, however, compiles the outbound call results by hand: they create a TXT file and then change the extension to .csv. That causes problems when our program parses their files, most notably the BOM (byte order mark) at the start of the file. They work on Windows, where text editors commonly write a BOM when saving a text file, so a TXT file renamed to CSV carries the BOM along and our parser fails on it. As programmers with some principles and ambition, we are not going to answer with an equally manual workaround. Instead, we make the program itself tolerate CSV files that start with a BOM.

Compatible parsing

Add the dependency

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-csv</artifactId>
  <version>1.5</version>
</dependency>

1. Regular csv file parsing

List<T> resultList = new ArrayList<>();
// try-with-resources closes the parser and the underlying readers, even on failure
try (BufferedReader bufferedReader = new BufferedReader(
             new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
     CSVParser parser = CSVFormat.DEFAULT
             // column names come from the first record; this call overrides any
             // earlier withHeader("a","b"), so that call is dropped here
             .withFirstRecordAsHeader()
             .parse(bufferedReader)) {
    for (CSVRecord record : parser.getRecords()) {
        //transfer record to row
        T row = ...
        log.info("read data from ftp;row={}",row);
        resultList.add(row);
    }
} catch (Exception e) {
    // the original handled every exception type the same way, so one catch is enough
    log.error("occur error;filePath={}",filePath,e);
}

This parses regular CSV files without trouble, but files that start with a BOM are not parsed correctly. CSV is, after all, just plain text, so a file may well be a TXT file whose extension was changed to .csv, or a CSV created by hand on Windows, and either can carry a BOM. Because the reader does not strip it, the BOM is decoded as part of the first field, so the first header name no longer matches what the program expects. If you dump such a file from the command line, you will see garbled characters at the very start of the file.
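To make the failure concrete, here is a minimal sketch that uses only the JDK. The class name BomHeaderDemo and the sample payload are made up for illustration. It decodes a UTF-8 payload that starts with the BOM bytes EF BB BF and shows that the first header name comes back with an invisible U+FEFF character in front of it, so a lookup by the plain name "a" would not match.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomHeaderDemo {

    /** Decodes the payload as UTF-8 and returns the first comma-separated field of the first line. */
    public static String firstHeader(byte[] bytes) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            String headerLine = reader.readLine();
            return headerLine.split(",")[0];
        }
    }

    /** Hypothetical sample: the UTF-8 BOM (EF BB BF) followed by "a,b\n1,2\n". */
    public static byte[] sampleWithBom() {
        byte[] body = "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8);
        byte[] bytes = new byte[body.length + 3];
        bytes[0] = (byte) 0xEF;
        bytes[1] = (byte) 0xBB;
        bytes[2] = (byte) 0xBF;
        System.arraycopy(body, 0, bytes, 3, body.length);
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        String header = firstHeader(sampleWithBom());
        // The JDK's UTF-8 decoder keeps the BOM as the character U+FEFF.
        System.out.println("header length: " + header.length());
        System.out.println("equals \"a\": " + header.equals("a"));
    }
}
```

The header looks like "a" when printed, which is why this kind of failure is easy to miss in logs.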

2. Compatible parsing with BOMInputStream

List<T> resultList = new ArrayList<>();
// try-with-resources closes the parser, the reader and the BOM stream, even on failure
try (BOMInputStream bomInputStream = new BOMInputStream(new ByteArrayInputStream(bytes), false,
             ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_8);
     // use the charset indicated by the BOM, or fall back to UTF-8 when there is none
     BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(bomInputStream,
             Charset.forName(bomInputStream.hasBOM() ? bomInputStream.getBOMCharsetName() : "UTF-8")));
     CSVParser parser = CSVFormat.DEFAULT
             // column names come from the first record; this call overrides any
             // earlier withHeader("a","b"), so that call is dropped here
             .withFirstRecordAsHeader()
             .parse(bufferedReader)) {
    for (CSVRecord record : parser.getRecords()) {
        //transfer record to row
        T row = ...
        log.info("read data from ftp;row={}",row);
        resultList.add(row);
    }
} catch (Exception e) {
    // the original handled every exception type the same way, so one catch is enough
    log.error("occur error;filePath={}",filePath,e);
}

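This works because BOMInputStream (from Apache Commons IO, which must also be on the classpath) detects a BOM at the start of the stream and, with the include flag set to false, excludes it from the bytes handed to the reader.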

3. Compatible parsing with UnicodeReader

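The code is similar to the above; only the reader changes:

// fis is the InputStream of the csv file; "utf-8" is the fallback encoding when no BOM is found
UnicodeReader ur = new UnicodeReader(fis, "utf-8");
bufferedReader = new BufferedReader(ur);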

UnicodeReader detects and skips the BOM automatically by combining a PushbackInputStream with an InputStreamReader. When no BOM is detected, the bytes that were read ahead are pushed back onto the stream and the data is decoded with the encoding passed to the constructor. Otherwise, the encoding indicated by the BOM is used.
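The mechanism can be sketched with JDK classes alone. This is not the UnicodeReader source, only an illustration of the same idea under simplifying assumptions: the class name BomSniffingReaderSketch and its methods are hypothetical, and only the UTF-8 and UTF-16 BOMs are recognized (UTF-32 is left out to keep the sketch short).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffingReaderSketch {

    /** Returns a Reader that skips a leading BOM and uses its charset, or the fallback if there is no BOM. */
    public static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pushback, head);

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }
        // Push back every byte that was read ahead but is not part of a BOM.
        if (n > bomLength) {
            pushback.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pushback, charset);
    }

    /** Reads until the buffer is full or the stream ends; returns the number of bytes read. */
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }

    /** Convenience helper for trying the sketch: decodes a whole byte array to a String. */
    public static String readAll(byte[] bytes, Charset fallback) throws IOException {
        try (Reader reader = open(new ByteArrayInputStream(bytes), fallback)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }
}
```

Wrapping the returned Reader in a BufferedReader and passing it to the CSV parser would then work the same way as in the snippets above.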

Summary

Of approaches 2 and 3 in the previous section, approach 3 is the lighter and more capable one. It is also more transparent: you can take the source code and modify it to meet your needs.

UnicodeReader reference: http://akini.mbnet.fi/java/unicodereader/