Meaning of the spark.sql.hive.caseSensitiveInferenceMode parameter in Spark

Posted by kate_rose on Fri, 12 Nov 2021 18:22:59 +0100

This post reviews and summarizes the meaning and use of Spark's spark.sql.hive.caseSensitiveInferenceMode parameter.

1. Parameter meaning
Spark 2.1.1 introduced a new configuration option, spark.sql.hive.caseSensitiveInferenceMode. Its default value was NEVER_INFER, which kept the behavior identical to Spark 2.1.0. Spark 2.2.0 changed the default to INFER_AND_SAVE to restore compatibility with reading Hive metastore tables whose underlying file schema contains mixed-case column names. With INFER_AND_SAVE, Spark performs schema inference on first access for any Hive metastore table that does not yet have a saved inferred schema. Note that schema inference can be very time-consuming for tables with thousands of partitions. If compatibility with mixed-case column names is not a concern, you can safely set spark.sql.hive.caseSensitiveInferenceMode to NEVER_INFER to avoid the initial overhead of schema inference. With the new default INFER_AND_SAVE, the result of schema inference is saved as a metastore key for future use, so the initial schema inference occurs only the first time the table is accessed.
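As a minimal sketch of how the option might be set, the snippet below passes it when the session is built and then overrides it for the current session. The application name is a placeholder, and the snippet assumes Spark with Hive support (the spark-hive module) is on the classpath, for example in spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the application name is a placeholder.
val spark = SparkSession.builder()
  .appName("inference-mode-demo")
  .enableHiveSupport()
  // Skip schema inference and rely on the schema stored in the metastore.
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .getOrCreate()

// The option is an ordinary SQL conf, so it can also be changed per session.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_ONLY")
println(spark.conf.get("spark.sql.hive.caseSensitiveInferenceMode"))
```

The same key can also be placed in spark-defaults.conf or passed with --conf on spark-submit.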

Since Spark 2.2.1 and 2.3.0, the schema is always inferred at run time when a data source table has columns that exist in both the partition schema and the data schema. The inferred schema does not contain the partition columns. When reading the table, Spark uses the partition values of these overlapping columns rather than the values stored in the data source files. In the 2.2.0 and 2.1.x releases, the inferred schema is partitioned, but the table data is invisible to users (that is, the result set is empty).

In Spark 2.4 and earlier, when reading a Hive SerDe table with Spark's native data sources (parquet/orc), Spark infers the actual file schema and updates the table schema in the metastore. Starting with Spark 3.0, Spark no longer infers the schema. This should not cause any problems for end users, but if it does, set spark.sql.hive.caseSensitiveInferenceMode to INFER_AND_SAVE.

Across Spark versions, the default behavior of spark.sql.hive.caseSensitiveInferenceMode has therefore changed twice: at first the schema was never inferred (2.1.x), then it was inferred at run time (2.2.0 through 2.4.x), and finally, from 3.0 onward, it is no longer inferred.
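The version history above can be captured in a small lookup. This is only a sketch: `DefaultInferenceMode` and `defaultFor` are hypothetical names, not part of Spark, and the mapping assumes the defaults exactly as summarized in this post (with Spark 3.0's "no longer infers" treated as NEVER_INFER).

```scala
object DefaultInferenceMode {
  // Hypothetical helper (not a Spark API): returns the default mode this post
  // describes for a given Spark version string, or None if it is not covered.
  def defaultFor(version: String): Option[String] = {
    // Keep only the leading digits of the major and minor components.
    val nums = version.split("\\.").take(2).map(_.takeWhile(_.isDigit))
    if (nums.length < 2 || nums.exists(_.isEmpty)) None
    else (nums(0).toInt, nums(1).toInt) match {
      // 2.1.0 has no such option but behaves the same way as NEVER_INFER.
      case (2, 1)                   => Some("NEVER_INFER")
      case (2, minor) if minor >= 2 => Some("INFER_AND_SAVE")
      case (major, _) if major >= 3 => Some("NEVER_INFER")
      case _                        => None
    }
  }
}
```

A job that runs on several clusters could log `DefaultInferenceMode.defaultFor(spark.version)` at startup to make the effective default explicit.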

Note: set this parameter according to your Spark version to avoid unexpected results, such as extremely slow reads of Hive tables or out-of-memory errors.

2. Source code
spark.sql.hive.caseSensitiveInferenceMode has three modes: INFER_AND_SAVE (the default in Spark 2.4.x), INFER_ONLY, and NEVER_INFER.

INFER_AND_SAVE infers the case-sensitive schema from the underlying data files and writes it back to the Hive table properties.

INFER_ONLY infers the schema but does not attempt to write it to the table properties.

NEVER_INFER falls back to the case-insensitive schema stored in the metastore instead of inferring.

When a case-sensitive schema cannot be read from the Hive table properties, Spark's behavior depends on this parameter.

Although Spark SQL itself is not case-sensitive, Hive-compatible file formats such as Parquet are.

When Spark queries a table backed by files that contain case-sensitive column names, it must use a case-preserving schema; otherwise the results may be inaccurate.
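To make the three modes concrete, here is a simplified model of the fallback decision. It is a sketch under stated assumptions, not Spark's implementation: `Table`, `schemaFor`, and `readable` are hypothetical names, schemas are reduced to lists of column names, and file lookups are modeled as strictly case-sensitive.

```scala
object CaseSensitiveSchemaSketch {
  object Mode extends Enumeration {
    val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
  }

  // Simplified stand-in for a Hive table: the metastore keeps lower-cased
  // column names; savedSchema models a case-sensitive schema stored in the
  // table properties, if one exists.
  final case class Table(
      metastoreColumns: Seq[String],
      fileColumns: Seq[String],
      savedSchema: Option[Seq[String]] = None)

  // Returns the schema used for reading plus the (possibly updated) table.
  def schemaFor(table: Table, mode: Mode.Value): (Seq[String], Table) =
    table.savedSchema match {
      // A saved case-sensitive schema exists, so nothing needs to be inferred.
      case Some(schema) => (schema, table)
      case None => mode match {
        case Mode.NEVER_INFER => (table.metastoreColumns, table)
        case Mode.INFER_ONLY  => (table.fileColumns, table)
        case Mode.INFER_AND_SAVE =>
          (table.fileColumns, table.copy(savedSchema = Some(table.fileColumns)))
      }
    }

  // Columns that a strictly case-sensitive, by-name lookup would find.
  def readable(schema: Seq[String], fileColumns: Seq[String]): Seq[String] =
    schema.filter(name => fileColumns.contains(name))
}
```

For a table with metastore columns `userid, ts` and file columns `userId, ts`, NEVER_INFER in this model finds only `ts`, while both inferring modes read with `userId, ts`; only INFER_AND_SAVE stores the result, so later reads skip inference.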

// Excerpt from object SQLConf (org.apache.spark.sql.internal), Spark 2.4.x.
object HiveCaseSensitiveInferenceMode extends Enumeration {
  val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}

val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
  .doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
    "table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
    "formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
    "any table backed by files containing case-sensitive field names or queries may not return " +
    "accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
    "case-sensitive schema from the underlying data files and write it back to the table " +
    "properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
    "properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
    "instead of inferring).")
  .stringConf
  // Values are normalized to upper case and validated against the enumeration.
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
  .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)


Topics: hive Spark SQL