SparkSQL insert into PostgreSQL: field format mismatch error handling

Posted by jek on Thu, 20 Jan 2022 04:23:07 +0100

1. Key Error Information

Caused by: org.postgresql.util.PSQLException: ERROR: column "c1" is of type point but expression is of type character
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2553)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2285)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:323)
	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:859) 

A full exception example:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, info3, executor 1): java.sql.BatchUpdateException: Batch entry 0 INSERT INTO sink_pg_newtable_10 ("column1","column2","column3","column4","column5","column6","column7","column8","column9","column10","column11","column13","column14","column15","column16","column17","column18","column19","column20","column21","column22","column23","column24","column25","column27","column28","column29","column30","column31","column32","column33","column34","column35","column36","column37","column38","column39","column40","column41","column42","column44","column45","column46","column47","column48","column49","column50","column51","column52","column53","column54","column55","column56","column57","column58","column59","column60") VALUES (NULL,'2','true','false',NULL,'(1,2),(1,2)',NULL,'1','192','192.168.0.0/32','<(1,1),5>',NULL,'[2020-09-10,2020-10-09)',NULL,NULL,NULL,'192.168.1.32','11','1','11','[2,4)','11','[124,456)','78','11 days 11:11:11','{"a":1,"b":2}','{"a": 1, "b": 2}','$."action"','{1,-1,0}','[(1,1),(2,2)]','08:00:2b:01:02:03','08:00:2b:ff:fe:01:02:03','11.0',NULL,NULL,NULL,NULL,NULL,NULL,NULL,'3','3','3',NULL,'3',NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL) was aborted: ERROR: column "column1" is of type bigint but expression is of type character
	at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:169)
	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:862)
	at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:901)
	at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1644)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:676)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:838)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:838)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.postgresql.util.PSQLException: ERROR: column "column1" is of type bigint but expression is of type character
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2553)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2285)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:323)
	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:859)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:980)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:978)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
	at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:978)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:838)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:68)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	......
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO sink_pg_newtable_10 ("column1","column2","column3","column4","column5","column6","column7","column8","column9","column10","column11","column13","column14","column15","column16","column17","column18","column19","column20","column21","column22","column23","column24","column25","column27","column28","column29","column30","column31","column32","column33","column34","column35","column36","column37","column38","column39","column40","column41","column42","column44","column45","column46","column47","column48","column49","column50","column51","column52","column53","column54","column55","column56","column57","column58","column59","column60") VALUES (NULL,'2','true','false',NULL,'(1,2),(1,2)',NULL,'1','192','192.168.0.0/32','<(1,1),5>',NULL,'[2020-09-10,2020-10-09)',NULL,NULL,NULL,'192.168.1.32','11','1','11','[2,4)','11','[124,456)','78','11 days 11:11:11','{"a":1,"b":2}','{"a": 1, "b": 2}','$."action"','{1,-1,0}','[(1,1),(2,2)]','08:00:2b:01:02:03','08:00:2b:ff:fe:01:02:03','11.0',NULL,NULL,NULL,NULL,NULL,NULL,NULL,'3','3','3',NULL,'3',NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL) was aborted: ERROR: column "column1" is of type bigint but expression is of type character
	at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:169)
	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:862)
	at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:901)
	at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1644)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:676)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:838)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:838)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.postgresql.util.PSQLException: ERROR: column "column1" is of type bigint but expression is of type character
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2553)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2285)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:323)
	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:859)

2. Solutions

There are two situations to consider here:

1) Case one: the inserted field value is not NULL

The root cause is a data type mismatch that pg cannot resolve on its own when the data is inserted. For example, if column c1 is of type point and the value is (1,2), the insert statement is effectively "insert into table (c1) values ('(1,2)');". The value for c1 is sent as varchar by default, and pg cannot convert it to point. The fix is simple: add the parameter stringtype=unspecified to the JDBC connection URL, which makes the driver send string parameters with an unspecified type so that pg itself casts them to the target column type.

   

url=jdbc:postgresql://192.168.1.84:5432/postgres?stringtype=unspecified
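
As a standalone illustration, here is a minimal plain-JDBC sketch (a hypothetical table t_point with a single point column; host and credentials are placeholders). With stringtype=unspecified the bound string is sent with an unspecified type, so the server casts '(1,2)' to point instead of rejecting it as character data:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class StringTypeUnspecifiedDemo {
	public static void main(String[] args) throws Exception {
		// Hypothetical table: CREATE TABLE t_point (c1 point);
		// stringtype=unspecified tells the driver to send string parameters with an
		// unspecified type, so the server coerces '(1,2)' to the point column type.
		String url = "jdbc:postgresql://192.168.1.84:5432/postgres?stringtype=unspecified";
		try (Connection conn = DriverManager.getConnection(url, "postgres", "postgres");
				PreparedStatement stmt = conn.prepareStatement("INSERT INTO t_point (c1) VALUES (?)")) {
			stmt.setString(1, "(1,2)");
			stmt.executeUpdate();
		}
	}
}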

2) Case two: the inserted field value is NULL

This case is caused by the SparkSQL code itself, so let's analyze it. When SparkSQL performs JDBC operations it must convert data types: the types supported by Spark and those supported by the database need to be mapped to each other. This mapping is implemented in per-database dialects, such as the PostgreSQL dialect class org.apache.spark.sql.jdbc.PostgresDialect, which SparkSQL registers and uses via JdbcDialects.registerDialect(PostgresDialect).

Let's analyze the solution with an example. Suppose column c1 is of type point and the value to insert is NULL, so the statement to execute is "insert into table (c1) values (NULL);". Run directly in pg, this statement succeeds as long as the column has no NOT NULL constraint. So why does the Spark insert fail?

Debugging into the code leads to the SparkSQL class org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils, around line 661 in the savePartition() method, where field values are bound to the PreparedStatement. When a value is NULL, stmt.setNull(i + 1, nullTypes(i)) is called to handle it.

while (i < numFields) {
  if (row.isNullAt(i)) {
    stmt.setNull(i + 1, nullTypes(i)) // Note that NULL values are specifically handled here
  } else {
    setters(i).apply(stmt, row, i)
  }
  i = i + 1
}

The crux is that setNull() takes two parameters: the first is the parameter index, and the second is the JDBC type to use for the NULL value, which tells the database how the NULL should be treated when the inserted field value is null. So, in principle, we only need to specify the correct type. But that raises the next question: what is the column's actual type in the database? We cannot know this in advance, because Spark does not support every database type. For example, pg's point type is mapped to Spark's StringType, and it is impossible to infer the database type back from StringType, since many types are converted to StringType, such as path, jsonb and other pg-specific types not covered by generic SQL.
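
To see the difference in isolation, here is a minimal plain-JDBC sketch (hypothetical table name and credentials): binding the NULL as Types.CHAR mirrors what the dialect does and is expected to reproduce the error against a point column, while Types.NULL leaves the parameter type unspecified so the server resolves it against the target column:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Types;

public class SetNullDemo {
	public static void main(String[] args) throws Exception {
		// Hypothetical table: CREATE TABLE t_point (c1 point);
		try (Connection conn = DriverManager.getConnection(
				"jdbc:postgresql://192.168.1.84:5432/postgres", "postgres", "postgres");
				PreparedStatement stmt = conn.prepareStatement("INSERT INTO t_point (c1) VALUES (?)")) {

			// stmt.setNull(1, Types.CHAR); // expected to fail: column "c1" is of type point
			//                              // but expression is of type character

			// Types.NULL leaves the parameter type unspecified, so the server
			// resolves it against the target column and the insert succeeds.
			stmt.setNull(1, Types.NULL);
			stmt.executeUpdate();
		}
	}
}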

Now consider nullTypes(i): nullTypes is an array that stores the JDBC type used for each field's NULL value. Where is it populated? Look at the code:

val nullTypes = rddSchema.fields.map(f => getJdbcType(f.dataType, dialect).jdbcNullType) // The null mapping type for each field is derived from its Spark data type via the dialect's getJdbcType method

From the code above we can see that it comes from the getJDBCType method of the dialect implementation class. Let's look at this method in the pg dialect:

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("TEXT", Types.CHAR)) // Note: c1 maps to StringType, so its NULL value is bound as CHAR, which naturally causes the error
    case BinaryType => Some(JdbcType("BYTEA", Types.BINARY))
    case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
    case FloatType => Some(JdbcType("FLOAT4", Types.FLOAT))
    case DoubleType => Some(JdbcType("FLOAT8", Types.DOUBLE))
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case t: DecimalType => Some(
      JdbcType(s"NUMERIC(${t.precision},${t.scale})", java.sql.Types.NUMERIC))
    case ArrayType(et, _) if et.isInstanceOf[AtomicType] =>
      getJDBCType(et).map(_.databaseTypeDefinition)
        .orElse(JdbcUtils.getCommonJDBCType(et).map(_.databaseTypeDefinition))
        .map(typeName => JdbcType(s"$typeName[]", java.sql.Types.ARRAY))
    case ByteType => throw new IllegalArgumentException(s"Unsupported type in postgresql: $dt");
    case _ => None
  }

From the line case StringType => Some(JdbcType("TEXT", Types.CHAR)) we can see that c1 corresponds to StringType. When the value is NULL, it is bound as CHAR, which is not the actual point type of the column in the database, and that naturally causes the error.

So what can we do? We can bind the NULL with an unspecified type and let pg decide which type to use at insert time, instead of hard-coding a type that may be wrong. To do that, rewrite the PostgresDialect dialect, overriding the getJDBCType method, as follows (Java version):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions;
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcType;
import org.apache.spark.sql.types.ArrayType$;
import org.apache.spark.sql.types.BinaryType$;
import org.apache.spark.sql.types.BooleanType$;
import org.apache.spark.sql.types.ByteType$;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DateType$;
import org.apache.spark.sql.types.DecimalType;
import org.apache.spark.sql.types.DecimalType$;
import org.apache.spark.sql.types.DoubleType$;
import org.apache.spark.sql.types.FloatType$;
import org.apache.spark.sql.types.IntegerType$;
import org.apache.spark.sql.types.LongType$;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.ShortType$;
import org.apache.spark.sql.types.StringType$;
import org.apache.spark.sql.types.TimestampType$;

import scala.None$;
import scala.Option;
import scala.Option$;
import scala.Some;
import scala.collection.JavaConverters;
import scala.collection.immutable.Map;

public class PostgresDialect extends JdbcDialect {
	private static final long serialVersionUID = -5826284056572945657L;

	@Override
	public boolean canHandle(String url) {
		return url.startsWith("jdbc:postgresql:");
	}
	
	@SuppressWarnings({ "rawtypes", "unchecked" })
	@Override
	public Option getCatalystType(int sqlType, String typeName, int size, MetadataBuilder md) {
		if (sqlType == Types.REAL) {
			return Option$.MODULE$.apply(FloatType$.MODULE$);
		} else if (sqlType == Types.SMALLINT) {
			return Option$.MODULE$.apply(ShortType$.MODULE$);
		} else if (sqlType == Types.BIT && typeName.equals("bit") && size != 1) {
			return Option$.MODULE$.apply(BinaryType$.MODULE$);
		} else if (sqlType == Types.OTHER) {
			return Option$.MODULE$.apply(StringType$.MODULE$);
		} else if (sqlType == Types.ARRAY) {
			int scale = (int) md.build().getLong("scale");
			return Option$.MODULE$.apply(ArrayType$.MODULE$.apply(toCatalystType(typeName, size, scale).get()));
		} else {
			return None$.MODULE$;
		}
	}

	private Option<DataType> toCatalystType(String typeName, int precision, int scale) {
		switch (typeName) {
		case "bool":
			return Option$.MODULE$.apply(BinaryType$.MODULE$);
		case "int2":
			return Option$.MODULE$.apply(ShortType$.MODULE$);
		case "int4":
			return Option$.MODULE$.apply(IntegerType$.MODULE$);
		case "int8":
		case "oid":
			return Option$.MODULE$.apply(LongType$.MODULE$);
		case "float4":
			return Option$.MODULE$.apply(FloatType$.MODULE$);
		case "money":
		case "float8":
			return Option$.MODULE$.apply(DoubleType$.MODULE$);
		case "text":
		case "varchar":
		case "char":
		case "cidr":
		case "inet":
		case "json":
		case "jsonb":
		case "uuid":
			return Option$.MODULE$.apply(StringType$.MODULE$);
		case "bytea":
			return Option$.MODULE$.apply(BinaryType$.MODULE$);
		case "timestamp":
		case "timestamptz":
		case "time":
		case "timetz":
			return Option$.MODULE$.apply(TimestampType$.MODULE$);
		case "date":
			return Option$.MODULE$.apply(DateType$.MODULE$);
		case "numeric":
		case "decimal":
			if (precision > 0) {
				return Option$.MODULE$.apply(new DecimalType(Math.min(precision, DecimalType$.MODULE$.MAX_PRECISION()),
						Math.min(scale, DecimalType$.MODULE$.MAX_SCALE())));
			} else {
				return Option$.MODULE$.apply(new DecimalType(DecimalType$.MODULE$.MAX_PRECISION(), 18));
			}
		default:
			// Return an empty Option rather than null so callers do not hit a NullPointerException.
			return Option$.MODULE$.empty();
		}

	}

	@SuppressWarnings({ "unchecked", "rawtypes" })
	@Override
	public Option<JdbcType> getJDBCType(DataType dt) {
		Object obj;
		DataType datatype = dt;
		if (StringType$.MODULE$.equals(datatype)) {
			obj = new Some(new JdbcType("TEXT", Types.NULL)); // Change to Types.NULL
		} else if (BinaryType$.MODULE$.equals(datatype)) {
			obj = new Some(new JdbcType("BYTEA", Types.BINARY));
		} else if (BooleanType$.MODULE$.equals(datatype)) {
			obj = new Some(new JdbcType("BOOLEAN", Types.BOOLEAN));
		} else if (FloatType$.MODULE$.equals(datatype)) {
			obj = new Some(new JdbcType("FLOAT4", Types.FLOAT));
		} else if (DoubleType$.MODULE$.equals(datatype)) {
			obj = new Some(new JdbcType("FLOAT8", Types.DOUBLE));
		} else if (ShortType$.MODULE$.equals(datatype)) {
			obj = new Some(new JdbcType("SMALLINT", Types.SMALLINT));
		} else if (datatype instanceof DecimalType) {
			// Mirror the Scala dialect: NUMERIC(precision,scale); "NUMBER" is not a PostgreSQL type.
			DecimalType t = (DecimalType) datatype;
			obj = new Some(new JdbcType("NUMERIC(" + t.precision() + "," + t.scale() + ")", Types.NUMERIC));
		} else if (ByteType$.MODULE$.equals(datatype)) {
			throw new IllegalArgumentException("Unsupported type in postgresql:" + dt);
		} else {
			obj = None$.MODULE$;
		}
		return ((Option) (obj));
	}

	@Override
	public String getTableExistsQuery(String table) {
		return "SELECT 1 FROM " + table + " LIMIT 1";
	}

	@Override
	public String getTruncateQuery(String table, Option<Object> cascade) {
		// cascade may be empty; only append CASCADE when it is explicitly true.
		if (cascade.isDefined() && Boolean.parseBoolean(cascade.get().toString())) {
			return "TRUNCATE TABLE ONLY " + table + " CASCADE";
		}
		return "TRUNCATE TABLE ONLY " + table;

	}

	@Override
	public void beforeFetch(Connection connection, Map<String, String> properties) {
		super.beforeFetch(connection, properties);
		java.util.Map<String, String> javaMap = JavaConverters.mapAsJavaMapConverter(properties).asJava();
		String stringOption = javaMap.get(JDBCOptions.JDBC_BATCH_FETCH_SIZE());
		// The fetch-size option may be absent from the map; guard against null before parsing.
		if (stringOption != null && !stringOption.isEmpty() && Integer.parseInt(stringOption) > 0) {
			try {
				connection.setAutoCommit(false);
			} catch (SQLException e) {
				throw new RuntimeException(e);
			}
		}
	}

}

The key change is here:

if (StringType$.MODULE$.equals(datatype)) {
	obj = new Some(new JdbcType("TEXT", Types.NULL)); // Change to Types.NULL
}

Then unregister the built-in dialect and register the custom one:

JdbcDialects.unregisterDialect(org.apache.spark.sql.jdbc.PostgresDialect$.MODULE$); //Unload the original dialect implementation
JdbcDialects.registerDialect(new PostgresDialect()); //Register custom dialect implementations

//Data manipulation logic
 ........
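
Putting the two fixes together, a minimal end-to-end sketch might look like the following (the source table, sink table, and credentials are placeholders; PostgresDialect is the custom class shown above):

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;

public class PgSinkJob {
	public static void main(String[] args) {
		// Swap in the custom dialect before any JDBC write is executed.
		JdbcDialects.unregisterDialect(org.apache.spark.sql.jdbc.PostgresDialect$.MODULE$);
		JdbcDialects.registerDialect(new PostgresDialect());

		SparkSession spark = SparkSession.builder().appName("pg-sink").getOrCreate();
		Dataset<Row> df = spark.table("source_table"); // placeholder source

		Properties props = new Properties();
		props.setProperty("user", "postgres");      // placeholder credentials
		props.setProperty("password", "postgres");
		props.setProperty("driver", "org.postgresql.Driver");

		// stringtype=unspecified handles non-NULL string values (case one);
		// the custom dialect's Types.NULL handles NULL values (case two).
		String url = "jdbc:postgresql://192.168.1.84:5432/postgres?stringtype=unspecified";
		df.write().mode(SaveMode.Append).jdbc(url, "sink_pg_newtable_10", props);

		spark.stop();
	}
}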

Disclaimer: there are some inconsistent definitions and names in this article; please do not read too much into those details. The examples are only meant to illustrate the approach and are not required to correspond exactly.

Topics: PostgreSQL Spark J2EE