Oct-11-2021, 04:04 AM
I want my PySpark code to use parallel connections to the database when inserting into a table, but it does not.
I have tried splitting the DataFrame and also set the numPartitions attribute in the write call, but neither helps.
The following code works and writes to the table, but over a single database connection.
import os
import io
import findspark
import pandas as pd
import boto3
import awswrangler as wr
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "25g") \
    .appName('my-cool-app') \
    .getOrCreate()
myDF = spark.read.format('jdbc').options(
    url='jdbc:redshift://hostname.com:5439/dev',
    driver='com.amazon.redshift.jdbc42.Driver',
    dbtable='schema1.table1',
    user='awsuser',
    password='securepassword').load()
myDF.count()
myDF_part = myDF.repartition(16)
myDF_part.write.format('jdbc').options(
    url='jdbc:oracle:thin:@oraclehost:1521/iINST1',
    driver='oracle.jdbc.driver.OracleDriver',
    dbtable='test',
    batchsize=10000,
    numPartitions=16,
    user='someuser',
    password='somepassword').mode('append').save()
