Testing with PySpark
This isn't about the details of PySpark. It is about the philosophy of testing when working with a large, complex framework such as PySpark, pandas, or NumPy.
BLUF
Use small subsets of the data.
Write unit tests for the functions that process the data.
Don't test PySpark itself. Test the code you write.
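As a rough illustration of these three points, here is a minimal sketch. It assumes the transformation logic lives in a plain Python function, and that pytest (or any runner that collects `test_` functions) runs the test. The names `clean_record` and `SAMPLE_RECORDS` and the record fields are hypothetical, made up for this example.

```python
# Hypothetical row-level transformation. In a real job it might be applied
# through the framework, for example with rdd.map(clean_record).
def clean_record(record):
    """Tidy the name and convert the amount from a string to a float."""
    return {
        "name": record["name"].strip().title(),
        "amount": float(record["amount"]),
    }


# A tiny, hand-made subset standing in for the full dataset (made-up values).
SAMPLE_RECORDS = [
    {"name": "  alice ", "amount": "10.50"},
    {"name": "BOB", "amount": "3"},
]


def test_clean_record_on_sample_subset():
    # The test exercises our own function only. It does not check that the
    # framework can map a function over rows; that is the framework's job.
    cleaned = [clean_record(r) for r in SAMPLE_RECORDS]
    assert cleaned[0] == {"name": "Alice", "amount": 10.5}
    assert cleaned[1] == {"name": "Bob", "amount": 3.0}
```

Because the function takes and returns plain dictionaries, the test needs no cluster or session and runs in milliseconds. Whether your own transformations can be factored this way depends on the job.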