There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
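One way to check this yourself is to print the plans for the same query written both ways. The following is only a minimal sketch: it assumes Spark 2.x or later (where `SparkSession` is the entry point), and the `people` data, view name, and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PlanComparison {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are arbitrary.
    val spark = SparkSession.builder()
      .appName("plan-comparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, registered as a temporary view for the SQL variant.
    val people = Seq(("Alice", 34), ("Bob", 19), ("Carol", 45)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // The same query, once as plain SQL and once through the DataFrame API.
    val viaSql = spark.sql("SELECT name FROM people WHERE age > 21")
    val viaApi = people.filter($"age" > 21).select("name")

    // Both pass through the same optimizer; compare the printed physical plans.
    viaSql.explain()
    viaApi.explain()

    spark.stop()
  }
}
```

For a simple query like this one, the two printed plans should be equivalent, which is what you would expect if both front ends feed the same engine.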
- Arguably, `DataFrame` queries are much easier to construct programmatically and provide a minimal degree of type safety.
- Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without modification in every supported language. With `HiveContext`, they can also expose functionality that may be inaccessible in other ways (for example, UDFs without Spark wrappers).
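To make the trade-off concrete, here is a rough sketch of building a filter programmatically with the DataFrame API, next to the equivalent SQL string. It assumes Spark 2.x or later; the helper `applyAll`, the sample rows, and the column names are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ProgrammaticFilters {
  // Hypothetical helper: AND together any number of predicates.
  // Starting from lit(true) means an empty list leaves the data unfiltered.
  def applyAll(df: DataFrame, predicates: Seq[Column]): DataFrame =
    df.filter(predicates.foldLeft(lit(true))(_ && _))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("programmatic-filters")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val people = Seq(("Alice", 34, "NL"), ("Bob", 19, "US"), ("Carol", 45, "US"))
      .toDF("name", "age", "country")

    // DataFrame API: predicates are ordinary values that can be collected and combined.
    val predicates = Seq(col("age") > 21, col("country") === "US")
    applyAll(people, predicates).show() // keeps only the row for Carol

    // Plain SQL: shorter to read, and the same string works from any supported language.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people WHERE age > 21 AND country = 'US'").show()

    spark.stop()
  }
}
```

The DataFrame version avoids string concatenation when the set of conditions is only known at runtime, while the SQL version is easier to scan when the query is fixed.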