What is the difference between RDD and DataFrame?

Answer:

RDD	DataFrame
It represents a set of records as an immutable, distributed collection of objects.	It stores data in a structure that is conceptually equivalent to a table in a relational database, but with richer optimizations.
It is an array of references to partitioned objects that together represent a large data set.	It is a distributed collection of data organized as rows with named columns.
Each dataset is logically partitioned across servers so that it can be computed on different nodes of a cluster.	It has a table-like structure whose columns can have different types, such as numeric, logical (boolean), and so on.
It supports compile-time type safety in statically typed languages such as Scala and Java, because records are ordinary typed objects.	If the user tries to access a non-existent column, the error appears only at runtime (an attribute error in PySpark, for example); there is no compile-time type safety.
RDDs can be built from almost any data source.	DataFrames read structured or semi-structured sources, such as JSON, CSV, or Avro files, as well as storage systems such as Hive, HDFS, or MySQL tables.
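
To make the comparison concrete, here is a minimal plain-Python sketch. It is not Spark code and needs no Spark installation. The names `partition`, `map_partitions`, and `ToyFrame` are hypothetical helpers invented for this illustration. The first half mimics the RDD idea of opaque records split into partitions and processed partition by partition. The second half mimics the DataFrame idea of rows with named columns, where a wrong column name is only detected when the code runs.

```python
# Toy illustration only: these helpers are NOT part of Spark's API.

def partition(records, num_partitions):
    """Split records into partitions round-robin, as a stand-in for
    the logical partitioning of an RDD across cluster nodes."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


def map_partitions(parts, fn):
    """Apply fn to every record, one partition at a time, the way each
    worker node would process only its own partition."""
    return [[fn(rec) for rec in part] for part in parts]


class ToyFrame:
    """Stand-in for a DataFrame: rows plus a schema of named columns."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]

    def select(self, name):
        # The column name is checked only when this line executes,
        # i.e. at runtime, not at compile time.
        if name not in self.columns:
            raise KeyError(f"no such column: {name}")
        idx = self.columns.index(name)
        return [row[idx] for row in self.rows]


records = [("alice", 34), ("bob", 45), ("carol", 29)]

# RDD style: the engine sees whole objects; the function passed in
# decides what a record means.
parts = partition(records, 2)
ages_by_partition = map_partitions(parts, lambda rec: rec[1])

# DataFrame style: the engine knows the column names and resolves them.
frame = ToyFrame(["name", "age"], records)
ages = frame.select("age")
```

In this sketch, calling `frame.select("salary")` raises a `KeyError` when the line runs, which mirrors the runtime-only column check described in the table. In the RDD-style half, the lambda works on whole records, so in a statically typed language a wrong field access would be caught by the compiler instead.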
