How to export data-frame from Apache Spark


Apache Spark is a great tool for working with large amounts of data, like terabytes and petabytes, in a cluster. It is also very useful on a local machine when gigabytes of data do not fit into memory. We normally use Spark for preparing data and for very basic analytic tasks. However, it does not offer advanced analytical features or visualization, so at some point you have to reduce the data to a size that fits into your computer's memory and hand it over to other tools. It turns out that Apache Spark still lacks the ability to export data in a simple format like CSV out of the box.


1. spark-csv library

I was really surprised when I realized that Spark does not ship CSV export features out of the box. It turns out that CSV support lives in an external project, spark-csv. This is a must-have library for Spark, and I find it funny that it appears to be more of a marketing plug for Databricks than an Apache Spark project.

Another surprise is that this library does not create one single file. It creates several files based on the data-frame partitioning, which means that a single data-frame can turn into several CSV files. I understand that this is a good optimization for a distributed environment, but you do not need it when you just want to extract data for R or Python scripts.

2. Export from data-frame to CSV

Let’s take a closer look at how this library works and export a CSV file from a data-frame.

First, you need to include this library in your Spark environment. From spark-shell, just add the --packages parameter:
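A minimal sketch of both steps, assuming Spark 1.x and the Databricks spark-csv package (the package coordinate and the toy data-frame are illustrative):

```scala
// Launch spark-shell with the spark-csv package (pick the artifact that
// matches your Scala and Spark versions -- this coordinate is an assumption):
//   spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

// Inside spark-shell: build a tiny data-frame and export it in CSV format
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "label")

df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("myfile.csv")
```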

This code creates a directory myfile.csv with several CSV part files and metadata files inside. If you need a single CSV file, you have to explicitly collapse the data-frame into one partition first, as shown below.
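A possible sketch, under the same assumptions: coalescing the data-frame to one partition before the write leaves a single CSV part file in the output directory (it is still a directory, though, with _SUCCESS and other metadata next to the part file).

```scala
// Force a single partition so the output directory contains one part file
df.coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("myfile.csv")
```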

So, to get one file we should export the data into a directory (the same directory-style output Spark uses for Parquet data), move the CSV part file to the correct place, and remove the directory with all the remaining files. Let’s automate this process:
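Here is one way it could look for data that fits on the local file system; the function name, the temporary-directory suffix and the defaults are illustrative, and on HDFS you would use the Hadoop FileSystem API for the move instead:

```scala
import java.io.File
import org.apache.spark.sql.DataFrame

// Write a data-frame as one CSV file: export into a temporary directory,
// move the single part file to the target path, then clean the directory up.
def saveDfToCsv(df: DataFrame, csvOutput: String,
                sep: String = ",", header: Boolean = false): Unit = {
  val tmpDir = csvOutput + ".tmp"

  df.coalesce(1).write
    .format("com.databricks.spark.csv")
    .option("delimiter", sep)
    .option("header", header.toString)
    .save(tmpDir)

  // Spark names the data file part-00000 (or similar) inside the directory
  val dir = new File(tmpDir)
  val partFile = dir.listFiles.filter(_.getName.startsWith("part-")).head
  partFile.renameTo(new File(csvOutput))

  // Remove the metadata files (_SUCCESS, CRCs) and the directory itself
  dir.listFiles.foreach(_.delete())
  dir.delete()
}

// Usage: saveDfToCsv(df, "myfile.csv", header = true)
```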

Conclusion

Apache Spark has many great aspects, but at this time it cannot be the be-all answer. Usually you have to pair Spark with analytical tools like R or Python. However, improvements are constantly being made.


Published in FullStackML

Data Science, Machine Learning, Coding & Tools [by Dmitry Petrov]

Written by Dmitry Petrov

Creator of http://dvc.org — Git for ML. Ex-Data Scientist @Microsoft. PhD in CS. Making jokes with a serious face.
