An open API service providing repository metadata for many open source software ecosystems.

GitHub / vim89 / datapipelines-essentials-python

Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, along with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

JSON API: http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vim89%2Fdatapipelines-essentials-python
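The JSON API URL above percent-encodes the slash in the `owner/name` pair as `%2F`. A minimal sketch of building and requesting that URL with the Python standard library follows. The helper names `repository_url` and `fetch_repository` are hypothetical, the URL pattern is inferred only from the single URL shown on this page, and nothing is assumed about the fields in the response.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Root taken from the JSON API URL shown on this page.
API_ROOT = "http://repos.ecosyste.ms/api/v1"


def repository_url(host: str, full_name: str) -> str:
    """Build the repository URL (hypothetical helper).

    safe="" makes quote() encode "/" as "%2F", so "owner/name"
    stays a single path segment.
    """
    return (
        f"{API_ROOT}/hosts/{quote(host, safe='')}"
        f"/repositories/{quote(full_name, safe='')}"
    )


def fetch_repository(host: str, full_name: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON document (hypothetical helper).

    The response schema is not described on this page, so the decoded
    data is returned as-is rather than picking out specific fields.
    """
    with urlopen(repository_url(host, full_name), timeout=timeout) as response:
        return json.load(response)
```

Calling `repository_url("GitHub", "vim89/datapipelines-essentials-python")` reproduces the URL listed above; `fetch_repository` with the same arguments performs the actual request and needs network access.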

Stars: 53
Forks: 34
Open issues: 1

License: apache-2.0
Language: Python
Size: 1.76 MB
Dependencies parsed at: Pending

Created at: over 5 years ago
Updated at: about 1 year ago
Pushed at: about 2 years ago
Last synced at: about 1 year ago

Topics: apache-spark, big-data, data-pipeline, datalake, etl, etl-components, etl-framework, etl-pipeline, hadoop, hadoop-hdfs, hadoop-mapreduce, pyspark, python, python3, spark, spark-sql, xml, xml-parsing
