Topic: "data-deduplication"
dpc/rdedup
Data deduplication engine, supporting optional compression and public key encryption.
Language: Rust - Size: 1010 KB - Last synced at: 4 days ago - Pushed at: over 2 years ago - Stars: 836 - Forks: 45

sail-sg/sailcraft
Data Toolkit for Sailor Language Models
Language: Python - Size: 219 KB - Last synced at: 16 days ago - Pushed at: 2 months ago - Stars: 88 - Forks: 10

jchristn/WatsonDedupe
Self-contained C# library for data deduplication using Sqlite
Language: C# - Size: 3.37 MB - Last synced at: 5 days ago - Pushed at: about 2 years ago - Stars: 36 - Forks: 5

Zabuzard/FastCDC4J
Fast and efficient content-defined chunking for data deduplication. Java implementation of FastCDC as library.
Language: Java - Size: 542 KB - Last synced at: 25 days ago - Pushed at: over 1 year ago - Stars: 21 - Forks: 4

david-siqi-liu/sparklyclean
Optimal distributed data deduplication and supervised learning pipeline using Apache Spark
Language: Scala - Size: 10.1 MB - Last synced at: over 1 year ago - Pushed at: over 4 years ago - Stars: 10 - Forks: 0

bmiller1009/deduper
General deduping engine for JDBC sources with output to JDBC/csv targets
Language: Kotlin - Size: 1.23 MB - Last synced at: 5 months ago - Pushed at: over 4 years ago - Stars: 5 - Forks: 0

gagan3012/PolyDeDupe
PolyDeDupe: Multi-Lingual Data Deduplication
Language: Python - Size: 161 KB - Last synced at: about 11 hours ago - Pushed at: about 12 hours ago - Stars: 2 - Forks: 1

dffdgdg/FindDuplicates
This project is a powerful tool for finding and analyzing duplicate files in a specified directory. The program efficiently identifies identical files based on their content, using the SHA-256 hashing algorithm. It supports configurable parameters, such as the minimum file size to check and ignoring certain…
Language: Python - Size: 0 Bytes - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 1 - Forks: 0

bevry/fellow
Fellow is a package for creating people that can be unified by their shared values via a singleton list on the class
Language: TypeScript - Size: 2.63 MB - Last synced at: 1 day ago - Pushed at: 7 months ago - Stars: 1 - Forks: 0

Anveshika06/VIT-VTAS-TY-2022 Fork of Arunav07/VIT-VTAS-TY-2022
Size: 17.1 MB - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 1 - Forks: 0

shubham-thakare/data-deduplication
A Java project that splits data using hashing techniques and removes duplicate blocks to save cloud storage. This project also uses the CloudSim framework for cloud storage simulation.
Language: Java - Size: 640 KB - Last synced at: about 2 years ago - Pushed at: over 4 years ago - Stars: 1 - Forks: 1

anirudh-69/Financial-Data-ETL-Workflow
ETL workflow for stock data processing using Mage and PostgreSQL
Language: Python - Size: 86.9 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Jim-JMCD/TestFilesMake
A test file creator for testing data storage, compression and transfer. It is a small, portable Linux executable that creates test data files filled with random, selectable printable characters or random binary data. There is a sparse file option. There is no limit on file size or number of files. Files are created in a single directory.
Size: 42 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

fabriziosalmi/text-boundaries
A Python-based tool for preprocessing, cleaning, and analyzing text datasets, designed to filter, deduplicate, sort data, and generate statistical insights.
Language: Python - Size: 6.94 MB - Last synced at: 22 days ago - Pushed at: 7 months ago - Stars: 0 - Forks: 1

KeerthanaPalanikumar/Data-Cleaning-on-SQL
This repository contains SQL scripts and documentation for cleaning and standardizing data in the NashvilleHousing table within the sqlproject2 database. The project aims to prepare the dataset for analysis by addressing inconsistencies, filling missing values, standardizing formats, and removing duplicates.
Size: 5.64 MB - Last synced at: about 2 months ago - Pushed at: 11 months ago - Stars: 0 - Forks: 0

Jim-JMCD/Data_storage_network_deduplication_calculator
A calculator for the storage and transmission of deduplicated data, with results presented in charts and tables
Size: 176 KB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

baraverkstad/mixtape
Practical backups. The Unix toolkit way.
Language: Shell - Size: 678 KB - Last synced at: about 2 years ago - Pushed at: over 7 years ago - Stars: 0 - Forks: 0

therealfun/stcas
Simplest content-addressable storage set of tools to keep space-efficient backups using data deduplication
Last synced at: over 2 years ago - Stars: 0 - Forks: 0