An open API service providing repository metadata for many open source software ecosystems.

Topic: "unstructured-data"

abdollahpour/micro-draft-manager

micro-draft-manager is a microservice that helps you to manage unstructured data in your application with sorting and full-text search

Language: Go - Size: 27.3 KB - Last synced at: almost 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 0

bengruher/SMS-Spam-Detection

Machine learning task to identify spam SMS messages. Project involves processing of noisy unstructured text and other NLP techniques.

Language: Jupyter Notebook - Size: 663 KB - Last synced at: over 2 years ago - Pushed at: over 3 years ago - Stars: 1 - Forks: 1

mware-solutions/bigconnect-docs

Documentation for the BigConnect platform

Size: 5.64 MB - Last synced at: over 2 years ago - Pushed at: over 5 years ago - Stars: 1 - Forks: 0

rudrakshsyal/Craigslist-Job-Listing-Transformation-via-Text-Modeling

Improved the quality and presentation of job listings on the Craigslist website via scraping and training data from Indeed’s job listings, to enhance user experience, drive more traffic, and thus increase revenue

Language: Jupyter Notebook - Size: 4.54 MB - Last synced at: over 1 year ago - Pushed at: over 6 years ago - Stars: 1 - Forks: 0

ihaterynn/Docling-Processor

Document Processing Script using Docling

Language: Python - Size: 4.03 MB - Last synced at: 2 days ago - Pushed at: 3 days ago - Stars: 0 - Forks: 0

instill-ai/artifact-backend

⇋ A REST/gRPC server for Instill Artifact API service

Language: Go - Size: 1.32 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 0 - Forks: 3

Analyst-Lochan/employee-health-analysis

This project showcases a complete data cleaning and basic analytics workflow on a real-world-style employee health dataset, simulating inconsistencies often found in raw data. It includes both uncleaned and cleaned Excel files, plus a pivot-based dashboard to derive insights.

Size: 3.53 MB - Last synced at: 15 days ago - Pushed at: 15 days ago - Stars: 0 - Forks: 0

lightup-data/lightudq

AI assisted data quality for unstructured data

Language: Python - Size: 1.21 MB - Last synced at: 20 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 0

samnaveenkumaroff/CuraOS

CuraOS is a fully modular, AI-powered pipeline that automates the transformation of unstructured multi-page medical records (PDFs, scanned documents) into structured and actionable electronic health records (EHRs).

Language: Python - Size: 896 KB - Last synced at: 22 days ago - Pushed at: 22 days ago - Stars: 0 - Forks: 1

spoortimorabad/Personally-Identifiable-Information-PII-

Detecting Personal Information and Masking Method

Language: Python - Size: 8.13 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 0

Francois-lenne/elt-mp4-quiberon

The goal of this project is to retrieve the video of the municipality of Quiberon and detect whether a person is in it or not

Language: Python - Size: 38.1 KB - Last synced at: about 1 month ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

b-cubed-eu/rsa-unstructured-data-comp

Scripts that compare aggregated cubes with structured monitoring schemes in South Africa

Language: R - Size: 13.1 MB - Last synced at: 27 days ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

Nan-Shen/Precise_RAG

Precisely retrieve information from a PDF file

Language: Jupyter Notebook - Size: 1.62 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 0 - Forks: 0

teragrep/rsm_01

Teragrep record schema mapper library for Java

Language: Java - Size: 53.7 KB - Last synced at: 3 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 3

THANGGI02/graph-rag

UltraRepo Graph RAG provides AI agents access to massive code, doc, and data repos via Knowledge Graphs (KG). KGs are generated in Neo4j and accessible via FastAPI and vector DBs. Provides AI agents with better accuracy, scalability, and reasoning over large repos.

Language: Jupyter Notebook - Size: 10 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

tiangenglu/data_wrangling

ETL pipelines for structured and unstructured data, worked data wrangling examples, automatic data workflows

Language: Python - Size: 393 KB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

Thehousummer233/wikipedia-ai-agent

Wikipedia AI agent research assistant. LangChain's LangGraph's ReAct agent architecture, LLMs (OpenAI, Anthropic, Google), Wikipedia API, RAG with FAISS vector db, semantic chunking, GraphRAG, Streamlit frontend, terminal and web interfaces

Size: 1.95 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

AnhDungPham2901/extract_data_from_pdf

Using an LLM to extract unstructured data from a PDF file into a structured format

Language: Jupyter Notebook - Size: 217 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

drci-foch/BTB_extraction

Transbronchial Biopsy Document restructuring. Work in progress.

Language: Jupyter Notebook - Size: 93.6 MB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

DavidMoserAI/AzureDocumentIntelligenceChunker

A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.

Language: Python - Size: 24.4 KB - Last synced at: about 2 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 1

garethcmurphy/Managing-Unstructured-Metadata-at-ESS

What is metadata? A set of data that describes and gives information about other data. It can be classified into separate types: administrative, structural, descriptive, scientific. Scientific metadata … is often notoriously incomplete. Additional quantities and assumptions necessary to interpret the data may initially only be recorded on scraps of paper, har

Language: CSS - Size: 8.12 MB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

tinaland101/UK-Food-Directory-Project

The core of this project is analyzing data from the UK Food Standards Agency. This data includes food hygiene ratings of various establishments across the UK. Based on the ratings in the data, results are chosen to surface popular food choices.

Language: Jupyter Notebook - Size: 16.6 KB - Last synced at: 5 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

SalmaSalahEldin/RAG-Powered-Educational-Assistant

Size: 54.7 KB - Last synced at: about 1 month ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

DerwenAI/cdl2024_masterclass

Connected Data London 2024, ERKG masterclass: how to generate knowledge graphs from structured and unstructured data based on entity resolution (ER) to enhance data quality for the downstream AI applications

Size: 81.1 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

am1tyadav/cosmonaut

Helping you find structure in the cosmos of data.

Language: Python - Size: 83 KB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 0 - Forks: 0

pintamonas4575/GESTBD-project-MAADM-UPM

Project for the "Gestión de sistemas de datos masivos" (Management of Massive Data Systems) course of the UPM master's program.

Language: Jupyter Notebook - Size: 1.48 MB - Last synced at: 4 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

Shivabajelan/uploading_file_to_azure_blob_using_python

In this repository, I show how to automate uploading unstructured data, such as PDF or PNG files, to Azure Blob using Python.

Size: 28.3 KB - Last synced at: 18 days ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

teragrep/dpf_03

Teragrep Tokenizer for Apache Spark

Language: Scala - Size: 78.1 KB - Last synced at: 3 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 4

shay681/Constructing-Structured-Database-from-Unstructured-Legal-Documents

This project aims to compare 3 methods for transforming unstructured textual content from Hebrew legal documents into structured data

Language: Jupyter Notebook - Size: 68.4 KB - Last synced at: 9 months ago - Pushed at: 9 months ago - Stars: 0 - Forks: 0

teragrep/blf_01

Tokenizer for Teragrep

Language: Java - Size: 9.17 MB - Last synced at: 3 months ago - Pushed at: 10 months ago - Stars: 0 - Forks: 4

SC92113/User-Analytics

My 'Out of PM scopes' data project

Language: Jupyter Notebook - Size: 3.14 MB - Last synced at: 12 months ago - Pushed at: 12 months ago - Stars: 0 - Forks: 0

nagababumo/Preprocessing-Unstructured-Data-for-LLM-Applications

Language: Jupyter Notebook - Size: 37.1 KB - Last synced at: 2 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 2

MohitWani/Unstructured-data-preprocessing-

This repository contains preprocessing of unstructured data such as images, text, and speech.

Language: Jupyter Notebook - Size: 1.76 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

instill-ai/controller-model

🎮 A controller-model manages components in Instill Model

Language: Go - Size: 351 KB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 1

NityaVerma19/Cats-vs-Dogs

Classifying 😺 and 🐶 using CNN

Language: Jupyter Notebook - Size: 2.85 MB - Last synced at: about 1 year ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

wasay8/AutomatedGarbageImageClassifier

Implementation of CNN models (ResNet-34 and ResNet-50) to classify garbage images into 6 major categories, in support of sustainable development and waste disposal.

Language: Python - Size: 8.79 KB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

jovezhong/real-time-milvus Fork of bytewax/real-time-milvus

Streaming meets LLM: Real-time Hacker News to Milvus/Zilliz with streaming SQL

Language: Python - Size: 2.27 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

airdac/MUD

Subject repository with NLP Python apps. UPC - Master's Degree in Data Science - Mining Unstructured Data - Spring 2024

Language: Jupyter Notebook - Size: 10.7 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

instill-ai/deprecated-vdp

💧 Instill VDP (Versatile Data Pipeline) is an open-source tool to seamlessly integrate AI to process unstructured data in the modern data stack

Language: Makefile - Size: 7.9 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

martinbatek/IC-UDA-Final-Project

Final Project for the Unstructured Data Analysis module in the MSc. Machine Learning and Data Science Course

Language: Jupyter Notebook - Size: 500 MB - Last synced at: 4 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

mazzasaverio/terra-text-processor

A Terraform setup for processing unstructured data on GCP with MongoDB Atlas and Confluent Kafka, featuring serverless, event-driven architecture and Cloud Run integrations.

Language: HCL - Size: 17.6 KB - Last synced at: 5 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

instill-ai/controller-vdp 📦

🎮 A controller-vdp manages components in Instill VDP

Language: Go - Size: 316 KB - Last synced at: 3 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 1

ShreyanSimhadri/21BKT0102_ML

LLM Models on Unstructured Data

Language: Python - Size: 6.84 KB - Last synced at: 9 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

kodexa-ai/kodexa-java

Kodexa Content Model and Client for Java

Language: Java - Size: 18.3 MB - Last synced at: over 1 year ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 1

KamRoki/Deep-Learning-Dog-Breed

Who's a good dog? Who likes ear scratches? Well, it seems those fancy deep neural networks don't have all the answers. However, maybe they can answer that ubiquitous question we all ask when meeting a four-legged stranger: what kind of good pup is that? This notebook builds a multi-class image classifier using TensorFlow 2.0 and TensorFlow Hub.

Language: Jupyter Notebook - Size: 6.1 MB - Last synced at: over 1 year ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 1

inuwamobarak/detecting-tables-in-documents

This repository contains code and resources for detecting tables in various types of documents using machine learning and computer vision techniques.

Language: Jupyter Notebook - Size: 1.8 MB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

janoellerich/RooTri

Language: MATLAB - Size: 124 KB - Last synced at: almost 2 years ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

perebaj/parser

Parse unstructured text using the GPT-3 API

Language: Go - Size: 1.75 MB - Last synced at: 2 months ago - Pushed at: almost 2 years ago - Stars: 0 - Forks: 0

instill-ai/controller 📦

🎮 A controller to manage all VDP states

Language: Go - Size: 281 KB - Last synced at: 8 months ago - Pushed at: about 2 years ago - Stars: 0 - Forks: 1

ujunwa-DS/UNSTRUCTURED-DATA-WHATSAPP-DATA-

Unstructured WhatsApp data was cleaned with Python and visualized with Power BI to obtain insights. Libraries such as NumPy, regex, openpyxl, and pandas were used in this project.

Language: Jupyter Notebook - Size: 209 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

instill-ai/metric-backend 📦

⇋ A REST/gRPC server for Instill AI's Metric API service

Size: 0 Bytes - Last synced at: about 1 year ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

ClaudioPoli/JobAds

Management of structured and unstructured data

Language: PLpgSQL - Size: 30.3 KB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

Mihryam/HealthNews_Tweets-ClusteringToClassification

A machine learning model that clusters health news tweets from different news sources to extrapolate categories, then uses the cluster labels for downstream classification.

Language: Jupyter Notebook - Size: 4.45 MB - Last synced at: almost 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

pedrogfleming/Snowflake-Scripts

SQL Scripts related to my learning on the Snowflake data cloud provider

Size: 3.7 MB - Last synced at: over 2 years ago - Pushed at: over 2 years ago - Stars: 0 - Forks: 0

branham-player/indexer

A parser which indexes unstructured collections of data representing William Branham's complete sermon library and structures them for loading into a data ingester

Language: JavaScript - Size: 38.1 KB - Last synced at: over 2 years ago - Pushed at: almost 3 years ago - Stars: 0 - Forks: 0

Peteresis/Movies-ETL

ETL (Extract, Transform, Load) Practice. Automate the process of reading new data, processing it, and then loading it into new SQL tables. The code uses Python, RegEx, and a SQL database to build an ETL pipeline for this project.

Language: Jupyter Notebook - Size: 2.99 MB - Last synced at: 10 days ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 1

oypark/Unstructured-data-analysis-Project

Mulcam big data project 2: unstructured data analysis

Language: Jupyter Notebook - Size: 19.6 MB - Last synced at: about 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

lilianchi/lost-or-found

A repository with our team's final Python project in the MGMT 590 Analyzing Unstructured Data course at the Krannert School of Management, Purdue University.

Language: Python - Size: 1.44 MB - Last synced at: over 1 year ago - Pushed at: over 3 years ago - Stars: 0 - Forks: 0

elalbaicin/progRchives

An R package for scraping and organizing ProgArchives data.

Language: R - Size: 3.49 MB - Last synced at: about 1 year ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

AsishMandoi/quantum-search

A quantum circuit that takes a list of numbers and returns a quantum state which is a superposition of indices of those numbers that follow a given pattern

Language: Jupyter Notebook - Size: 919 KB - Last synced at: over 2 years ago - Pushed at: almost 4 years ago - Stars: 0 - Forks: 0

sdurancmu/disaster_tweets

Multiple approaches to predicting disaster tweets on a Kaggle dataset

Language: Jupyter Notebook - Size: 133 MB - Last synced at: 6 months ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 1

bartczernicki/Documents-Forms

Collection of various documents and forms that can be used by AI services & systems for training

Size: 26.2 MB - Last synced at: over 2 years ago - Pushed at: about 4 years ago - Stars: 0 - Forks: 0

bhattsahil1/smart-xtractor

Language: Python - Size: 3.45 MB - Last synced at: about 1 year ago - Pushed at: over 4 years ago - Stars: 0 - Forks: 1

roshni-b/Log-Parser

Modular log parser that parses @nasa's Apache logs and processes them.

Language: Python - Size: 30.3 MB - Last synced at: about 1 year ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

krishcy25/SentimentMining-UsingPython-WordCloud-and-TextHero

Sentiment Mining (unstructured data): this repository focuses on creating a word cloud (with the most frequent/significant words), a list of top words by product, and K-Means and PCA plots of the reviews by topic category, as derived from a textual analysis of Amazon customer reviews of electronic products

Language: Jupyter Notebook - Size: 3.85 MB - Last synced at: 11 months ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

as2leung/web_scrape_postal_office_address

A web scraping project that retrieves the post office locations from a search engine result and outputs the data in a cleaned dataframe

Language: Jupyter Notebook - Size: 35.2 KB - Last synced at: about 2 years ago - Pushed at: about 5 years ago - Stars: 0 - Forks: 0

rgdeekshith/zero-to-mastery-ml Fork of mrdbourke/zero-to-mastery-ml

All course materials for ZTM ML on Udemy

Size: 129 MB - Last synced at: almost 2 years ago - Pushed at: about 5 years ago - Stars: 0 - Forks: 0

tejasshahu/Data_Science_Machine_Learning

This repository is all about Data Science and Machine Learning.

Language: Jupyter Notebook - Size: 33.7 MB - Last synced at: about 1 year ago - Pushed at: over 5 years ago - Stars: 0 - Forks: 0

wotchin/PostVector

PostVector: unstructured and vector retrieval database extension to PostgreSQL.

Size: 13.7 KB - Last synced at: over 2 years ago - Pushed at: about 6 years ago - Stars: 0 - Forks: 0

jaydeepdevda/NLP-AccessingTextData

Python code to access large text (at least 10 pages) from a .txt file, MS Word document, PDF file, Wikipedia page, or 500 tweets.

Language: HTML - Size: 750 KB - Last synced at: about 2 years ago - Pushed at: over 6 years ago - Stars: 0 - Forks: 1

rosette-api-community/rosette-for-docs

Google Docs add-on offering users the ability to extract entities, translate names, and research entities on Wikipedia from within their multilingual document.

Language: JavaScript - Size: 18.6 KB - Last synced at: 5 months ago - Pushed at: about 9 years ago - Stars: 0 - Forks: 1