GitHub topics: unstructured-data

Repositories

shcherbak-ai/contextgem

ContextGem: Effortless LLM extraction from documents

Language: Python - Size: 11.4 MB - Last synced at: 43 minutes ago - Pushed at: 5 days ago - Stars: 809 - Forks: 53

nuclia/nucliadb

NucliaDB, The AI Search database for RAG

Language: Python - Size: 40.2 MB - Last synced at: about 10 hours ago - Pushed at: about 10 hours ago - Stars: 695 - Forks: 54

neo4j-labs/llm-graph-builder

Neo4j graph construction from unstructured data using LLMs

Language: Jupyter Notebook - Size: 52.8 MB - Last synced at: 32 minutes ago - Pushed at: about 8 hours ago - Stars: 3,452 - Forks: 586

nomic-ai/nomic

Interact, analyze and structure massive text, image, embedding, audio and video datasets

Language: Python - Size: 24.2 MB - Last synced at: about 3 hours ago - Pushed at: about 24 hours ago - Stars: 1,682 - Forks: 186

iterative/dvc

🦉 Data Versioning and ML Experiments

Language: Python - Size: 19.5 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 14,444 - Forks: 1,210

kuzudb/baml-kuzu-demo

Demo of knowledge graph creation and Graph RAG with BAML and Kuzu

Language: Python - Size: 3.46 MB - Last synced at: about 22 hours ago - Pushed at: 2 months ago - Stars: 31 - Forks: 3

instill-ai/console

📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core

Language: TypeScript - Size: 12.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 39 - Forks: 10

voxel51/fiftyone

Refine high-quality datasets and visual AI models

Language: Python - Size: 1.92 GB - Last synced at: 1 day ago - Pushed at: 3 days ago - Stars: 9,467 - Forks: 629

instill-ai/model-backend

⇋ A REST/gRPC server for Instill Model API service

Language: Go - Size: 19.9 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 17 - Forks: 7

Toschu95/my-benefit-finder-vienna

My Benefit Finder Vienna is an AI-powered system designed to help individuals in Vienna quickly find and apply for relevant social benefits, grants, and subsidies. Using RAG (Retrieval-Augmented Generation) and a Large Language Model (LLM), this tool provides personalized recommendations based on the latest available data from official sources.

Language: Jupyter Notebook - Size: 637 KB - Last synced at: 2 days ago - Pushed at: 2 days ago - Stars: 1 - Forks: 0

JSv4/OpenContracts

Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

Language: Python - Size: 124 MB - Last synced at: 3 days ago - Pushed at: 3 days ago - Stars: 854 - Forks: 83

aclai-lab/SoleData.jl

Manage logical datasets!

Language: Julia - Size: 1.88 MB - Last synced at: 4 days ago - Pushed at: 4 days ago - Stars: 13 - Forks: 2

mitdbg/palimpzest

A System for (Optimized) Semantic Computation

Language: Python - Size: 361 MB - Last synced at: about 6 hours ago - Pushed at: 13 days ago - Stars: 108 - Forks: 20

NanoNets/docext

An on-premises, OCR-free unstructured data extraction tool powered by vision language models.

Language: Python - Size: 1.84 MB - Last synced at: 4 days ago - Pushed at: 5 days ago - Stars: 120 - Forks: 8

Zipstack/unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

Language: Python - Size: 32.2 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 5,182 - Forks: 471

Zipstack/unstract-sdk

A framework for writing Unstract Tools/Apps

Language: Python - Size: 3.42 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 19 - Forks: 1

lotus-data/lotus

LOTUS: A semantic query engine for fast and easy LLM-powered data processing

Language: Python - Size: 1.47 MB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 1,173 - Forks: 100

graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

Language: TypeScript - Size: 304 KB - Last synced at: 5 days ago - Pushed at: 5 days ago - Stars: 244 - Forks: 28

CambioML/any-parser

Accurate, private and configurable document retrieval LLM

Language: Python - Size: 22.2 MB - Last synced at: 1 day ago - Pushed at: 24 days ago - Stars: 123 - Forks: 11

databricks/lilac

Curate better data for LLMs

Language: Python - Size: 37 MB - Last synced at: 6 days ago - Pushed at: about 1 year ago - Stars: 1,033 - Forks: 100

tstanislawek/awesome-document-understanding

A curated list of resources on the topic of Document Understanding (DU)

Size: 5.56 MB - Last synced at: 6 days ago - Pushed at: almost 2 years ago - Stars: 1,405 - Forks: 160

ScrapeGraphAI/Scrapontologies

Python library for extracting entities, relationships, and schemas from documents

Language: Python - Size: 688 KB - Last synced at: 5 days ago - Pushed at: 5 months ago - Stars: 39 - Forks: 2

Francois-lenne/elt-mp4-quiberon

The goal of this project is to retrieve video from the municipality of Quiberon and determine whether a person appears in it

Language: Python - Size: 38.1 KB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 0 - Forks: 0

amphi-ai/amphi-etl

Visual Data Preparation and Transformation. Low-Code Python-based ETL.

Language: TypeScript - Size: 1.54 MB - Last synced at: 7 days ago - Pushed at: 9 days ago - Stars: 1,052 - Forks: 62

milvus-io/bootcamp

Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.

Language: HTML - Size: 213 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 2,102 - Forks: 626

dingodb/dingo

A multi-modal vector database that supports upserts and vector queries using unified SQL (MySQL-Compatible) on structured and unstructured data, while meeting the requirements of high concurrency and ultra-low latency.

Language: Java - Size: 26.9 MB - Last synced at: 8 days ago - Pushed at: 8 days ago - Stars: 1,434 - Forks: 246

instill-ai/instill-core

🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications

Language: Makefile - Size: 10.8 MB - Last synced at: 6 days ago - Pushed at: 7 days ago - Stars: 2,243 - Forks: 112

instill-ai/.github

🏡 Instill AI organisation profile and default configuration

Size: 52.4 MB - Last synced at: 12 days ago - Pushed at: 12 days ago - Stars: 1 - Forks: 0

kodexa-ai/kodexa-cli

Command Line Tools for Kodexa

Language: Python - Size: 1.15 MB - Last synced at: 6 days ago - Pushed at: 6 days ago - Stars: 1 - Forks: 1

RelevanceAI/relevanceai

Home of the AI workforce - Multi-agent system, AI agents & tools

Language: Python - Size: 68.9 MB - Last synced at: 9 days ago - Pushed at: 9 days ago - Stars: 230 - Forks: 34

instill-ai/cli

⌨️ Instill CLI for 🔮 Instill Core: https://github.com/instill-ai/instill-core

Language: Go - Size: 630 KB - Last synced at: 14 days ago - Pushed at: 14 days ago - Stars: 22 - Forks: 3

instill-ai/pipeline-backend

⇋ A REST/gRPC server for Instill VDP API service

Language: Go - Size: 74 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 26 - Forks: 21

instill-ai/mgmt-backend

⇋ A REST/gRPC server for Instill AI's Management API service

Language: Go - Size: 1.15 MB - Last synced at: 1 day ago - Pushed at: 1 day ago - Stars: 1 - Forks: 2

Nan-Shen/Precise_RAG

Precisely retrieve information from PDF files

Language: Jupyter Notebook - Size: 1.62 MB - Last synced at: 17 days ago - Pushed at: 17 days ago - Stars: 0 - Forks: 0

instill-ai/artifact-backend

⇋ A REST/gRPC server for Instill Artifact API service

Language: Go - Size: 1.05 MB - Last synced at: 19 days ago - Pushed at: 20 days ago - Stars: 0 - Forks: 3

garyelephant/pygrok

Python implementation of jordansissel's grok regular expression library

Language: Python - Size: 66.4 KB - Last synced at: 3 days ago - Pushed at: over 1 year ago - Stars: 279 - Forks: 75

EulerSearch/embedding_studio

Embedding Studio is a framework that lets you transform your vector database into a feature-rich search engine.

Language: Python - Size: 10.2 MB - Last synced at: 20 days ago - Pushed at: 21 days ago - Stars: 380 - Forks: 5

KatelynFaulkner/rsa-unstructured-data-comp

Scripts that compare aggregated cubes with structured monitoring schemes in South Africa

Language: HTML - Size: 13.1 MB - Last synced at: 21 days ago - Pushed at: 21 days ago - Stars: 0 - Forks: 0

towhee-io/towhee

Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

Language: Python - Size: 37.2 MB - Last synced at: 20 days ago - Pushed at: 7 months ago - Stars: 3,358 - Forks: 258

instill-ai/helm-charts

⎈ The Helm charts of Instill AI

Size: 205 KB - Last synced at: 27 days ago - Pushed at: 28 days ago - Stars: 2 - Forks: 1

BartJongejan/Bracmat

Programming language for symbolic computation with unusual combination of pattern matching features: Tree patterns, associative patterns and expressions embedded in patterns.

Language: C - Size: 23.9 MB - Last synced at: 4 days ago - Pushed at: about 1 month ago - Stars: 47 - Forks: 5

Renumics/spotlight

Interactively explore unstructured datasets from your dataframe.

Language: TypeScript - Size: 46.8 MB - Last synced at: 30 days ago - Pushed at: 3 months ago - Stars: 1,164 - Forks: 86

yobix-ai/extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

Language: Rust - Size: 2.88 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 1,051 - Forks: 43

teragrep/rsm_01

Teragrep record schema mapper library for Java

Language: Java - Size: 53.7 KB - Last synced at: 21 days ago - Pushed at: about 1 month ago - Stars: 0 - Forks: 3

osllmai/inDox

The Indox Ecosystem offers integrated AI tools for data workflows. Our four components (IndoxArcg, IndoxMiner, IndoxJudge, and IndoxGen) enhance AI applications with advanced retrieval, extraction, evaluation, and generation capabilities, supporting multiple document formats and LLM providers.

Language: Jupyter Notebook - Size: 106 MB - Last synced at: 25 days ago - Pushed at: 25 days ago - Stars: 20 - Forks: 2

harishdeivanayagam/rowfill

Open-source unstructured data (PDFs, images, audio files) processing platform built for knowledge workers

Language: TypeScript - Size: 1.2 MB - Last synced at: 24 days ago - Pushed at: about 2 months ago - Stars: 275 - Forks: 14

velocitybolt/open-extract

Structured Data Extractor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.

Language: Python - Size: 8.91 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 162 - Forks: 13

b-cubed-eu/comp-unstructured-data

Scripts to explore the conditions that determine the reliability of models, trends and status by comparing aggregated cubes with structured monitoring schemes

Language: R - Size: 1.69 MB - Last synced at: 29 days ago - Pushed at: about 1 month ago - Stars: 2 - Forks: 0

THANGGI02/graph-rag

UltraRepo Graph RAG provides AI agents access to massive code, doc, and data repos via Knowledge Graphs (KG). KGs are generated in Neo4j and accessible via FastAPI and vector DBs. Provides AI agents with better accuracy, scalability, and reasoning over large repos.

Language: Jupyter Notebook - Size: 10 MB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

tiangenglu/data_wrangling

ETL-pipelines for structured and unstructured data, data wrangling worked examples, automatic data workflows

Language: Python - Size: 393 KB - Last synced at: about 2 months ago - Pushed at: about 2 months ago - Stars: 0 - Forks: 0

Thehousummer233/wikipedia-ai-agent

Wikipedia AI agent research assistant. LangChain's LangGraph's ReAct agent architecture, LLMs (OpenAI, Anthropic, Google), Wikipedia API, RAG with FAISS vector db, semantic chunking, GraphRAG, Streamlit frontend, terminal and web interfaces

Size: 1.95 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

NTDLS/Katzebase

ACID compliant document-based database engine with SQL language, APIs and Management UI.

Language: C# - Size: 33.3 MB - Last synced at: 30 days ago - Pushed at: 2 months ago - Stars: 6 - Forks: 1

AnhDungPham2901/extract_data_from_pdf

Using an LLM to extract unstructured data from PDF files into a structured format

Language: Jupyter Notebook - Size: 217 KB - Last synced at: 2 months ago - Pushed at: 2 months ago - Stars: 0 - Forks: 0

Clarifai/clarifai-python-datautils

Extract, transform, and load unstructured data into Clarifai's AI platform

Language: Python - Size: 1.05 MB - Last synced at: about 1 month ago - Pushed at: about 1 month ago - Stars: 6 - Forks: 0

FroCode/Realtime_Streaming_Unstructured-Data

Real-time streaming and processing of unstructured data (spark, airflow)

Language: Python - Size: 128 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 1 - Forks: 0

libraryofcelsus/LLM_File_Parser

AutoML/Unstructured Data Processing for RAG and LLM Dataset Creation. Current Database Options are: Qdrant or Marqo DB.

Language: Python - Size: 43 KB - Last synced at: 3 days ago - Pushed at: 10 months ago - Stars: 6 - Forks: 1

SupermatAI/supermat

Novel data representation leading to granular citations and higher accuracy

Language: Python - Size: 5.57 MB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 4 - Forks: 1

kodexa-ai/kodexa

Kodexa Python Client

Language: Python - Size: 10.8 MB - Last synced at: 7 days ago - Pushed at: 7 days ago - Stars: 4 - Forks: 1

lazyhope/metamodel

Intelligent Schema Designer and Unstructured Data Parser

Language: JavaScript - Size: 164 KB - Last synced at: 3 months ago - Pushed at: 3 months ago - Stars: 13 - Forks: 0

pintamonas4575/GESTBD-project-MAADM-UPM

Project for the "Management of Massive Data Systems" ("Gestión de sistemas de datos masivos") course in the UPM master's program.

Language: Jupyter Notebook - Size: 1.48 MB - Last synced at: about 1 month ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

teragrep/blf_01

Tokenizer for Teragrep

Language: Java - Size: 9.17 MB - Last synced at: 21 days ago - Pushed at: 7 months ago - Stars: 0 - Forks: 4

teragrep/dpf_03

Teragrep Tokenizer for Apache Spark

Language: Scala - Size: 78.1 KB - Last synced at: 21 days ago - Pushed at: 6 months ago - Stars: 0 - Forks: 4

SachinKalsi/html_tag_annotator

A machine learning tool for creating training datasets quickly and easily using a smart Chrome extension

Language: JavaScript - Size: 11.8 MB - Last synced at: about 1 month ago - Pushed at: over 2 years ago - Stars: 14 - Forks: 3

drci-foch/BTB_extraction

Restructuring of transbronchial biopsy documents. Work in progress.

Language: Jupyter Notebook - Size: 93.6 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

nicbet/infozilla

The infoZilla unstructured software engineering data mining tool. It can find and extract source code regions, patches, stack traces, enumerations and itemizations from discussion threads.

Language: Java - Size: 530 KB - Last synced at: about 1 month ago - Pushed at: over 6 years ago - Stars: 15 - Forks: 2

DavidMoserAI/AzureDocumentIntelligenceChunker

A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.

Language: Python - Size: 24.4 KB - Last synced at: 21 days ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

instill-ai/controller-vdp 📦

🎮 A controller-vdp manages components in Instill VDP

Language: Go - Size: 316 KB - Last synced at: 24 days ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 1

tinaland101/UK-Food-Directory-Project

The core of this project is an analysis of data from the UK Food Standards Agency, which includes food hygiene ratings of establishments across the UK. Results are selected based on these ratings to identify popular food choices.

Language: Jupyter Notebook - Size: 16.6 KB - Last synced at: 2 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

Zipstack/unstract-adapters

Unstract's interface to LLMs, Embeddings and VectorDBs.

Language: Python - Size: 632 KB - Last synced at: 10 days ago - Pushed at: 10 months ago - Stars: 18 - Forks: 3

IBM/pixiedust-facebook-analysis 📦

A Jupyter notebook that uses the Watson Visual Recognition and Natural Language Understanding services to enrich Facebook Analytics and uses Cognos Dashboard Embedded to explore and visualize the results in Watson Studio

Language: Jupyter Notebook - Size: 6.67 MB - Last synced at: 7 days ago - Pushed at: 5 months ago - Stars: 44 - Forks: 64

garethcmurphy/Managing-Unstructured-Metadata-at-ESS

What is metadata? A set of data that describes and gives information about other data. It can be classified into separate types: administrative, structural, descriptive, and scientific. Scientific metadata … is often notoriously incomplete. Additional quantities and assumptions necessary to interpret the data may initially only be recorded on scraps of paper, har

Language: CSS - Size: 8.12 MB - Last synced at: 4 months ago - Pushed at: 4 months ago - Stars: 0 - Forks: 0

faisalman/re-parse-js

Compose structured data from unstructured text using regex-based pattern matching, as found in UAParser.js

Language: TypeScript - Size: 31.3 KB - Last synced at: 25 days ago - Pushed at: 5 months ago - Stars: 1 - Forks: 1

ewdlop/X-File

https://en.wikipedia.org/wiki/The_X-Files

Size: 166 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

wangxb96/RAG-QA-Generator

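RAG-QA-Generator is an automated knowledge base construction and management tool for retrieval-augmented generation (RAG) systems. It reads document data, uses large language models to generate high-quality question-answer (QA) pairs, and inserts them into a database, automating the construction and management of the RAG system's knowledge base.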

Language: Python - Size: 1.72 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 62 - Forks: 6

SalmaSalahEldin/RAG-Powered-Educational-Assistant

Size: 54.7 KB - Last synced at: 3 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

DerwenAI/cdl2024_masterclass

Connected Data London 2024, ERKG masterclass: how to generate knowledge graphs from structured and unstructured data based on entity resolution (ER) to enhance data quality for the downstream AI applications

Size: 81.1 KB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 0 - Forks: 0

kaloslazo/PyFuseDB

Database system that combines structured data retrieval through inverted indexes with unstructured data (images, audio) search using multidimensional vector embeddings, all within a unified platform.

Language: Python - Size: 631 MB - Last synced at: about 1 month ago - Pushed at: 5 months ago - Stars: 3 - Forks: 0

floriancochard/extract-data-from-paper

A tool designed to extract numerical data from scanned historical weather documents.

Language: Python - Size: 151 MB - Last synced at: 5 months ago - Pushed at: 5 months ago - Stars: 13 - Forks: 2

MoinDalvs/Resume_Screening_and_Parser

Business objective: the document classification solution should significantly reduce manual human effort in HRM. It should achieve a higher level of accuracy and automation with minimal human intervention. Sample data set details: resumes and financial documents.

Language: Jupyter Notebook - Size: 95.9 MB - Last synced at: 20 days ago - Pushed at: over 2 years ago - Stars: 7 - Forks: 2

IBM/generate-insights-from-data-formats-with-watson 📦

How do we process data in different formats, such as docx and pdf, and generate insights to be linked with structured data in a database? This pattern helps establish relations between structured and unstructured data to generate recommendations using Watson NLU and Watson Studio.

Language: Jupyter Notebook - Size: 1.06 MB - Last synced at: 7 days ago - Pushed at: almost 5 years ago - Stars: 14 - Forks: 14

am1tyadav/cosmonaut

Helping you find structure in the cosmos of data.

Language: Python - Size: 83 KB - Last synced at: 6 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

Shivabajelan/uploading_file_to_azure_blob_using_python

In this repository, I will show how we can automate uploading unstructured data such as pdf or png files to Azure Blob using Python.

Size: 28.3 KB - Last synced at: 2 months ago - Pushed at: 6 months ago - Stars: 0 - Forks: 0

shay681/Constructing-Structured-Database-from-Unstructured-Legal-Documents

This project aims to compare 3 methods for transforming unstructured textual content from Hebrew legal documents into structured data

Language: Jupyter Notebook - Size: 68.4 KB - Last synced at: 7 months ago - Pushed at: 7 months ago - Stars: 0 - Forks: 0

hupe1980/go-textractor

📄 Amazon Textract response parser written in Go.

Language: Go - Size: 6.24 MB - Last synced at: 28 days ago - Pushed at: over 1 year ago - Stars: 4 - Forks: 0

ShreyanSimhadri/21BKT0102_ML

LLM Models on Unstructured Data

Language: Python - Size: 6.84 KB - Last synced at: 6 months ago - Pushed at: over 1 year ago - Stars: 0 - Forks: 0

krishcy25/SentimentMining-UsingPython-WordCloud-and-TextHero

Sentiment mining (unstructured data): this repository focuses on creating a word cloud of the most frequent and significant words, a list of top words by product, and K-Means and PCA plots of reviews by topic category, all derived from textual analysis of Amazon customer reviews of electronic products

Language: Jupyter Notebook - Size: 3.85 MB - Last synced at: 8 months ago - Pushed at: almost 5 years ago - Stars: 0 - Forks: 0

DerwenAI/strwythura

How to construct knowledge graphs from unstructured data sources

Language: Jupyter Notebook - Size: 1.22 MB - Last synced at: 8 months ago - Pushed at: 8 months ago - Stars: 67 - Forks: 6

wasay8/AutomatedGarbageImageClassifier

Implementation of CNN models (ResNet-34 and ResNet-50) to classify garbage images into 6 major categories, in support of sustainable development and proper disposal.

Language: Python - Size: 8.79 KB - Last synced at: about 2 months ago - Pushed at: about 1 year ago - Stars: 0 - Forks: 0

yeisonmontoya1815/Special-Topics-in-Data-Analytics

In my PDD Data Analytics studies at Douglas College, the Special Topics course stands out as a crucial component. This specialized module delves into advanced aspects of data analysis beyond the core curriculum, offering a deep exploration of intricate domains. Through this focused study, I aim to enhance my proficiency in handling complex datasets

Language: Jupyter Notebook - Size: 15.2 MB - Last synced at: about 2 months ago - Pushed at: 9 months ago - Stars: 1 - Forks: 0