Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: tika

dadoonet/fscrawler

Elasticsearch File System Crawler (FS Crawler)

Language: Java - Size: 14.6 MB - Last synced: about 5 hours ago - Pushed: about 5 hours ago - Stars: 1,309 - Forks: 294

apache/tika-helm

A Helm chart to deploy Apache Tika on Kubernetes.

Language: Smarty - Size: 86.9 KB - Last synced: about 11 hours ago - Pushed: about 19 hours ago - Stars: 20 - Forks: 15

OpenSextant/Xponents

Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters.

Language: Java - Size: 78.5 MB - Last synced: 24 days ago - Pushed: 25 days ago - Stars: 42 - Forks: 7

sesam-community/content-extractor Fork of sesam-io/content-extraction-service

Extract textual information using the Apache Tika library from JSON streams

Language: Java - Size: 23.4 KB - Last synced: 2 days ago - Pushed: about 7 years ago - Stars: 0 - Forks: 0

apache/tika-docker

Convenience Docker images for Apache Tika Server

Language: Shell - Size: 95.7 KB - Last synced: about 10 hours ago - Pushed: about 1 month ago - Stars: 102 - Forks: 58

albertus82/extfix

File Extension Fix Tool - Find and rename files with wrong extensions.

Language: Java - Size: 10.9 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 0 - Forks: 0

shelfio/tika-text-extract

Extract text from a document using Apache Tika

Language: TypeScript - Size: 318 KB - Last synced: 4 days ago - Pushed: 5 days ago - Stars: 15 - Forks: 4

kairohm/tikatree

Directory tree metadata parser using Apache Tika

Language: Python - Size: 42 KB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 3 - Forks: 0

TYPO3-Solr/ext-tika

A TYPO3 CMS extension that provides Apache Tika functionality

Language: PHP - Size: 2.07 MB - Last synced: 3 days ago - Pushed: 7 days ago - Stars: 6 - Forks: 29

kestra-io/plugin-tika

Language: Java - Size: 3.41 MB - Last synced: 7 days ago - Pushed: 8 days ago - Stars: 2 - Forks: 2

AlexsJones/kubernetes-apache-tika

Apache Tika, the attachment processor

Language: Shell - Size: 3.91 KB - Last synced: 8 days ago - Pushed: over 5 years ago - Stars: 1 - Forks: 1

commitd/krill

Improved HTML output for Tika extraction

Language: Java - Size: 1.92 MB - Last synced: 9 days ago - Pushed: over 1 year ago - Stars: 4 - Forks: 2

rse/tika-server

Apache Tika Server as a Background Service in Node.js

Language: JavaScript - Size: 75.2 KB - Last synced: 8 days ago - Pushed: about 1 month ago - Stars: 18 - Forks: 5

hmmh/typo3-solr-file-indexer

TYPO3 Extension: solr_file_indexer

Language: PHP - Size: 466 KB - Last synced: 13 days ago - Pushed: 7 months ago - Stars: 9 - Forks: 6

bcgov/nr-bcws-opensearch

opensearch related code

Language: Java - Size: 395 MB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 1 - Forks: 7

ICIJ/extract

A cross-platform command line tool for parallelised content extraction and analysis.

Language: Java - Size: 69.4 MB - Last synced: 16 days ago - Pushed: 16 days ago - Stars: 233 - Forks: 30

kressi/search-media

Parse media files with Apache Tika, add documents to Lucene index and query this index.

Language: Scala - Size: 30.3 MB - Last synced: 16 days ago - Pushed: about 7 years ago - Stars: 1 - Forks: 0

quarkiverse/quarkus-tika

Quarkus Tika extension

Language: Java - Size: 619 KB - Last synced: 16 days ago - Pushed: 16 days ago - Stars: 10 - Forks: 12

riccardo1980/simple-extractor

Simple test for document extractor

Language: Java - Size: 16.6 KB - Last synced: 18 days ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0

shebinleo/pdf2html

pdf2html is a module that converts PDF files to HTML pages using Apache Tika. It can also generate thumbnail images for PDF files using Apache PDFBox.

Language: JavaScript - Size: 939 KB - Last synced: 1 day ago - Pushed: 4 months ago - Stars: 138 - Forks: 29

juanpablo-santos/jspwiki-tika-searchprovider

Apache JSPWiki tika search provider integration sample

Size: 7.81 KB - Last synced: 20 days ago - Pushed: about 5 years ago - Stars: 1 - Forks: 0

fedelemantuano/tika-app-python

Python bindings for Apache Tika

Language: Python - Size: 244 KB - Last synced: 8 days ago - Pushed: over 3 years ago - Stars: 20 - Forks: 7

M-Haertling/WorkforceResearchGuide

This is a UTDallas senior design project developed for Alliance Data. Its purpose is to provide a more robust system for searching through a document repository. This is achieved through high level indexing and the addition of a tagging system. This is a Maven project. Third party libraries used include Apache Lucene, Apache Tika, and SQLite.

Language: Perl - Size: 43.3 MB - Last synced: 22 days ago - Pushed: about 7 years ago - Stars: 0 - Forks: 0

chrismattmann/tika-similarity

Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.

Language: Python - Size: 3.2 MB - Last synced: 8 days ago - Pushed: about 2 months ago - Stars: 102 - Forks: 59

DFKI/leechcrawler

Incremental crawling capabilities for Apache Tika. Crawl content out of e.g. file systems, http(s) sources (web crawling), imap(s) servers, or your own arbitrary data sources. LeechCrawler offers additional Tika parsers providing these crawling capabilities.

Language: Java - Size: 95.2 MB - Last synced: about 1 month ago - Pushed: 5 months ago - Stars: 8 - Forks: 5

apache/tika

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Language: Java - Size: 231 MB - Last synced: 27 days ago - Pushed: 28 days ago - Stars: 2,137 - Forks: 740

sarbanandabhikkhu/tipitaka-xml

Roman Tipitaka (CSCD)

Language: JavaScript - Size: 55.6 MB - Last synced: 29 days ago - Pushed: 30 days ago - Stars: 1 - Forks: 0

USCDataScience/sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Language: Java - Size: 23.1 MB - Last synced: 25 days ago - Pushed: about 1 year ago - Stars: 409 - Forks: 142

chrismattmann/MLwithTensorFlow2ed

Code for Machine Learning with TensorFlow: 2nd Edition Published by Manning Publications

Language: Jupyter Notebook - Size: 546 MB - Last synced: 8 days ago - Pushed: over 1 year ago - Stars: 134 - Forks: 68

abhayalekal74/NLP-Information-Extraction

Extracting information from PDF files.

Language: Python - Size: 3.78 MB - Last synced: about 1 month ago - Pushed: about 5 years ago - Stars: 1 - Forks: 0

chrismattmann/imagecat

ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files (images, but could be extended to other files) in place, and to extract metadata and OCR information from those files/images using Tika and Tesseract OCR.

Language: Java - Size: 175 MB - Last synced: 8 days ago - Pushed: over 5 years ago - Stars: 94 - Forks: 40

alexferl/tika

Golang client for Apache Tika

Language: Go - Size: 11.7 KB - Last synced: about 2 months ago - Pushed: over 6 years ago - Stars: 6 - Forks: 1

StegarescuAnaMaria/Java_Indexer_and_Searcher

This project is a simulation of a search engine which outputs the path of the documents based on the search string query input.

Language: Java - Size: 15.6 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0

nasa-jpl-memex/memex-explorer

Viewers for statistics and dashboarding of Domain Search Engine data

Language: Python - Size: 14 MB - Last synced: about 3 hours ago - Pushed: over 8 years ago - Stars: 121 - Forks: 69

sergio11/struts2-hibernate

This project demonstrates building a web application with Struts2, Apache Tika, Hibernate, and Wildfly 10. 🚀 Users can upload PDF files, extract text content using Apache Tika, and store metadata in a database using Hibernate. 🔒 Additionally, the project provides instructions for setting up a JDBC Realm on Wildfly 10 for enhanced security.

Language: Java - Size: 140 KB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 0 - Forks: 0

USCDataScience/tika-dockers

A suite of Machine Learning / Deep Learning Dockerfiles to allow Apache Tika to extract objects and to produce textual captions for images and video

Size: 21.5 KB - Last synced: 8 days ago - Pushed: about 1 month ago - Stars: 20 - Forks: 6

KevM/tikaondotnet

Use the Java Tika text extraction library on the .NET platform

Language: Rich Text Format - Size: 155 MB - Last synced: 11 days ago - Pushed: 27 days ago - Stars: 193 - Forks: 73

wbicode/TikaService

A Windows service wrapper for the Tika JSR 311 network server.

Language: Batchfile - Size: 305 KB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 1 - Forks: 0

Dimous/tsundoku

Book Management System for e-bibliomaniacs

Language: Java - Size: 89.8 KB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0

tspannhw/nifi-extracttext-processor

Apache NiFi Custom Processor Extracting Text From Files with Apache Tika

Language: Java - Size: 891 KB - Last synced: 24 days ago - Pushed: 9 months ago - Stars: 34 - Forks: 29

vaites/php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

Language: PHP - Size: 13.8 MB - Last synced: 15 days ago - Pushed: 8 months ago - Stars: 111 - Forks: 21

sergio11/document_search_engine_architecture

📄🚀 Unleash a powerful Document Search Engine with Apache NiFi for lightning-fast, comprehensive text indexing and search.

Language: Java - Size: 13.4 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 22 - Forks: 9

arquivo/dspace-link-extractor

Extracts links from DSpace repositories

Language: Java - Size: 62.9 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0

welle/JTika

Quick and dirty project to generate a Java enumeration class for all MIME types in Apache Tika.

Language: Java - Size: 624 KB - Last synced: 6 months ago - Pushed: almost 7 years ago - Stars: 0 - Forks: 0

nasa-jpl-memex/image_space

Interactive Image similarity and Visual Search and Retrieval application

Language: JavaScript - Size: 2.25 MB - Last synced: 7 months ago - Pushed: about 1 year ago - Stars: 93 - Forks: 46

alexoley/ReadWithMeBot

Telegram bot available under the username @ReadWithMeBot

Language: Kotlin - Size: 151 KB - Last synced: 7 months ago - Pushed: over 3 years ago - Stars: 0 - Forks: 0

phantom0301/MetaSpider

Web crawler for rich text meta information based on Python and Tika

Language: Python - Size: 9.77 KB - Last synced: 7 months ago - Pushed: almost 6 years ago - Stars: 3 - Forks: 2

tirthmehta/Apache-Solr-based-Web-Search-Engine

Deployment of a search engine utilizing Apache Solr, Apache Tika and spelling correction programs.

Size: 14.6 KB - Last synced: 7 months ago - Pushed: almost 7 years ago - Stars: 0 - Forks: 0

mrcsparker/ruby_tika_app

A Ruby wrapper for the Tika jar (tika-app.jar) that extracts text from files in many formats (PDF, XLS, DOC, etc.)

Language: DIGITAL Command Language - Size: 415 MB - Last synced: 6 days ago - Pushed: over 1 year ago - Stars: 26 - Forks: 20

Keerthivasan13/CSCI572-Information_Retrieval_And_Web_Search_Engines

Search Engine projects

Language: Java - Size: 34.5 MB - Last synced: 7 months ago - Pushed: almost 4 years ago - Stars: 11 - Forks: 17

nasa-jpl-memex/GeoPath-Clustering

To cluster geo paths that travel very similar paths

Language: HTML - Size: 10.5 MB - Last synced: 7 months ago - Pushed: almost 6 years ago - Stars: 5 - Forks: 7

nasa-jpl-memex/GeoParser

Extract and Visualize location from any file

Language: JavaScript - Size: 159 MB - Last synced: 8 days ago - Pushed: about 1 year ago - Stars: 53 - Forks: 23

liquidinvestigations/hoover-snoop2

Processing system for the search engine service in Liquid Investigations.

Language: Python - Size: 1.74 MB - Last synced: 26 days ago - Pushed: about 1 month ago - Stars: 6 - Forks: 5

catalyst/moodle-search_elastic

An Elasticsearch engine plugin for Moodle's Global Search

Language: PHP - Size: 1.35 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 13 - Forks: 15

httpreserve/tikalinkextract

Tika based link (URL) extractor for httpreserve

Language: HTML - Size: 171 MB - Last synced: 2 days ago - Pushed: almost 3 years ago - Stars: 8 - Forks: 1

lagenorhynque/tika

git diff settings for Microsoft Office files

Language: Shell - Size: 65.8 MB - Last synced: 26 days ago - Pushed: over 6 years ago - Stars: 10 - Forks: 1

whentotrade/Noggle.TikaOnDotNet

.NET Tika Wrapper

Language: Rich Text Format - Size: 95.1 MB - Last synced: 12 days ago - Pushed: almost 5 years ago - Stars: 2 - Forks: 1

sergeyt/pandora

A small Pandora's box for prototyping your app with a ready-to-use backend. This is just my compilation of different solutions occasionally applied in hackathons and challenges.

Language: Go - Size: 1.82 MB - Last synced: 30 days ago - Pushed: 3 months ago - Stars: 26 - Forks: 8

luisbalru/Information-Retrieval

Language: Java - Size: 2.02 MB - Last synced: 9 months ago - Pushed: over 5 years ago - Stars: 1 - Forks: 1

sbelassa/SMIR

smart multimodal information retrieval project

Language: HTML - Size: 26.2 MB - Last synced: 9 months ago - Pushed: about 7 years ago - Stars: 0 - Forks: 0

Journalisme-UQAM/extractionPDF

Three ways to extract text from PDF files using Python

Language: Python - Size: 16.6 KB - Last synced: 9 months ago - Pushed: about 4 years ago - Stars: 1 - Forks: 1

khanium/couchbase-fts-binary

Demo project for uploading binary documents into Couchbase and indexing their metadata & content

Language: JavaScript - Size: 21.7 MB - Last synced: 9 months ago - Pushed: over 1 year ago - Stars: 3 - Forks: 3

public-law/oregon-law-parser

Distill information about amendments to the Oregon Revised Statutes.

Language: Haskell - Size: 50.1 MB - Last synced: 9 days ago - Pushed: 7 months ago - Stars: 17 - Forks: 3

puthurr/tika-docker

Contains a custom Tika 1.x server Docker image.

Language: Dockerfile - Size: 245 MB - Last synced: 10 months ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0

ipfs-search/ipfs-tika 📦

Java web application taking IPFS hashes, extracting (textual) content and metadata through Apache's Tika.

Language: Java - Size: 52.7 KB - Last synced: 20 minutes ago - Pushed: over 2 years ago - Stars: 30 - Forks: 5

hungneox/tika-php

A PHP client for Apache Tika

Language: PHP - Size: 11.7 KB - Last synced: 10 months ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0

procesaur/TExASe

Flask application for OCR and extraction of text from documents with support for repository applications

Language: Python - Size: 14.7 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 1 - Forks: 0

thecogworks/Cogworks.ExamineFileIndexer

An examine indexer that uses Apache Tika.

Language: C# - Size: 23.1 MB - Last synced: 10 months ago - Pushed: over 1 year ago - Stars: 7 - Forks: 6

CogStack/CogStack-Pipeline 📦

Distributed, fault tolerant batch processing for Natural Language Applications and Search, using remote partitioning

Language: Java - Size: 25.6 MB - Last synced: 9 months ago - Pushed: over 1 year ago - Stars: 41 - Forks: 13

ropensci/rtika

R Interface to Apache Tika

Language: R - Size: 133 MB - Last synced: 3 months ago - Pushed: about 1 year ago - Stars: 54 - Forks: 8

schopenhauer/tikka

Flask-based file drop on steroids, powered by Apache Tika

Language: Python - Size: 4.88 KB - Last synced: 9 months ago - Pushed: almost 2 years ago - Stars: 1 - Forks: 0

codingstar77/Automated-College-Result-Management-System-

Automatically parses PDF results provided by Pune University into a database, generates reports, and notifies students of their results by email

Language: Java - Size: 504 KB - Last synced: about 1 year ago - Pushed: about 6 years ago - Stars: 2 - Forks: 1

catalyst/moodle-search_postgresfulltext

Moodle search engine implemented using Postgres full text indexing

Language: PHP - Size: 51.8 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 0 - Forks: 7

scotthaleen/py-tika-socket-server

Language: Clojure - Size: 133 KB - Last synced: about 1 year ago - Pushed: over 8 years ago - Stars: 0 - Forks: 1

Sotera/newman

Quickly analyze and explore email with advanced analytics and visualization.

Language: JavaScript - Size: 266 MB - Last synced: 9 months ago - Pushed: over 2 years ago - Stars: 50 - Forks: 14

mixpeek/top-ocr-libraries

Most popular open source OCR libraries listed by accuracy and speed

Size: 4.88 KB - Last synced: 2 months ago - Pushed: over 1 year ago - Stars: 1 - Forks: 1

krish-kunal/task

Helps parse bank statements (PDF)

Language: Python - Size: 34.4 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 3 - Forks: 0

izveigor/X-MAS-HACK

A web application that predicts a document's type from its content 📝

Language: TypeScript - Size: 883 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 2 - Forks: 0

cloudogu/spotter

Content-Type and language recognition library

Language: Java - Size: 246 KB - Last synced: 26 days ago - Pushed: 9 months ago - Stars: 4 - Forks: 2

Anthonyive/DSCI-550-Assignment-1 📦

📧 Analysis of Cyber Phishing Emails: Fraudulent Emails and Social Engineering.

Language: Jupyter Notebook - Size: 70.4 MB - Last synced: about 2 months ago - Pushed: about 3 years ago - Stars: 5 - Forks: 2

Anthonyive/DSCI-550-Assignment-2 📦

👨‍🦰 Large Scale Active Social Engineering Defense (ASED): Multimedia and Social Engineering

Language: HTML - Size: 154 MB - Last synced: about 2 months ago - Pushed: about 3 years ago - Stars: 6 - Forks: 2

mkalus/tika-page-extractor 📦

Tika per page PDF extractor server returning content as JSON.

Language: Java - Size: 19.5 KB - Last synced: about 1 year ago - Pushed: about 8 years ago - Stars: 6 - Forks: 3

chrismattmann/trec-dd-polar

A dataset downloaded from the deep and scientific web across three major Polar data centers for use in research.

Language: Shell - Size: 85 KB - Last synced: 8 days ago - Pushed: over 6 years ago - Stars: 13 - Forks: 7

TheoGicquel/L3-IrisaParser

Parse scientific papers using python

Language: Python - Size: 249 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

chrismattmann/drat

The Distributed Release Audit Tool (DRAT) for code analysis and verification.

Language: JavaScript - Size: 94.7 MB - Last synced: 8 days ago - Pushed: 10 months ago - Stars: 8 - Forks: 1

sarbanandabhikkhu/DhammaChakka

Early Buddhist texts from the Tipitaka (Tripitaka). Suttas (sutras) with the Buddha's teachings on mindfulness, insight, wisdom, and meditation.

Language: JavaScript - Size: 6.31 MB - Last synced: 10 months ago - Pushed: 10 months ago - Stars: 0 - Forks: 0

nguyenhiepvan/tika_server_forever Fork of vuthaihoc/tika_server_forever

Run tika server forever with health check process

Language: Shell - Size: 76.7 MB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 1 - Forks: 0

lguberan/LuceneFx

Tiny unofficial javafx demo application for Apache's Lucene and Tika.

Language: Java - Size: 79.1 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0

jettdc/semester-search

Semester Search is a utility for quickly searching through downloadable class materials so that you can spend more time learning and less time clicking through dozens of links on your professors' websites.

Language: Go - Size: 66.5 MB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 0 - Forks: 0

jwo29/spring-boot-camunda

spring-boot-camunda

Language: Java - Size: 741 KB - Last synced: about 1 year ago - Pushed: about 2 years ago - Stars: 0 - Forks: 0

chrisbratlien/aws-bucketeer

Apache Solr/Tika index/search plus SHA256 content-based addressing for files stored into AWS S3 buckets

Language: PHP - Size: 150 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0

EricLondon/Docker-Rails-Tika-Elasticsearch

Docker Rails Tika Elasticsearch

Language: Ruby - Size: 147 KB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 1 - Forks: 0

Slvkelevra/information-retrieval-system

Information retrieval system for documents.

Language: HTML - Size: 78.9 MB - Last synced: 8 months ago - Pushed: about 2 years ago - Stars: 0 - Forks: 0

graboskyc/MQTTtoRealm

A C# console app that acts as an MQTT broker and writes messages to MongoDB Realm

Language: C# - Size: 116 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0

wbicode/TikaService-Installer

A Windows Installer (MSI) for the Windows service wrapper of the Tika JSR 311 network server.

Language: C# - Size: 80.1 KB - Last synced: about 1 year ago - Pushed: about 2 years ago - Stars: 1 - Forks: 0

FrodeRanders/disksearch

Indexes a directory hierarchy and provides a crude search interface onto that index

Language: Java - Size: 25.4 KB - Last synced: 16 days ago - Pushed: 2 months ago - Stars: 1 - Forks: 0

opensemanticsearch/tesseract-ocr-cache

Tesseract OCR wrapper for Apache Tika and/or Open Semantic ETL that caches OCR results, so that Tika-Server or Open Semantic ETL does not have to rerun slow and expensive OCR on the same images

Language: Python - Size: 32.2 KB - Last synced: 6 months ago - Pushed: over 2 years ago - Stars: 5 - Forks: 1

puthurr/tika-fork Fork of apache/tika

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

Language: Java - Size: 227 MB - Last synced: 10 months ago - Pushed: 10 months ago - Stars: 0 - Forks: 0

opensemanticsearch/tika-server.deb

Apache Tika Server as Debian GNU/Linux and Ubuntu Linux package

Language: Dockerfile - Size: 47.4 MB - Last synced: 6 months ago - Pushed: over 1 year ago - Stars: 5 - Forks: 8

mrspaceman/elibraryserver

Language: Java - Size: 4.88 KB - Last synced: 24 days ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0