Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: tokenizers

Anush008/tokenizers

Multi-arch bindings for @huggingface/tokenizers.

Language: Rust - Size: 893 KB - Last synced: 6 days ago - Pushed: 8 months ago - Stars: 4 - Forks: 1

jshuadvd/LongRoPE

Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"

Language: Python - Size: 34.5 MB - Last synced: 25 days ago - Pushed: 25 days ago - Stars: 50 - Forks: 8

s2458588/wsm-tokenizer

Bachelor Thesis Repository. Wsm-tokenizer (word shape mapping) uses vocabulary comparisons to find probable morphemes in lexemic tokens.

Language: Jupyter Notebook - Size: 2.41 MB - Last synced: 25 days ago - Pushed: over 1 year ago - Stars: 0 - Forks: 0

kojix2/blingfire-crystal

Language: Crystal - Size: 49.8 KB - Last synced: about 1 month ago - Pushed: about 1 year ago - Stars: 2 - Forks: 0

helena-intel/test-prompt-generator

Create prompts with a given token length for testing LLMs and other transformers text models.

Language: Python - Size: 171 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0

Prismadic/magnet

the small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly

Language: Python - Size: 8.82 MB - Last synced: 21 days ago - Pushed: about 2 months ago - Stars: 18 - Forks: 1

xebia-functional/xef

Building applications with LLMs through composability, in Kotlin, Scala, ...

Language: Kotlin - Size: 12.3 MB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 161 - Forks: 16

symanto-research/merge-tokenizers

Package to align tokens from different tokenizations.

Language: Python - Size: 347 KB - Last synced: 28 days ago - Pushed: about 2 months ago - Stars: 0 - Forks: 0

Hugging-Face-Supporter/tftokenizers

Use Hugging Face Transformers and Tokenizers as reusable TensorFlow SavedModels

Language: Python - Size: 263 KB - Last synced: 3 months ago - Pushed: about 2 years ago - Stars: 5 - Forks: 2

DanielPFlorian/Transformers-Github-Semantic-Search

NLP Dataset Creation and Semantic Search Demonstration

Language: Jupyter Notebook - Size: 18.6 KB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 0 - Forks: 0

OmkarBorhade98/Text_Summarization

Text Summarization using NLP

Language: Jupyter Notebook - Size: 116 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0

wenbingl/tfmtok

A C/C++ tokenizer library for transformer models

Language: C++ - Size: 8.61 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 1 - Forks: 0

megagonlabs/ginza-transformers

Use custom tokenizers in spacy-transformers

Language: Python - Size: 32.2 KB - Last synced: 4 months ago - Pushed: almost 2 years ago - Stars: 16 - Forks: 4

mickymultani/LLM-Architecture

Visualize some important concepts related to LLM architectures.

Language: Jupyter Notebook - Size: 9.74 MB - Last synced: 5 months ago - Pushed: 7 months ago - Stars: 1 - Forks: 0

victoryosiobe/kingchop

Kingchop ⚔️ is a JavaScript library for tokenizing (chopping) English text. It applies an extensive set of tokenization rules, which you can adjust easily.

Language: JavaScript - Size: 19.5 KB - Last synced: about 1 month ago - Pushed: 4 months ago - Stars: 1 - Forks: 0

Beomi/megatronlm_dataset_autotokenizer

Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.

Language: Python - Size: 498 KB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 2 - Forks: 0

unfoldingWord/string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation

Language: JavaScript - Size: 2.14 MB - Last synced: 6 days ago - Pushed: 10 months ago - Stars: 8 - Forks: 1

arturom/search-analysis

A graphical user interface for the Elasticsearch Analyze API

Language: JavaScript - Size: 4.67 MB - Last synced: 8 months ago - Pushed: 8 months ago - Stars: 5 - Forks: 0

sayakpaul/count-tokens-hf-datasets

This project shows how to count the total number of training tokens in a large text dataset from 🤗 Datasets using Apache Beam and Dataflow.

Language: Python - Size: 19.5 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 15 - Forks: 1

jungsoh/transformers-question-answering

Fine-tuning pre-trained transformer models in TensorFlow and PyTorch for question answering

Language: Jupyter Notebook - Size: 379 MB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 1 - Forks: 0

Matesxs/CodeTransformer

Language: Python - Size: 54.7 KB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0
