Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: ai-alignment

RLHFlow/Directional-Preference-Alignment

Directional Preference Alignment

Size: 1.82 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 31 - Forks: 1

MinghuiChen43/awesome-trustworthy-deep-learning

A curated list of trustworthy deep learning papers, updated daily.

Size: 6.58 MB - Last synced: 7 days ago - Pushed: 11 days ago - Stars: 290 - Forks: 32

IQTLabs/daisybell

Scan your AI/ML models for problems before you put them into production.

Language: Python - Size: 2.87 MB - Last synced: 14 days ago - Pushed: 15 days ago - Stars: 11 - Forks: 7

tomekkorbak/pretraining-with-human-feedback

Code accompanying the paper "Pretraining Language Models with Human Preferences"

Language: Python - Size: 135 KB - Last synced: 6 days ago - Pushed: 4 months ago - Stars: 167 - Forks: 14

agencyenterprise/PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

Language: Python - Size: 222 KB - Last synced: 17 days ago - Pushed: 3 months ago - Stars: 274 - Forks: 27

dit7ya/awesome-ai-alignment

A curated list of awesome resources for getting started with, and staying in touch with, Artificial Intelligence Alignment research.

Size: 24.4 KB - Last synced: 8 days ago - Pushed: 11 months ago - Stars: 57 - Forks: 9

Giskard-AI/awesome-ai-safety

📚 A curated list of papers & technical articles on AI Quality & Safety

Size: 58.6 KB - Last synced: 20 days ago - Pushed: 8 months ago - Stars: 135 - Forks: 10

riceissa/aiwatch

Website to track people, organizations, and products (tools, websites, etc.) in AI safety

Language: HTML - Size: 1.95 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 19 - Forks: 6

EzgiKorkmaz/adversarial-reinforcement-learning

Reading list for adversarial perspective and robustness in deep reinforcement learning.

Size: 15.6 KB - Last synced: about 2 months ago - Pushed: 8 months ago - Stars: 74 - Forks: 3

phelps-sg/llm-cooperation

Code and materials for the paper: S. Phelps and Y. I. Russell, "Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics", working paper, arXiv:2305.07970, May 2023

Language: Python - Size: 10.8 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 9 - Forks: 2

wesg52/sparse-probing-paper

Full code for the sparse probing paper.

Language: Jupyter Notebook - Size: 50.9 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 29 - Forks: 8

EveryOneIsGross/sinewCHAT

sinewCHAT uses instanced chatbots to emulate neural nodes, enriching and generating positively weighted responses.

Language: Python - Size: 11.7 KB - Last synced: 11 months ago - Pushed: 11 months ago - Stars: 0 - Forks: 0

EveryOneIsGross/areteCHAT

A persona chatbot based on the VIA Character Strengths. It reads emotional tone and summons the appropriate virtue to respond.

Language: Python - Size: 0 Bytes - Last synced: 11 months ago - Pushed: 11 months ago - Stars: 0 - Forks: 0

EveryOneIsGross/bbBOT

bbBOT is a flexible, persona-based, branching binary sentiment chatbot.

Language: Python - Size: 13.7 KB - Last synced: 11 months ago - Pushed: 11 months ago - Stars: 0 - Forks: 0

liondw/Signal-Alignment

An initiative to create concise, widely shareable educational resources, infographics, and animated explainers on the latest contributions to the community AI alignment effort, boosting the signal and moving the community toward finding and building solutions.

Size: 25.8 MB - Last synced: 11 months ago - Pushed: 11 months ago - Stars: 14 - Forks: 0

ai-fail-safe/safe-reward

A prototype AI safety library that allows an agent to maximize its reward only by solving a puzzle, in order to prevent the worst-case outcomes of perverse instantiation.

Language: Python - Size: 35.2 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 8 - Forks: 0

ai-fail-safe/mulligan

A library designed to shut down an agent exhibiting unexpected behavior, providing a potential "mulligan" to human civilization; IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN.

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

ai-fail-safe/gene-drive

A project to ensure that all child processes created by an agent "inherit" the agent's safety controls.

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

ai-fail-safe/honeypot

A project to detect environment tampering on the part of an agent.

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

ai-fail-safe/life-span

A project to ensure an artificial agent will eventually reach the end of its existence.

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

rmoehn/farlamp

IDA with RL and overseer failures

Language: TeX - Size: 5.39 MB - Last synced: about 1 year ago - Pushed: almost 3 years ago - Stars: 8 - Forks: 0

rmoehn/amplification Fork of paulfchristiano/amplification

An implementation of iterated distillation and amplification

Language: Python - Size: 86.9 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 8 - Forks: 2

rmoehn/jursey

Q&A system with reflection and automation, similar to Patchwork, Affable, Mosaic

Language: Clojure - Size: 684 KB - Last synced: about 1 year ago - Pushed: about 5 years ago - Stars: 3 - Forks: 0

riceissa/miri-top-contributors

Language: HTML - Size: 104 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

Related Keywords
ai-alignment (24), ai-safety (16), fail-safe (5), failsafe (5), anomaly-detection (3), ida (3), robustness (2), chatbot-framework (2), language-models (2), adversarial-attacks (2), gpt-3 (2), machine-learning (2), ml-safety (2), awesome (2), awesome-list (2), ai (2), llm (2), ml (2), adversarial-machine-learning (2), rlhf (2), large-language-models (2), sql (1), principal-agent-problem (1), gpt-4 (1), gametheory (1), experimental-psychology (1), experimental-economics (1), economics (1), behavioral-economics (1), safe-rlhf (1), safe-reinforcement-learning (1), robust-reinforcement-learning (1), robust-machine-learning (1), robust-adversarial-reinforcement-learning (1), responsible-ai (1), reinforcement-learning-safety (1), reinforcement-learning-generalization (1), multiagent-reinforcement-learning (1), meta-reinforcement-learning (1), machine-learning-safety (1), explainable-rl (1), pretraining (1), education (1), design (1), tree-structure (1), python-ai (1), openai-chatgpt (1), research-project (1), openai (1), virtues (1), via (1), sentiment-analysis (1), supervised-learning (1), chatbot (1), transformer (1), rrn (1), openai-api (1), datomic (1), mechanistic-interpretability (1), interpretability (1), social-dilemmas (1), factored-cognition (1), hch (1), reflection (1), prisoners-dilemma (1), donations-list-website (1), gpt (1), decision-transformers (1), model-poison (1), cybersecurity (1), bias-detection (1), bias-correction (1), ai-assurance (1), watermarking (1), uncertainty (1), security (1), privacy (1), poisoning (1), ownership (1), out-of-distribution-generalization (1), membership-inference-attack (1), machine-unlearning (1), interpretable-deep-learning (1), hallucinations (1), green-ai (1), gradient-leakage (1), fairness (1), deep-learning (1), causality (1), backdoor (1), explainable-machine-learning (1), deep-reinforcement-learning (1), adversarial-reinforcement-learning (1), adversarial-policies (1), php (1), mysql (1), dataset (1), database (1), data-portal (1), aisafety (1)