Ecosyste.ms: Repos

An open API service providing repository metadata for many open source software ecosystems.

GitHub topics: ai-safety
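
The listing below can also be retrieved programmatically from the Repos API. The following is a minimal Python sketch, assuming a topic endpoint of the form /api/v1/topics/{topic} and response fields named repositories, full_name, and stargazers_count; these names are assumptions, so consult the OpenAPI documentation at repos.ecosyste.ms for the exact paths and schema.

# Minimal sketch: fetch repositories for a GitHub topic from the Ecosyste.ms Repos API.
# Endpoint path and response field names are assumptions; check the service's OpenAPI docs.
import requests

BASE_URL = "https://repos.ecosyste.ms/api/v1"

def repos_for_topic(topic: str, per_page: int = 30) -> list[dict]:
    """Return repository records tagged with the given topic (assumed endpoint shape)."""
    resp = requests.get(
        f"{BASE_URL}/topics/{topic}",
        params={"per_page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    # The topic payload is assumed to nest its repositories under a "repositories" key.
    return resp.json().get("repositories", [])

if __name__ == "__main__":
    for repo in repos_for_topic("ai-safety"):
        # Field names such as "full_name" and "stargazers_count" are assumed.
        print(repo.get("full_name"), repo.get("stargazers_count"))

Pagination beyond the first page, rate limits, and any authentication are left out of this sketch.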

StampyAI/stampy-ui

AI Safety Q&A web frontend

Language: TypeScript - Size: 19.5 MB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 31 - Forks: 8

Nkluge-correa/Aira

Aira is a series of chatbots developed as an experimentation playground for value alignment.

Language: Jupyter Notebook - Size: 682 MB - Last synced: about 13 hours ago - Pushed: about 15 hours ago - Stars: 4 - Forks: 0

WindVChen/VCO-AP

A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.

Language: Python - Size: 21.9 MB - Last synced: 5 days ago - Pushed: 5 days ago - Stars: 8 - Forks: 0

yyy01/PAC

The official implementation of the paper "Data Contamination Calibration for Black-box LLMs" (ACL 2024)

Language: Python - Size: 210 KB - Last synced: 7 days ago - Pushed: 7 days ago - Stars: 7 - Forks: 0

jphall663/awesome-machine-learning-interpretability

A curated list of awesome responsible machine learning resources.

Size: 1.56 MB - Last synced: 9 days ago - Pushed: 11 days ago - Stars: 3,466 - Forks: 575

IQTLabs/daisybell

Scan your AI/ML models for problems before you put them into production.

Language: Python - Size: 2.87 MB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 11 - Forks: 7

tomekkorbak/pretraining-with-human-feedback

Code accompanying the paper "Pretraining Language Models with Human Preferences"

Language: Python - Size: 135 KB - Last synced: 6 days ago - Pushed: 3 months ago - Stars: 167 - Forks: 14

IQTLabs/aia-platform 📦

Hardened AI Assurance reference platform

Language: Python - Size: 68.4 KB - Last synced: 14 days ago - Pushed: over 1 year ago - Stars: 0 - Forks: 1

dynaroars/neuralsat

DPLL(T)-based verification tool for DNNs

Language: Python - Size: 208 MB - Last synced: 14 days ago - Pushed: 14 days ago - Stars: 9 - Forks: 0

dynaroars/vnncomp-benchmark-generation

Language: Python - Size: 136 MB - Last synced: 16 days ago - Pushed: 16 days ago - Stars: 0 - Forks: 0

agencyenterprise/PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

Language: Python - Size: 222 KB - Last synced: 17 days ago - Pushed: 3 months ago - Stars: 274 - Forks: 27

Giskard-AI/giskard

🐢 Open-Source Evaluation & Testing framework for LLMs and ML models

Language: Python - Size: 176 MB - Last synced: 20 days ago - Pushed: 22 days ago - Stars: 3,163 - Forks: 199

normster/llm_rules

RuLES: a benchmark for evaluating rule-following in language models

Language: Python - Size: 2.82 MB - Last synced: 6 days ago - Pushed: 6 days ago - Stars: 194 - Forks: 13

Dunchead/ai-safety

Mapping AI risks and possible solutions

Language: JavaScript - Size: 138 KB - Last synced: 20 days ago - Pushed: 21 days ago - Stars: 2 - Forks: 0

jacksonkarel/recursive-other-improvement

Language: Jupyter Notebook - Size: 242 KB - Last synced: 20 days ago - Pushed: 20 days ago - Stars: 7 - Forks: 1

dit7ya/awesome-ai-alignment

A curated list of awesome resources for getting started with and staying in touch with Artificial Intelligence Alignment research.

Size: 24.4 KB - Last synced: 8 days ago - Pushed: 11 months ago - Stars: 57 - Forks: 9

Giskard-AI/awesome-ai-safety

📚 A curated list of papers & technical articles on AI Quality & Safety

Size: 58.6 KB - Last synced: 20 days ago - Pushed: 8 months ago - Stars: 135 - Forks: 10

AlexTMjugador/redwoodresearch-interp-docker

📦 Redwood Research's transformer interpretability tools, conveniently packaged in a Docker container for simple and reproducible deployments.

Language: Dockerfile - Size: 5.86 KB - Last synced: 26 days ago - Pushed: about 1 month ago - Stars: 0 - Forks: 0

PAIR-code/farsight

In situ interactive widgets for responsible AI 🌱

Language: TypeScript - Size: 21 MB - Last synced: 25 days ago - Pushed: 3 months ago - Stars: 10 - Forks: 2

tamlhp/awesome-privex

Awesome PrivEx: Privacy-Preserving Explainable AI (PPXAI)

Size: 439 KB - Last synced: about 1 month ago - Pushed: about 1 month ago - Stars: 17 - Forks: 0

PKU-Alignment/safe-rlhf

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

Language: Python - Size: 4.01 MB - Last synced: about 1 month ago - Pushed: about 2 months ago - Stars: 1,137 - Forks: 92

ryoungj/ToolEmu

A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use

Language: Python - Size: 4.05 MB - Last synced: about 1 month ago - Pushed: 2 months ago - Stars: 85 - Forks: 7

riceissa/aiwatch

Website to track people, organizations, and products (tools, websites, etc.) in AI safety

Language: HTML - Size: 1.95 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 19 - Forks: 6

EzgiKorkmaz/adversarial-reinforcement-learning

Reading list for adversarial perspective and robustness in deep reinforcement learning.

Size: 15.6 KB - Last synced: about 2 months ago - Pushed: 8 months ago - Stars: 74 - Forks: 3

microsoft/SafeNLP

Safety Score for Pre-Trained Language Models

Language: Python - Size: 1.2 MB - Last synced: about 2 months ago - Pushed: 7 months ago - Stars: 82 - Forks: 7

zhoumingyi/CustomDLCoder

Code for our paper "Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models", accepted at ISSTA '24

Language: Python - Size: 155 MB - Last synced: about 2 months ago - Pushed: about 2 months ago - Stars: 1 - Forks: 0

hendrycks/ethics

Aligning AI With Shared Human Values (ICLR 2021)

Language: Python - Size: 351 KB - Last synced: 2 months ago - Pushed: about 1 year ago - Stars: 201 - Forks: 31

ztjona/ztjona.github.io

My personal website.

Language: HTML - Size: 44.5 MB - Last synced: 2 months ago - Pushed: 2 months ago - Stars: 0 - Forks: 0

SafeAILab/RAIN

[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning

Language: Python - Size: 286 KB - Last synced: 2 months ago - Pushed: 4 months ago - Stars: 51 - Forks: 3

phelps-sg/llm-cooperation

Code and materials for the paper: S. Phelps and Y. I. Russell, "Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics", working paper, arXiv:2305.07970, May 2023

Language: Python - Size: 10.8 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 9 - Forks: 2

ShengranHu/Thought-Cloning

[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

Language: Python - Size: 12.6 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 223 - Forks: 21

Nkluge-correa/Model-Library

The Model Library is a project that maps the risks associated with modern machine learning systems.

Language: Python - Size: 495 KB - Last synced: about 1 month ago - Pushed: about 2 months ago - Stars: 1 - Forks: 1

gserapio/intersectional-ai-safety

R code for "Intersectionality in Conversational AI Safety: How Bayesian Multilevel Models Help Understand Diverse Perceptions of Safety"

Language: R - Size: 26.4 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0

dlmacedo/distinction-maximization-loss

A project to improve out-of-distribution detection (open set recognition) and uncertainty estimation by changing only a few lines of code in your project. Inference stays efficient (no added inference time), with no repetitive model training, hyperparameter tuning, or additional data collection required.

Language: Python - Size: 2.45 MB - Last synced: 2 months ago - Pushed: over 1 year ago - Stars: 45 - Forks: 4

WindVChen/DiffAttack

An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.

Language: Python - Size: 93.8 MB - Last synced: 3 months ago - Pushed: 3 months ago - Stars: 98 - Forks: 10

cure-lab/ContraNet

This is the official implementation of ContraNet (NDSS2022).

Language: Python - Size: 39.2 MB - Last synced: 14 days ago - Pushed: 9 months ago - Stars: 17 - Forks: 2

ai4ce/FLAT

[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory

Language: Python - Size: 48.9 MB - Last synced: 3 months ago - Pushed: almost 2 years ago - Stars: 63 - Forks: 11

wesg52/universal-neurons

Universal Neurons in GPT2 Language Models

Language: Jupyter Notebook - Size: 24.5 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 7 - Forks: 1

levitation-opensource/ai-safety-gridworlds Fork of google-deepmind/ai-safety-gridworlds

Extended, multi-agent and multi-objective environments based on DeepMind's AI Safety Gridworlds. This is a suite of reinforcement learning environments illustrating various safety properties of intelligent agents, made compatible with OpenAI's Gym/Gymnasium and the Farama Foundation's PettingZoo.

Language: Python - Size: 1.26 MB - Last synced: 4 months ago - Pushed: 5 months ago - Stars: 6 - Forks: 0

oscaem/preparedness-challenge

A proof of concept showing how contemporary AI models could be misused to influence public perception, highlighting the need to engineer robust defenses against such threats to safeguard our political systems. Entry for the OpenAI Preparedness Challenge.

Size: 401 KB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0

romaingrx/Second-Order-Jailbreak

NeurIPS workshop: We examine the risk of powerful malignant intelligent actors spreading their influence over networks of agents with varying intelligence and motivations.

Language: Python - Size: 13.6 MB - Last synced: 30 days ago - Pushed: 6 months ago - Stars: 5 - Forks: 0

LuanAdemi/toumei

An interpretability library for PyTorch

Language: Python - Size: 14.3 MB - Last synced: 5 months ago - Pushed: over 1 year ago - Stars: 9 - Forks: 2

neelsoumya/ai_outreach

Resources for explaining AI to the public and for outreach activities

Size: 7.99 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 1 - Forks: 1

wesg52/sparse-probing-paper

Sparse probing paper full code.

Language: Jupyter Notebook - Size: 50.9 MB - Last synced: 5 months ago - Pushed: 5 months ago - Stars: 29 - Forks: 8

tigerlab-ai/tiger

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

Language: Jupyter Notebook - Size: 1.84 MB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 335 - Forks: 23

CDEIUK/bias-mitigation

Machine Learning Bias Mitigation

Language: Jupyter Notebook - Size: 74.4 MB - Last synced: 4 months ago - Pushed: about 2 years ago - Stars: 6 - Forks: 6

jehumtine/LAWLIA

LAWLIA is an open-source computational legal framework designed to revolutionize legal reasoning and analysis. It combines the power of large language models with a structured grammar to facilitate precise legal assessments, truth values, and verdicts. LAWLIA is the future of computational jurisprudence.

Language: Python - Size: 122 KB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 7 - Forks: 1

google-research-datasets/aart-ai-safety-dataset

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

Size: 214 KB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0

HorizonEventsAgency/tracker

Automated tracking of events related to AI safety

Size: 1.95 KB - Last synced: 6 months ago - Pushed: 9 months ago - Stars: 1 - Forks: 0

mccaffary/AGI-safety-governance-practices

Analysis of the survey "Towards best practices in AGI safety and governance: A survey of expert opinion"

Language: Jupyter Notebook - Size: 3.69 MB - Last synced: about 2 months ago - Pushed: 10 months ago - Stars: 8 - Forks: 2

kevinrobinson-at-elgoog/aart-ai-safety-dataset

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

Size: 292 KB - Last synced: 6 months ago - Pushed: 6 months ago - Stars: 0 - Forks: 0

campbellborder/spar-aaron-dolphin

Language: Jupyter Notebook - Size: 25.9 MB - Last synced: 4 months ago - Pushed: 4 months ago - Stars: 0 - Forks: 0

thesofakillers/nlgoals

Official repository for my MSc thesis: "Addressing Goal Misgeneralization with Natural Language Interfaces."

Language: TeX - Size: 35.7 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 1 - Forks: 0

PKU-Alignment/beavertails

BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).

Language: Makefile - Size: 2.33 MB - Last synced: 7 months ago - Pushed: 7 months ago - Stars: 42 - Forks: 1

ebagdasa/mithridates

Measure and Boost Backdoor Robustness

Language: Jupyter Notebook - Size: 1.13 MB - Last synced: 28 days ago - Pushed: 28 days ago - Stars: 6 - Forks: 3

Omegastick/credit-hacking

Eliciting credit hacking behaviours in large language models

Language: Python - Size: 1.12 MB - Last synced: 9 months ago - Pushed: 9 months ago - Stars: 0 - Forks: 0

ongov/AI-Principles

Alpha principles for the ethical use of AI and data-driven technologies in Ontario (also published in French)

Size: 66.4 KB - Last synced: about 1 month ago - Pushed: almost 3 years ago - Stars: 24 - Forks: 5

EffectiveAltruismUCT/indabaX-ai-safety-workshop-2023

IndabaX AI Safety Workshop 2023

Size: 7.74 MB - Last synced: 6 days ago - Pushed: 11 months ago - Stars: 1 - Forks: 0

megvii-research/FSSD_OoD_Detection

Feature Space Singularity for Out-of-Distribution Detection. (SafeAI 2021)

Language: Python - Size: 477 KB - Last synced: 10 months ago - Pushed: over 3 years ago - Stars: 73 - Forks: 11

esbenkc/benchmarks

📊 Benchmarking the safety of AI systems

Language: Jupyter Notebook - Size: 3.91 KB - Last synced: 30 days ago - Pushed: 11 months ago - Stars: 1 - Forks: 0

Jakobovski/ai-safety-cheatsheet

A compilation of AI safety ideas, problems and solutions.

Size: 8.79 KB - Last synced: 10 months ago - Pushed: about 1 year ago - Stars: 7 - Forks: 0

governanceai/AGI-safety-and-governance-practices Fork of mccaffary/AGI-safety-governance-practices

Analysis of the survey "Towards best practices in AGI safety and governance: A survey of expert opinion"

Size: 4.88 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 0 - Forks: 0

lancopku/Avg-Avg

[Findings of EMNLP 2022] Holistic Sentence Embeddings for Better Out-of-Distribution Detection

Language: Python - Size: 46.9 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 11 - Forks: 3

cool-RR/stubborn

Stubborn: An Environment for Evaluating Stubbornness between Agents with Aligned Incentives

Language: Python - Size: 255 KB - Last synced: 25 days ago - Pushed: 12 months ago - Stars: 1 - Forks: 0

endlessloop2/UC-AI-Thinkathon-2023

Winning entry for the UC Chile AI Safety Thinkathon 2023. Co-authored with @mon-b.

Language: R - Size: 58.1 MB - Last synced: 7 months ago - Pushed: about 1 year ago - Stars: 2 - Forks: 0

lancopku/DAN

[Findings of EMNLP 2022] Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks

Language: Python - Size: 17.6 KB - Last synced: about 1 year ago - Pushed: about 1 year ago - Stars: 5 - Forks: 0

yardenas/la-mbda

LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization

Language: Python - Size: 49.8 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 19 - Forks: 11

imrehg/cdei-development 📦

Machine Learning Bias Exploration

Language: Jupyter Notebook - Size: 49.8 MB - Last synced: about 2 months ago - Pushed: almost 4 years ago - Stars: 1 - Forks: 1

dlmacedo/entropic-out-of-distribution-detection

A project to add scalable, state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code. Inference stays efficient (no added inference time), and detection comes with no drop in classification accuracy, no hyperparameter tuning, and no additional data collection.

Language: Python - Size: 7.71 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 51 - Forks: 8

ai-fail-safe/safe-reward

A prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle, in order to prevent the worst-case outcomes of perverse instantiation

Language: Python - Size: 35.2 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 8 - Forks: 0

ai-fail-safe/mulligan

A library designed to shut down an agent exhibiting unexpected behavior, providing a potential "mulligan" to human civilization; IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

ai-fail-safe/gene-drive

A project to ensure that all child processes created by an agent "inherit" the agent's safety controls

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

ai-fail-safe/honeypot

A project to detect environment tampering on the part of an agent

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

ai-fail-safe/life-span

A project to ensure an artificial agent will eventually reach the end of its existence

Size: 1.95 KB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 0

ravipatelxyz/nlp-ethics

In-depth evaluation of the ETHICS utilitarianism task dataset. An assessment of approaches to improved interpretability (SHAP, Bayesian transformers).

Language: Jupyter Notebook - Size: 26.6 MB - Last synced: about 2 months ago - Pushed: almost 3 years ago - Stars: 2 - Forks: 1

danielmamay/nlp-ethics Fork of ravipatelxyz/nlp-ethics

In-depth evaluation of the ETHICS utilitarianism task dataset. An assessment of approaches to improved interpretability (SHAP, Bayesian transformers).

Size: 26.6 MB - Last synced: about 2 months ago - Pushed: almost 3 years ago - Stars: 0 - Forks: 0

lasgroup/safe-adaptation-agents

Implementation of adaptive constrained RL algorithms. Child repository of @lasgroup/safe-adaptation-gym

Language: Python - Size: 19.8 MB - Last synced: about 1 year ago - Pushed: over 1 year ago - Stars: 1 - Forks: 1

zmahoor/TPR-1.0

Language: C++ - Size: 925 MB - Last synced: 10 months ago - Pushed: over 6 years ago - Stars: 1 - Forks: 0

RongRG/saferRL

An educational resource to help anyone learn safe reinforcement learning, inspired by openai/spinningup

Language: Python - Size: 112 KB - Last synced: about 1 year ago - Pushed: about 2 years ago - Stars: 1 - Forks: 0

rmoehn/amplification Fork of paulfchristiano/amplification

An implementation of iterated distillation and amplification

Language: Python - Size: 86.9 KB - Last synced: about 1 year ago - Pushed: almost 2 years ago - Stars: 8 - Forks: 2

ea-uct/ai-safety-event-2021

A repository for the event on AI safety hosted by the Effective Altruism Society at the University of Cape Town.

Size: 360 KB - Last synced: about 1 year ago - Pushed: over 2 years ago - Stars: 0 - Forks: 0

Servan42/AI_story

Short story about artificial general intelligence (originally an English homework assignment).

Size: 151 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 1

riceissa/miri-top-contributors

Language: HTML - Size: 104 KB - Last synced: about 1 year ago - Pushed: over 5 years ago - Stars: 0 - Forks: 0

Related Keywords
ai-safety (83), ai-alignment (16), machine-learning (15), ai (13), llm (10), deep-learning (10), artificial-intelligence (10), large-language-models (9), reinforcement-learning (7), responsible-ai (6), adversarial-attacks (6), anomaly-detection (6), natural-language-processing (5), pytorch (5), fail-safe (5), failsafe (5), robust-machine-learning (4), trustworthy-ai (4), trustworthy-machine-learning (4), fairness-ai (4), ml-safety (4), gpt-3 (4), ai-assurance (4), ood-detection (4), alignment (4), interpretability (4), nlp (4), awesome (4), gpt-4 (3), robustness (3), rlhf (3), awesome-list (3), llms (3), gpt (3), aisafety (3), out-of-distribution-detection (3), bias-correction (3), language-model (3), transformer (3), safe-rlhf (3), python (3), safe-reinforcement-learning (3), expert-survey (2), mechanistic-interpretability (2), robotics (2), llmops (2), ml-testing (2), ethical-artificial-intelligence (2), artificial-intelligence-safety (2), mlops (2), artificial-intelligence-governance (2), ai-security (2), ai-governance (2), ai-risk (2), agents (2), responsible-ml (2), ml-fairness (2), ethical-ai (2), ml (2), rl (2), machine-learning-models (2), fairness-ml (2), safety (2), llama (2), datasets (2), beaver (2), cybersecurity (2), prompt-engineering (2), ood (2), novelty-detection (2), language-models (2), agi (2), classification (2), out-of-distribution (2), backdoor-attacks (2), xai (2), r (2), open-set (2), open-set-recognition (2), osr (2), pluralism (1), pettingzoo (1), donations-list-website (1), uncertainty-estimation (1), multiobjective-learning (1), lidar (1), reinforcement-learning-environments (1), sideeffects (1), simulation-environment (1), simulation-framework (1), sql (1), openai (1), preparedness (1), research (1), point-cloud (1), gnss (1), artificial-general-intelligence (1), constraint-satisfaction-problem (1), gridworld (1), autonomous-driving (1)