Level | Paper Title | Project Idea | Project Idea Reasoning | Rating | RD | Volatility | Score | Opponents | Active | Primary Area |
---|---|---|---|---|---|---|---|---|---|---|
undergraduate | $\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics | Title: Evaluating the Impact of Audio Data Augmentation on Avian Bioacoustic Classification using the BirdSet Benchmark Abstract: Data augmentation is crucial for improving model robustness, especially when training and testing distributions differ, as noted in the BirdSet paper (focal vs. soundscape). This project investigates the impact of common audio data augmentation techniques on avian bioacoustic classification performance using the BirdSet benchmark. We will train a standard Convolutional Neural Network (CNN) on a subset of BirdSet's training data (XCM) with and without specific augmentations (e.g., noise mixing, pitch shifting). Performance will be evaluated on a soundscape test set (e.g., NES) to quantify the effect of augmentation. Experiments: Related Work: This project builds directly on the BirdSet paper by using its datasets (XCM, NES, VOX) and addressing the challenge of covariate shift between focal training data and soundscape test data. While the paper benchmarks models, it doesn't systematically evaluate various data augmentation strategies. This project fills that gap by applying standard augmentation techniques, discussed in general audio ML literature, specifically to the BirdSet context to improve model generalization. Risk Factors & Limitations: Testable Hypothesis: Applying a combination of background noise mixing (using BirdSet's VOX data) and moderate pitch shifting during training will increase the AUROC score by at least 3% on the NES test set compared to a baseline CNN model trained without these augmentations on the BirdSet XCM subset. | This project is suitable for an undergraduate student with basic ML/Python knowledge. Following the debate feedback, it focuses on a practical, achievable task: data augmentation. It uses the provided BirdSet dataset (specifically the manageable XCM subset for training and a test set like NES) and standard libraries. The hypothesis is testable using common classification metrics. Experiments involve implementing common audio augmentations, training a standard CNN model, and comparing performance. This ensures the student gains hands-on experience with the ML workflow, data handling, and evaluation, aligning with typical undergraduate course objectives. The novelty lies in the systematic application and analysis of these techniques within the specific context of the BirdSet bioacoustics challenge. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | datasets and benchmarks |
undergraduate | Instance-dependent Early Stopping | Title: Implementation and Empirical Evaluation of Instance-dependent Early Stopping for Image Classification Abstract: Instance-dependent Early Stopping (IES) is proposed as a method to accelerate machine learning training by skipping backpropagation for 'mastered' instances. This project aims to implement and evaluate the IES technique using standard ML libraries (PyTorch/TensorFlow). The core approach involves modifying a standard image classification training loop to incorporate the IES logic based on second-order loss differences. Experiments will compare training time and accuracy of models trained with IES versus standard training and conventional early stopping on datasets like MNIST, Fashion-MNIST, and potentially a small domain-specific dataset (e.g., a subset of HAM10000 skin lesions). The expected outcome is a practical understanding of IES implementation and an empirical validation of its effectiveness on selected datasets. Experiments: Related Work: This project directly implements the method described in the 'Instance-dependent Early Stopping' paper. It builds upon standard machine learning concepts like early stopping and optimization algorithms (e.g., SGD, Adam). The novelty lies in the hands-on implementation and empirical evaluation of the IES technique, potentially on a dataset not extensively covered in the original paper, providing practical validation. Risk Factors & Limitations: Testable Hypothesis: Implementing Instance-dependent Early Stopping (IES) in a standard ML training pipeline will reduce the wall-clock training time by at least 10% compared to standard training on MNIST/Fashion-MNIST, while achieving a final test accuracy within 1% of the baseline model. | This project is suitable for an undergraduate student with basic ML knowledge and Python programming skills (PyTorch/TensorFlow). It involves implementing the core IES algorithm described in the paper, applying it to accessible image datasets (beyond just CIFAR-10 replication), and performing empirical comparisons. This provides hands-on experience with modifying training loops, experimental design, and performance evaluation – key skills in applied ML. The scope is manageable within a semester or two, using standard libraries and datasets. The novelty comes from applying and evaluating IES on a specific, perhaps less common, dataset relevant to the student's interest, verifying the paper's claims in a new context. This aligns with the debate findings favoring practical implementation and evaluation. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | unsupervised, self-supervised, semi-supervised, and supervised representation learning |
undergraduate | Shape as Line Segments: Accurate and Flexible Implicit Surface Representation | Title: Implementation and Evaluation of 2D Edge-Based Dual Contouring for Shape Reconstruction from Line Segment Fields Abstract: The 'Shape as Line Segments' (SALS) paper proposes a new way to represent 3D shapes using a Line Segment Field (LSF) and introduces an Edge-based Dual Contouring (E-DC) algorithm to extract surfaces. This project implements and evaluates E-DC in a simplified 2D setting. Given a predefined 2D LSF on a grid (indicating which grid edges intersect a shape and where), the goal is to implement the E-DC logic to extract the 2D contour. We will test the implementation on synthetic 2D shapes with features like sharp corners and compare its accuracy against the ground truth shape and potentially a simpler baseline like Marching Squares. Experiments: Related Work: This project implements the Edge-based Dual Contouring (E-DC) algorithm presented in Section 3.2 of the 'Shape as Line Segments' (SALS) paper, but simplified to a 2D context. It focuses on the surface extraction component, assuming the Line Segment Field (LSF) is given, unlike related work on generating LSFs (e.g., SALS Sec 3.4) or other contouring methods like Marching Cubes/Squares. The novelty for the student lies in the hands-on implementation and evaluation of this specific geometric algorithm. Risk Factors & Limitations: Testable Hypothesis: Implementing the E-DC algorithm logic in 2D, using intersection data from a predefined Line Segment Field for synthetic shapes, will produce contours with a Hausdorff distance to the ground truth that is at least 20% lower than contours generated by a basic Marching Squares approach using the same intersection data, particularly for shapes with sharp corners. | This project is well-suited for an undergraduate based on the provided debate analysis. Implementing the core Edge-based Dual Contouring (E-DC) algorithm in a simplified 2D setting is feasible, provides concrete programming experience with geometric algorithms, and allows for clear, quantitative evaluation. It directly engages with a key component of the SALS paper (Sec 3.2, Fig 4) without requiring deep ML knowledge or large computational resources. The debate suggested improvements like focusing on the algorithm first, using specific test cases, and visualization, which are incorporated. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | learning on graphs and other geometries & topologies |
undergraduate | This paper was rejected | Title: Replication Study: Evaluating the Impact of Degree-Corrected Benchmarking on Preferential Attachment vs. Common Neighbors Link Prediction Abstract: Link prediction benchmarks are crucial for evaluating graph algorithms, but the paper 'Implicit degree bias in the link prediction task' shows standard methods suffer from degree bias, favoring algorithms that exploit node popularity. This project aims to replicate and verify this finding by implementing two common link prediction algorithms, Preferential Attachment (PA, sensitive to degree) and Common Neighbors (CN, less sensitive), and evaluating them on a standard graph dataset using both the original (biased) and the degree-corrected benchmark proposed by the paper. We will compare the Area Under the ROC Curve (AUC-ROC) scores obtained under both benchmark conditions. We expect to observe a significant drop in PA's relative performance under the degree-corrected benchmark, confirming the paper's findings. Experiments: Related Work: This project directly investigates the core claim of the paper 'Implicit degree bias in the link prediction task' regarding the effect of degree-corrected benchmarks. It involves implementing fundamental link prediction heuristics (Preferential Attachment, Common Neighbors) often discussed in network science literature. The novelty lies in the hands-on replication and comparative analysis of the benchmark's impact. Risk Factors & Limitations: Testable Hypothesis: Evaluating Preferential Attachment (PA) and Common Neighbors (CN) on a standard graph dataset using the degree-corrected benchmark from the paper will yield a significantly lower relative AUC-ROC score for PA compared to CN, than when evaluated using the standard benchmark with uniform negative sampling. | This project is suitable for an undergraduate student. Based on the provided debate results, a project involving implementation and comparison of standard algorithms on different benchmarks is preferred over more complex tasks like LLM fine-tuning or VAE implementation. This project focuses on replicating a core experiment from the paper: comparing the performance of simple, well-known link prediction algorithms (Preferential Attachment - PA, and Common Neighbors - CN) on the original vs. the degree-corrected benchmark. It requires basic Python programming skills (using standard libraries like NetworkX and possibly Scikit-learn for AUC), understanding of graph concepts (nodes, edges, degree), and the ability to interpret evaluation metrics (AUC-ROC). It's feasible within a semester, uses readily available datasets and algorithms, and directly addresses the paper's central claim in a practical way. The novelty for the student lies in implementing the methods and observing the effect of the benchmark correction firsthand. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | |
undergraduate | This paper was rejected | Title: Empirical Comparison of BEAR and PCSE Exploration Strategies for Inverse Constraint Reinforcement Learning in Gridworld Environments Abstract: The paper 'Strategic Exploration for Inverse Constraint Inference' proposes two novel algorithms, BEAR and PCSE, for efficiently learning hidden constraints from expert behavior in reinforcement learning settings. This project aims to implement these two algorithms and compare their empirical performance against standard baseline exploration methods (like epsilon-greedy) in benchmark gridworld environments. The focus will be on evaluating sample efficiency (how quickly they learn) and the accuracy of the inferred constraints (using metrics like WGIoU mentioned in the paper's experiments). This provides practical insight into the effectiveness of these theoretically-grounded exploration strategies. Experiments: Related Work: This project directly implements and evaluates the BEAR and PCSE algorithms introduced in the input paper ('Strategic Exploration for Inverse Constraint Inference'). It focuses on empirically verifying the paper's claims about the efficiency of these strategic exploration methods compared to simpler baselines within the context of Inverse Constraint Reinforcement Learning (ICRL). It relates to fundamental concepts in reinforcement learning exploration strategies but specifically tests the novel approaches proposed for the ICRL problem. Risk Factors & Limitations: Testable Hypothesis: In gridworld environments with hidden constraints, the PCSE exploration strategy will converge to an optimal policy satisfying inferred constraints using at least 20% fewer environment samples compared to the BEAR strategy and at least 50% fewer samples compared to an epsilon-greedy baseline, while achieving similar final constraint accuracy (WGIoU). | This project is well-suited for an undergraduate student based on the provided debate logs and typical curriculum. It involves implementing and comparing the core algorithms (BEAR and PCSE) from the paper in accessible gridworld environments. This requires basic Python programming, familiarity with RL concepts (value functions, policies), and use of standard libraries (NumPy, potentially OpenAI Gym). The scope is manageable within 1-2 semesters. Novelty comes from the implementation itself, the direct comparison, and potential analysis of hyperparameter sensitivity, rather than developing new theory. It directly addresses the paper's main contribution in a practical, hands-on manner. The debate logs confirmed this is a good balance of feasibility and learning. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | |
undergraduate | This paper was rejected | Title: Implementing and Analyzing Layer-wise Prediction Maps for Vision Transformer Explainability on CIFAR-10 Abstract: Vision Transformers (ViTs) are powerful but complex. This project explores the 'Prediction Map' technique from the source paper to understand how ViTs make class predictions. We will implement Prediction Maps for a pre-trained ViT (e.g., ViT-Tiny/Small) on the CIFAR-10 dataset. The core task involves applying the ViT's classification head to patch tokens from different layers to generate class-specific heatmaps. We will visualize these maps, analyze how they evolve across layers, and compare them qualitatively to standard attention maps to understand the complementary information they provide. Experiments: Related Work: This project implements the core concept of 'Prediction Maps' introduced in 'From Attention to Prediction Maps'. It focuses on applying the paper's method for generating class-specific, gradient-free explanations by reusing the classification head on patch tokens. Unlike the full paper which also introduces PredicAtt fusion, this project focuses solely on generating and analyzing the basic Prediction Maps across layers, comparing them to standard ViT attention maps (e.g., Abnar & Zuidema, 2020) on the CIFAR-10 dataset. The novelty for the student lies in the implementation and comparative layer-wise analysis. Risk Factors & Limitations: Testable Hypothesis: Implementing the Prediction Map technique for a ViT-Small model on CIFAR-10 will produce class-specific heatmaps where localization of the target object improves (becomes more focused and accurate) in deeper layers (e.g., layer 11 vs layer 3), and these maps will visually differ from standard last-layer attention maps for the same image. | This project is suitable for an undergraduate student with basic Python and ML framework (PyTorch/TensorFlow) knowledge. It directly implements the core 'Prediction Map' concept from the paper on a manageable scale (ViT-Tiny/Small on CIFAR-10/100). It avoids the complexity of PredicAtt fusion but focuses on the key novelty: generating and analyzing layer-wise, class-specific maps. The task involves coding, experimentation, and visualization, aligning well with undergraduate ML course projects and the successful outcome of the provided debate logs favoring this type of project for its feasibility and clear learning objectives. Improvements like comparing with attention maps are included. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | |
undergraduate | Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | Title: Applying Supervised Feature Dictionaries to Evaluate Sparse Autoencoder Interpretability for Sentiment Analysis in GPT-2 Small Abstract: Makelov et al. (2024) introduced a method using supervised feature dictionaries to evaluate sparse autoencoders (SAEs) for interpretability in language models, demonstrating it on the IOI task. This project applies their core methodology to a different task: sentiment analysis in GPT-2 Small. We will identify key attributes related to sentiment (e.g., presence of positive/negative words, negation). Using a standard sentiment dataset (e.g., SST-2), we will compute supervised feature dictionaries based on these attributes from GPT-2 Small activations. We will then train a basic task-specific SAE and evaluate its ability to capture sentiment features using the F1 interpretability score proposed in the paper, comparing its latent features to the supervised dictionary. Experiments: Related Work: This project directly applies the supervised evaluation methodology from Makelov et al. (2024) to a new domain (sentiment analysis). It builds on their concept of using task-specific attributes and supervised dictionaries as a benchmark for SAE interpretability. While sentiment analysis is a common task, this project specifically uses the Makelov et al. framework to evaluate SAE feature quality beyond reconstruction, focusing on F1 interpretability scores. Risk Factors & Limitations: Testable Hypothesis: A vanilla SAE trained on GPT-2 Small activations from a sentiment analysis task will contain specific latents that achieve a high F1 score (>0.7) for simple sentiment attributes (e.g., 'contains_positive_word', 'contains_negative_word') defined a priori. | Undergraduates typically have foundational ML knowledge and coding skills (Python). This project involves applying the paper's core methodology to a new, relatively well-defined task: sentiment analysis. This requires students to understand the supervised dictionary concept, identify relevant attributes for sentiment, implement the dictionary computation (e.g., mean difference), train a simple SAE (perhaps using existing libraries), and apply the paper's evaluation metrics (F1 score, potentially sufficiency/necessity). This task is less complex than IOI but still requires careful attribute definition and experimentation. It provides hands-on experience with the paper's method in a new context, suitable for a UG project. | 2129.55 | 151.05 | 0.06 | 7 | | Yes | interpretability and explainable AI |
undergraduate | ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks | Title: Evaluating the Conflict-Free Inverse Gradients (ConFIG) Optimizer for Training Physics-Informed Neural Networks on the 1D Heat Equation Abstract: Training Physics-Informed Neural Networks (PINNs) is often difficult due to conflicting loss terms (PDE residual, boundary/initial conditions). The ConFIG paper proposes an algorithm to mitigate these conflicts. This project implements the ConFIG optimizer for a standard 1D time-dependent PDE (e.g., the 1D Heat Equation) using PyTorch. We will compare the performance of ConFIG against the standard Adam optimizer in terms of solution accuracy (compared to the analytical solution) and loss convergence speed. The goal is to gain hands-on experience with PINNs and advanced optimization techniques. Experiments: Related Work: Directly implements the ConFIG algorithm presented in the paper. Applies it to a standard benchmark problem (1D Heat Equation) often used for introducing PINNs (related to Raissi et al., 2019). The novelty for the student lies in implementing and comparing this specific advanced optimization method against the standard Adam baseline in the context of PINNs. Risk Factors & Limitations: Testable Hypothesis: Using the ConFIG optimizer to train a PINN for the 1D Heat Equation will result in a >30% lower L2 error in the final solution compared to using the standard Adam optimizer after the same number of training epochs, while demonstrating more uniform convergence across PDE, initial condition, and boundary condition losses. | This project is suitable for an undergraduate with basic calculus, differential equations, and Python/ML library (PyTorch/TensorFlow) experience. It involves implementing the core ConFIG algorithm (potentially the 2-loss version initially, or the full version using provided code) on a simple, well-understood 1D PDE (like the Heat Equation). The novelty lies in the implementation experience, the direct comparison with a standard baseline (Adam), and analyzing the results, reinforcing concepts from the paper in a manageable scope. This aligns with the debate feedback favoring ConFIG for undergraduates if well-scoped and leveraging existing code. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | applications to physical sciences (physics, chemistry, biology, etc.) |
undergraduate | UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models | Title: Developing a Python Toolkit for Analyzing Error Types, Effective Accuracy, and Reasoning Gaps of Large Language Models on the UGMathBench Benchmark Abstract: Evaluating Large Language Model (LLM) mathematical reasoning requires detailed analysis beyond simple accuracy scores. The UGMathBench paper identified various error types and metrics like Effective Accuracy (EAcc) and Reasoning Gap (Δ). This project involves developing a Python-based analysis tool to process LLM outputs generated on the UGMathBench benchmark. The tool will implement functions to parse model responses, automatically classify common error types (e.g., calculation vs. reasoning flaws based on heuristics or keywords), calculate EAcc and Δ, and generate visualizations (e.g., bar charts, scatter plots) comparing model performance and error distributions across different mathematical subjects or LLMs. Experiments: Related Work: This project directly builds upon the UGMathBench paper (Xu et al., ICLR 2025) by creating a tool to operationalize its evaluation metrics (EAcc, Δ) and error analysis concepts. It relates to the broader field of LLM evaluation and interpretation tools. Unlike the original paper which presented the benchmark and initial findings, this project focuses on developing a reusable software tool for deeper, comparative analysis of LLM mathematical reasoning errors and robustness. Risk Factors & Limitations: Testable Hypothesis: A Python tool can be developed to automatically calculate EAcc and Δ from UGMathBench results and classify errors with >70% agreement with manual classification on a sample set, revealing distinct error profiles between calculus and algebra problems for a specific LLM. | Based on the user-provided debate outcomes, an undergraduate project should be feasible, testable, and provide concrete skills, avoiding excessive complexity or resource needs. The favored idea was developing a tool to analyze LLM errors on UGMathBench. This project aligns with that, tasking the student with creating a Python tool to process LLM outputs for UGMathBench problems, classify errors based on the paper's discussion (calculation, flawed reasoning, etc.), and visualize performance metrics (EAcc, Δ, error distributions) across subjects or models. This builds practical Python, data analysis, and visualization skills, is directly tied to the paper, has clear deliverables, and allows for focused, novel analysis (e.g., comparing specific models or error types) as suggested in the debate improvements. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | datasets and benchmarks |
undergraduate | Provably Accurate Shapley Value Estimation via Leverage Score Sampling | Title: Empirical Comparison of Leverage SHAP and Kernel SHAP for Efficient and Accurate Feature Importance Estimation Abstract: Shapley values help explain AI predictions, but standard methods like Kernel SHAP can be slow. The Leverage SHAP paper proposes a faster method using 'leverage score' sampling. This project involves implementing both Leverage SHAP and a baseline Kernel SHAP using Python. We will compare their performance (accuracy of Shapley value estimates vs. ground truth, computational time) on 2-3 standard benchmark datasets (e.g., Iris, Diabetes, California Housing from UCI repository). Additionally, we will analyze properties of these datasets (e.g., feature correlation, number of features) to understand when Leverage SHAP provides the most significant advantages. Experiments: Related Work: This project implements and empirically evaluates the Leverage SHAP algorithm introduced by Musco & Witter. It directly compares this method to the established Kernel SHAP algorithm (Lundberg & Lee, 2017), focusing on the practical trade-offs between speed and accuracy highlighted in the source paper. The project extends the paper's experiments by analyzing dataset properties that might influence the relative performance. Risk Factors & Limitations: Testable Hypothesis: Leverage SHAP will achieve a lower l2-norm error for a fixed computational budget (time limit) compared to Kernel SHAP when estimating Shapley values on the selected benchmark datasets (n < 20), particularly on datasets with higher feature correlation. | Based on the debates, this project is suitable for undergraduates as it involves implementing and comparing known algorithms (Kernel SHAP) with a recent, well-defined improvement (Leverage SHAP). It uses standard ML libraries and datasets. The novelty for the student lies in implementing Leverage SHAP, performing rigorous comparison, and potentially extending the analysis by investigating *why* one method outperforms the other on certain datasets, fulfilling the desire for a slightly deeper intellectual contribution identified in the debates. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | interpretability and explainable AI |
undergraduate | CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models | Title: Evaluating Stereotyping Biases in Recent Open-Source Language Models using the CEB Direct Evaluation Framework Abstract: The CEB paper introduced a benchmark for evaluating fairness in Large Language Models (LLMs). This project aims to apply a specific subset of the CEB evaluation methodology to assess potential biases in two recent, accessible open-source LLMs (e.g., Mistral-7b, Llama3-8b) that were not extensively covered in the original paper. We will focus on the 'Stereotyping' bias type using the 'Recognition' and 'Selection' direct evaluation tasks for 'Gender' and 'Race' social groups. Experiments will involve running the relevant CEB dataset prompts through the target LLMs and calculating Micro-F1 scores. Results will be compared to baselines reported in the CEB paper to understand how bias profiles may differ in newer models. Experiments: Related Work: This project directly applies the methodology and subsets of the datasets from the CEB paper (Wang et al., 2025). It extends the paper's analysis by evaluating newer open-source LLMs (e.g., Mistral-7b, Llama3-8b) not extensively studied in the original work. The novelty lies in providing updated benchmark results for these specific models, contributing to the ongoing evaluation landscape described in surveys like Gallegos et al. (2023). Risk Factors & Limitations: Testable Hypothesis: Evaluating two specific open-source LLMs (e.g., Mistral-7b, Llama3-8b) using the CEB direct evaluation tasks (Recognition, Selection) for Gender and Race stereotypes will reveal statistically significant differences (p < 0.05) in Micro-F1 scores compared to at least one baseline model reported in the CEB paper (e.g., Llama2-7b). | Based on the debates, undergraduates need a well-scoped project with clear steps, using existing frameworks. Replicating a subset of the CEB evaluation on newer, accessible open-source models fits this well. It leverages the paper's main contribution (the benchmark structure and methodology) and allows students to practice evaluation techniques. The scope is narrowed to specific bias types, tasks, and models to ensure feasibility. This involves using Python, potentially LLM APIs or small local models, and statistical comparison, suitable for undergrad skills. The novelty comes from evaluating models not included in the original paper. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | datasets and benchmarks |
undergraduate | Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? | Title: Investigating the Impact of Activation Functions on Circuit Multiplicity in MLPs Trained for XOR Abstract: The source paper demonstrates that multiple 'circuits' (subnetworks) within a small MLP can perfectly compute the XOR function, illustrating the non-identifiability of mechanistic explanations. This project aims to replicate and extend this finding by investigating how the choice of activation function influences this circuit multiplicity. We will train several small MLPs (e.g., 2 hidden layers of size 3) to compute XOR, each using a different activation function (e.g., ReLU, Sigmoid, Tanh, GeLU). We will then implement a simplified version of the paper's 'where-then-what' circuit search strategy to enumerate circuits with perfect accuracy (zero circuit error) for each trained MLP. The project will compare the number and basic properties (e.g., size/sparsity) of perfect circuits found across the different activation functions. Experiments: Related Work: This project directly replicates the experimental setup (XOR function, small MLP) and methodology ('where-then-what' circuit search based on circuit error) from the source paper ('Everything, Everywhere...'). It extends the paper's investigation by systematically varying the activation function, a factor known to influence network behavior but not explicitly studied in the context of circuit multiplicity in the paper's experiments, thus probing the robustness of the non-identifiability finding. Risk Factors & Limitations: Testable Hypothesis: Training identical MLP architectures on the XOR task using different activation functions (e.g., Sigmoid vs. ReLU vs. Tanh) will result in a statistically significant difference in the number of subnetworks ('perfect circuits') capable of perfectly replicating the XOR function. | This project is suitable for an undergraduate student with basic ML/Python knowledge. It involves replicating a core experiment from the paper (circuit search on XOR MLPs) and performing a systematic, controlled extension (varying activation functions) – a common structure for undergraduate research. It directly engages with the paper's methodology ('where-then-what') and findings (multiplicity of explanations). The scope is manageable, using small MLPs computationally feasible on standard hardware. The novelty lies in the systematic comparison across activation functions, which the paper acknowledges as influential but doesn't explore in this specific context. The debate feedback strongly supports this direction, highlighting its feasibility, novelty, and learning potential. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | interpretability and explainable AI |
undergraduate | This paper was rejected | Title: Replicating and Extending the Analysis of Positional Bias in Text Embeddings using Scientific Abstracts and Comparing APE, RoPE, and ALiBi Models Abstract: The 'Bias Learning' paper found that text embedding models often overweight the beginning of text. This project aims to replicate and extend this finding by investigating positional bias in the domain of scientific abstracts. We will collect abstracts from a specific field (e.g., Computer Science from arXiv), generate sentence embeddings using pre-trained models with different positional encoding techniques (APE, RoPE, ALiBi), and perform regression analysis as described in the paper to quantify sentence importance based on position. We will compare the extent of positional bias across the different models and analyze if the structure of scientific abstracts influences the bias patterns. This provides practical experience with embedding models and regression analysis. Experiments: Related Work: This project directly replicates the regression analysis methodology from the 'Bias Learning' paper to investigate positional bias. It extends the paper's work by applying the analysis to a new domain (scientific abstracts) and explicitly comparing models with different positional encoding strategies (APE, RoPE, ALiBi) on this specific data type. Risk Factors & Limitations: Testable Hypothesis: Applying the regression analysis from the 'Bias Learning' paper to scientific abstracts will reveal a statistically significant negative correlation between sentence position and its regression coefficient (importance weight) for APE and RoPE models, while the ALiBi model will exhibit significantly less positional bias. | This project is suitable for an undergraduate student as it involves replicating and slightly extending a core analysis from the paper using standard tools and accessible datasets. It requires understanding embeddings, regression, and data analysis but doesn't demand novel algorithm development or large-scale training. The novelty comes from applying the established methodology to a new domain (scientific abstracts) and comparing different model types (APE, RoPE, ALiBi). This aligns well with the user-provided debate context which strongly favored this type of replication/extension project for undergraduates. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | Language Representations Can be What Recommenders Need: Findings and Potentials | Title: Evaluating Linearly Mapped Language Model Embeddings for Movie Recommendation Against a Collaborative Filtering Baseline Abstract: Language models (LMs) encode rich information. The source paper suggests this includes user preference signals useful for recommendations. This project investigates this by implementing a simple recommendation model using pre-trained LM embeddings of movie titles. We will obtain embeddings for MovieLens-100k titles using a standard LM (e.g., DistilBERT via sentence-transformers). A linear mapping layer will be trained to project these embeddings into a 'behavior space' predicting user ratings or interactions. Performance will be compared against a standard collaborative filtering baseline (e.g., KNNBasic from Surprise library) using metrics like RMSE and NDCG@10. This explores the feasibility of leveraging LM semantics for recommendation as highlighted in the paper. Experiments: Related Work: This project implements a simplified version of the 'Linear Mapping' approach explored in Section 3 and Figure 1 of the source paper. It tests the paper's hypothesis that LM representations can be linearly projected into an effective behavior space for recommendation, comparing it against traditional collaborative filtering (CF) methods like Matrix Factorization or KNN (baseline mentioned in paper Figure 1b). It uses standard tools like sentence-transformers and the Surprise library for accessibility. Risk Factors & Limitations: Testable Hypothesis: A recommendation model using item title embeddings from a pre-trained language model (all-MiniLM-L6-v2) projected via a trained linear layer will achieve a statistically significant improvement in NDCG@10 compared to a KNNBasic collaborative filtering baseline on the MovieLens-100k dataset. | This project is suitable for an undergraduate student with basic Python and ML knowledge (as confirmed by the debate analysis favoring this idea). It involves implementing a simplified version of the paper's core concept: using pre-trained LM embeddings and a linear map for recommendations. It uses standard libraries (HuggingFace Transformers/sentence-transformers, Scikit-learn, Pandas, Surprise) and a common dataset (MovieLens). The scope is manageable for a semester project, focusing on implementation, comparison with a known baseline (CF), and analysis of results. The debate feedback suggested improvements like exploring smaller LMs and detailed metrics, which are incorporated. The primary risk is debugging the implementation, which is typical for undergraduate projects. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | other topics in machine learning (i.e., none of the above) |
undergraduate | Approximation algorithms for combinatorial optimization with predictions | Title: Empirical Evaluation of a Learning-Augmented Greedy Algorithm for the Set Cover Problem Abstract: Fast approximation algorithms often sacrifice solution quality. Antoniadis et al. (2024) proposed using predictions to improve these algorithms. This project implements and evaluates their general black-box method (Theorem 1) for the classic NP-hard Set Cover problem. We will implement the standard greedy approximation algorithm for Set Cover and the proposed learning-augmented modification. Using benchmark Set Cover instances, we will generate simulated predictions of varying quality (controlling false positives `η+` and false negatives `η-`). Experiments will compare the solution quality (size of the set cover) and potentially runtime of the prediction-augmented algorithm against the baseline greedy algorithm, analyzing how performance degrades with increasing prediction error. The expected outcome is an empirical validation of the learning-augmented approach for Set Cover. Experiments: Related Work: This project directly implements the general framework for minimization problems (Theorem 1) from Antoniadis et al. (2024), applying it specifically to the Set Cover problem using the standard greedy algorithm (which has approximation ratio `ρ ≈ ln |Universe|`). It relates to classic work on Set Cover approximations. The novelty lies in the empirical implementation and evaluation of the *learning-augmented* version for Set Cover, analyzing its performance sensitivity to controlled prediction errors, which is not explicitly detailed for Set Cover in the paper. Risk Factors & Limitations: Testable Hypothesis: Applying the prediction-augmented framework (Theorem 1) to the greedy Set Cover algorithm on benchmark instances, using simulated predictions where the total error weight (`η+ + (ρ-1)η-`) is less than 20% of the baseline solution cost, will yield set covers that are consistently >10% smaller than those found by the baseline greedy algorithm. | Based on the debate outcomes, this project focuses on implementing the general black-box framework (Theorem 1) for the Set Cover problem. Undergraduates should be able to implement the greedy Set Cover algorithm and the weight modification logic. Using simulated predictions makes the project self-contained and allows systematic study of performance vs. prediction error. The novelty lies in the empirical evaluation and analysis for Set Cover, validating the paper's general claims in a specific context. This project provides practical experience with algorithm implementation, empirical analysis, and understanding approximation trade-offs. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | optimization |
undergraduate | This paper was rejected | Title: A Comparative Study of Gymnasium's Object-Oriented vs. Functional APIs for Implementing Policy Gradient Algorithms Abstract: The Gymnasium paper introduces both a standard object-oriented API (`Env`) and a newer functional API (`FuncEnv`). This project aims to compare these two interfaces in practice. We will implement two standard policy gradient algorithms, REINFORCE and Actor-Critic (A2C), using both the `Env` and `FuncEnv` APIs. These implementations will be tested on classic control environments (CartPole-v1, Acrobot-v1). The comparison will focus on qualitative aspects (ease of implementation, code structure) and quantitative metrics (runtime performance, learning curves). This project provides hands-on experience with different Gymnasium APIs and evaluates their practical trade-offs for implementing common RL algorithms. Experiments: Related Work: This project directly engages with the Gymnasium paper by comparing two key API structures it provides (`Env`, `FuncEnv`). It relates to standard implementations of REINFORCE and Actor-Critic algorithms, often used as baselines in RL research. The novelty lies not in the algorithms themselves, but in the direct comparison of implementing them using Gymnasium's different interfaces, providing practical insights into the library's design choices. Risk Factors & Limitations: Testable Hypothesis: Implementing REINFORCE and A2C using Gymnasium's `FuncEnv` API for CartPole-v1 will result in code that is qualitatively different (e.g., more functional style) but achieves statistically similar learning performance (final average reward) and comparable wall-clock training time compared to implementations using the standard `Env` API on consumer hardware. | This project is suitable for an undergraduate student with introductory AI/ML coursework and Python programming skills. Based on the debate feedback favoring feasibility and clear learning objectives, this project focuses on comparing Gymnasium's standard object-oriented `Env` API with the newer `FuncEnv` API. It involves implementing well-known, relatively simple RL algorithms (REINFORCE, A2C) using both APIs for standard benchmark environments (CartPole, Acrobot). This provides practical experience with the library's core features, reinforces understanding of RL algorithms, and involves direct comparison based on implementation effort and runtime performance metrics. The scope is manageable, resources (Gymnasium, standard libraries) are readily available, and the comparison aspect adds a layer of analysis beyond simple implementation. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | This paper was rejected | Title: Re-Examining Natural Memorization Dynamics in Natural Language Processing Models Abstract: The paper 'Back to Fundamentals' challenges previous findings by showing that for 'natural' data, increasing model size (over-parameterization) and training time can actually *decrease* memorization in deep learning models, using computer vision benchmarks. This project aims to test the generality of these findings in the Natural Language Processing (NLP) domain. We will replicate the core experiments from the paper, measuring natural memorization scores, but using standard NLP datasets (e.g., IMDb sentiment classification) and relevant architectures (e.g., LSTM, a simple Transformer). We will investigate how memorization changes with varying model complexity (e.g., small vs. larger LSTM/Transformer) and across training epochs. The expected outcome is to confirm or contrast the paper's findings in the NLP domain. Experiments: Related Work: This project directly replicates the experimental setup of 'Back to Fundamentals: Re-Examining Memorization...' but applies it to the NLP domain. The paper primarily used image datasets (CIFAR, Tiny ImageNet) and vision architectures (VGG, ResNet, ViT). This project uses text data and sequence models (LSTM, Transformer) to test if the core findings—that natural memorization decreases with over-parameterization and extended training—hold true across different data modalities and model families, providing evidence for the generality of the paper's claims. Risk Factors & Limitations: Testable Hypothesis: Applying the natural memorization analysis from 'Back to Fundamentals' to standard NLP classification tasks will show that (1) larger NLP models (more parameters) exhibit less natural memorization than smaller ones, and (2) natural memorization rates decrease after an initial peak as training progresses, consistent with the paper's findings in computer vision. | This project is suitable for an undergraduate student as it involves replicating and extending key findings from the paper, which is a standard practice in undergraduate research and aligns with the debate outcomes. It leverages existing programming skills (Python, ML libraries) and knowledge from coursework (deep learning concepts, model training). The chosen extension—testing the paper's findings on a different data modality (NLP) and architectures (LSTM/Transformer)—provides clear novelty without requiring the student to invent entirely new methods. The scope is manageable: train a few standard models on accessible datasets (e.g., IMDb sentiment). Feasibility is high given standard university computing resources (cloud credits or local GPUs might be needed for Transformers, but smaller versions can be used). It directly addresses the paper's core claims about over-parameterization and training iterations. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | This paper was rejected | Title: Evaluating the Sensitivity of Initialization Completeness Effects to Hyperparameters in the Multi-Digit XOR Task Abstract: The paper 'Interpretable Patterns in Random Initialization...' suggests that the way a neural network is randomly initialized influences the final solution it learns, proposing a 'completeness' metric for tasks like multi-digit XOR. This project aims to replicate this finding for the 3-bit XOR task and extend it by systematically investigating how sensitive this relationship is to common training hyperparameters. We will implement the XOR task, calculate the initial 'completeness' signal `g(a)` for various initializations, train MLP models, and analyze if the initial signal predicts the learned representation. We will then repeat this analysis while varying weight decay and learning rate to test the robustness of the paper's findings. Experiments: Related Work: This project directly builds upon the methodology and findings presented in Section 3.2 (Multi-digit XOR) of the paper 'Interpretable Patterns in Random Initialization...'. It aims to replicate Figure 2 and test the robustness of the 'completeness' hypothesis to hyperparameter variations, a standard practice in empirical ML validation. It relates to broader concepts of hyperparameter tuning and sensitivity analysis in deep learning. The novelty lies in the systematic evaluation of hyperparameter effects on the specific initialization-representation link identified in the paper. Risk Factors & Limitations: Testable Hypothesis: The high correlation (>80th percentile rank on average) between the initial 'completeness' signal `g(a)` and the learned representation for the 3-bit XOR task, as reported in the paper, will persist across variations in weight decay (0.1-10) and learning rate (0.001-0.1), demonstrating the robustness of the finding, although the strength of the correlation may vary. | Based on the paper and the debate outcomes, a replication and extension project is ideal for undergraduates. The XOR task is simple enough to implement, the 'completeness' metric `g(a)` is defined, and hyperparameter sensitivity analysis provides a clear path for novel investigation using standard ML experimental practices. This project aligns with typical undergraduate skills (Python, basic ML libraries), is feasible within a semester or two, and offers valuable learning in interpretability and experimental design. The suggestions from the debates (using standard libraries, specific hyperparameters, visualization, multiple seeds) are incorporated. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | This paper was rejected | Title: Evaluating Predefined FuRud-Inspired Fuzzy Corrections for Debiasing In-Context Learning on a Novel Text Classification Task Abstract: In-Context Learning (ICL) with Large Language Models (LLMs) can produce biased results in classification tasks. The FuRud paper proposed using fuzzy rules to correct output probabilities interpretably. This project implements the *correction* mechanism inspired by FuRud, applying predefined triangular membership functions (from the paper) to adjust probabilities generated by a standard LLM API for a new text classification dataset (e.g., SST-2 sentiment analysis, or a different news categorization task). We will measure the impact of these corrections on overall accuracy and pairwise class accuracy bias (COBias), comparing results before and after applying the FuRud-inspired transformations. Experiments: Related Work: Directly applies the core correction mechanism (fuzzy membership functions) proposed in the FuRud paper. Unlike the original paper, it uses predefined functions instead of optimizing them. Relates to practical ML evaluation, bias measurement in NLP, and working with LLM APIs. Novelty lies in applying the technique to a new dataset and analyzing its effectiveness in that specific context. Risk Factors & Limitations: Testable Hypothesis: Applying predefined FuRud-inspired fuzzy rule transformations (e.g., amplifying low probabilities for underperforming classes, reducing high probabilities for overperforming ones) to the 1-shot ICL output probabilities from LLM L on dataset D will reduce the COBias metric by at least 5% compared to the baseline probabilities, while maintaining overall accuracy within +/- 2%. | This project is suitable for an undergraduate student with basic Python programming skills and introductory ML knowledge. It leverages the FuRud paper's core idea but simplifies it by using *predefined* fuzzy membership functions (as shown in Fig 2) instead of requiring the student to implement the complex multi-objective optimization (SA). The student focuses on applying these functions to transform probabilities from a standard LLM API for a *new* dataset, calculating standard metrics (Accuracy, COBias), and analyzing the results. This provides practical experience with bias metrics, LLM APIs, and implementing paper concepts without excessive complexity. This aligns with the debate outcome favoring feasibility and clear learning objectives. | 2045.23 | 148.33 | 0.06 | 6 | | Yes | |
undergraduate | Improving Unsupervised Constituency Parsing via Maximizing Semantic Information | Title: Implementing and Evaluating the SemInfo Metric for Ranking Parse Tree Quality on Synthetic Data Abstract: Why do some ways of structuring a sentence feel more 'meaningful' than others? The SemInfo paper suggests a method to quantify this 'semantic information'. This project implements the core SemInfo calculation based on analyzing substrings in pre-generated paraphrases. We will use a provided synthetic dataset with sentences having known 'good' and 'bad' parse structures. We will calculate SemInfo scores for both parse types and compare if SemInfo consistently assigns higher scores to the good parses. We will also calculate basic Log-Likelihood scores from a simple PCFG trained on the synthetic data and compare its ability to rank parses against SemInfo's. This provides hands-on experience with semantic analysis concepts and evaluating parsing objectives. Experiments: Related Work: This project implements and analyzes the core SemInfo calculation proposed in 'Improving Unsupervised Constituency Parsing via Maximizing Semantic Information'. It focuses on validating the paper's claim that SemInfo better reflects parse quality than Log-Likelihood (LL), using a simplified setting with synthetic data and pre-defined parses. It relates to basic concepts in parsing evaluation and probabilistic models (PCFGs). The novelty for the student lies in implementing the SemInfo metric and performing a direct comparison against the traditional LL objective. Risk Factors & Limitations: Testable Hypothesis: The calculated SemInfo score will correctly rank the predefined 'good' parse structure higher than the 'bad' parse structure for >85% of sentences in the provided synthetic dataset, demonstrating significantly better ranking performance than Log-Likelihood scores derived from a baseline PCFG trained on the same data. | Based on the debates, undergraduates need a feasible project with clear learning outcomes, manageable scope, and reduced risk. Implementing the full SemInfo training loop is too complex. This project focuses on the core SemInfo *calculation* and its ability to distinguish good parses from bad ones, compared to LL, using controlled synthetic data. This avoids complex PCFG training and LLM paraphrasing costs/noise while reinforcing key concepts. It directly addresses the feasibility and scope concerns raised in the debates. | 2031.51 | 148.06 | 0.06 | 6 | | Yes | applications to computer vision, audio, language, and other modalities |
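As an illustration of the kind of comparison the last row proposes (SemInfo vs. log-likelihood for ranking known-good against known-bad parses), here is a minimal sketch of a ranking-accuracy harness. It assumes the two scorers (`seminfo_score`, `pcfg_loglikelihood`) and the synthetic dataset are implemented separately; those names are hypothetical placeholders, not artifacts provided by the SemInfo paper.

```python
# Hypothetical harness for the parse-ranking comparison in the SemInfo project row above.
# The scorers are assumed placeholders: callables taking (sentence, parse) -> float,
# to be implemented following the SemInfo paper and a baseline PCFG, respectively.
from typing import Callable, Sequence


def ranking_accuracy(
    sentences: Sequence[str],
    good_parses: Sequence[str],
    bad_parses: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of sentences for which score_fn ranks the 'good' parse above the 'bad' one."""
    correct = sum(
        score_fn(sent, good) > score_fn(sent, bad)
        for sent, good, bad in zip(sentences, good_parses, bad_parses)
    )
    return correct / len(sentences)


# Usage sketch (assumes seminfo_score, pcfg_loglikelihood, and the synthetic data exist):
# acc_seminfo = ranking_accuracy(sents, good, bad, seminfo_score)
# acc_ll = ranking_accuracy(sents, good, bad, pcfg_loglikelihood)
# print(f"SemInfo: {acc_seminfo:.1%}  Log-likelihood: {acc_ll:.1%}")
```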