Level | Paper Title | Project Idea | Project Idea Reasoning | Rating | RD | Volatility | Score | Opponents | Active | Primary Area |
---|---|---|---|---|---|---|---|---|---|---|
undergraduate | $\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics | Title: Evaluating the Impact of Audio Data Augmentation on Avian Bioacoustic Classification using the BirdSet Benchmark Abstract: Data augmentation is crucial for improving model robustness, especially when training and testing distributions differ, as noted in the BirdSet paper (focal vs. soundscape). This project investigates the impact of common audio data augmentation techniques on avian bioacoustic classification performance using the BirdSet benchmark. We will train a standard Convolutional Neural Network (CNN) on a subset of BirdSet's training data (XCM) with and without specific augmentations (e.g., noise mixing, pitch shifting). Performance will be evaluated on a soundscape test set (e.g., NES) to quantify the effect of augmentation. Experiments: Related Work: This project builds directly on the BirdSet paper by using its datasets (XCM, NES, VOX) and addressing the challenge of covariate shift between focal training data and soundscape test data. While the paper benchmarks models, it doesn't systematically evaluate various data augmentation strategies. This project fills that gap by applying standard augmentation techniques, discussed in general audio ML literature, specifically to the BirdSet context to improve model generalization. Risk Factors & Limitations: Testable Hypothesis: Applying a combination of background noise mixing (using BirdSet's VOX data) and moderate pitch shifting during training will increase the AUROC score by at least 3% on the NES test set compared to a baseline CNN model trained without these augmentations on the BirdSet XCM subset. | This project is suitable for an undergraduate student with basic ML/Python knowledge. Following the debate feedback, it focuses on a practical, achievable task: data augmentation. It uses the provided BirdSet dataset (specifically the manageable XCM subset for training and a test set like NES) and standard libraries. The hypothesis is testable using common classification metrics. Experiments involve implementing common audio augmentations, training a standard CNN model, and comparing performance. This ensures the student gains hands-on experience with the ML workflow, data handling, and evaluation, aligning with typical undergraduate course objectives. The novelty lies in the systematic application and analysis of these techniques within the specific context of the BirdSet bioacoustics challenge. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | datasets and benchmarks |
undergraduate | Instance-dependent Early Stopping | Title: Implementation and Empirical Evaluation of Instance-dependent Early Stopping for Image Classification Abstract: Instance-dependent Early Stopping (IES) is proposed as a method to accelerate machine learning training by skipping backpropagation for 'mastered' instances. This project aims to implement and evaluate the IES technique using standard ML libraries (PyTorch/TensorFlow). The core approach involves modifying a standard image classification training loop to incorporate the IES logic based on second-order loss differences. Experiments will compare training time and accuracy of models trained with IES versus standard training and conventional early stopping on datasets like MNIST, Fashion-MNIST, and potentially a small domain-specific dataset (e.g., a subset of HAM10000 skin lesions). The expected outcome is a practical understanding of IES implementation and an empirical validation of its effectiveness on selected datasets. Experiments: Related Work: This project directly implements the method described in the 'Instance-dependent Early Stopping' paper. It builds upon standard machine learning concepts like early stopping and optimization algorithms (e.g., SGD, Adam). The novelty lies in the hands-on implementation and empirical evaluation of the IES technique, potentially on a dataset not extensively covered in the original paper, providing practical validation. Risk Factors & Limitations: Testable Hypothesis: Implementing Instance-dependent Early Stopping (IES) in a standard ML training pipeline will reduce the wall-clock training time by at least 10% compared to standard training on MNIST/Fashion-MNIST, while achieving a final test accuracy within 1% of the baseline model. | This project is suitable for an undergraduate student with basic ML knowledge and Python programming skills (PyTorch/TensorFlow). It involves implementing the core IES algorithm described in the paper, applying it to accessible image datasets (beyond just CIFAR-10 replication), and performing empirical comparisons. This provides hands-on experience with modifying training loops, experimental design, and performance evaluation – key skills in applied ML. The scope is manageable within a semester or two, using standard libraries and datasets. The novelty comes from applying and evaluating IES on a specific, perhaps less common, dataset relevant to the student's interest, verifying the paper's claims in a new context. This aligns with the debate findings favoring practical implementation and evaluation. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | unsupervised, self-supervised, semi-supervised, and supervised representation learning |
undergraduate | Shape as Line Segments: Accurate and Flexible Implicit Surface Representation | Title: Implementation and Evaluation of 2D Edge-Based Dual Contouring for Shape Reconstruction from Line Segment Fields Abstract: The 'Shape as Line Segments' (SALS) paper proposes a new way to represent 3D shapes using a Line Segment Field (LSF) and introduces an Edge-based Dual Contouring (E-DC) algorithm to extract surfaces. This project implements and evaluates E-DC in a simplified 2D setting. Given a predefined 2D LSF on a grid (indicating which grid edges intersect a shape and where), the goal is to implement the E-DC logic to extract the 2D contour. We will test the implementation on synthetic 2D shapes with features like sharp corners and compare its accuracy against the ground truth shape and potentially a simpler baseline like Marching Squares. Experiments: Related Work: This project implements the Edge-based Dual Contouring (E-DC) algorithm presented in Section 3.2 of the 'Shape as Line Segments' (SALS) paper, but simplified to a 2D context. It focuses on the surface extraction component, assuming the Line Segment Field (LSF) is given, unlike related work on generating LSFs (e.g., SALS Sec 3.4) or other contouring methods like Marching Cubes/Squares. The novelty for the student lies in the hands-on implementation and evaluation of this specific geometric algorithm. Risk Factors & Limitations: Testable Hypothesis: Implementing the E-DC algorithm logic in 2D, using intersection data from a predefined Line Segment Field for synthetic shapes, will produce contours with a Hausdorff distance to the ground truth that is at least 20% lower than contours generated by a basic Marching Squares approach using the same intersection data, particularly for shapes with sharp corners. | This project is well-suited for an undergraduate based on the provided debate analysis. Implementing the core Edge-based Dual Contouring (E-DC) algorithm in a simplified 2D setting is feasible, provides concrete programming experience with geometric algorithms, and allows for clear, quantitative evaluation. It directly engages with a key component of the SALS paper (Sec 3.2, Fig 4) without requiring deep ML knowledge or large computational resources. The debate suggested improvements like focusing on the algorithm first, using specific test cases, and visualization, which are incorporated. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | learning on graphs and other geometries & topologies |
undergraduate | This paper was rejected | Title: Replication Study: Evaluating the Impact of Degree-Corrected Benchmarking on Preferential Attachment vs. Common Neighbors Link Prediction Abstract: Link prediction benchmarks are crucial for evaluating graph algorithms, but the paper 'Implicit degree bias in the link prediction task' shows standard methods suffer from degree bias, favoring algorithms that exploit node popularity. This project aims to replicate and verify this finding by implementing two common link prediction algorithms, Preferential Attachment (PA, sensitive to degree) and Common Neighbors (CN, less sensitive), and evaluating them on a standard graph dataset using both the original (biased) and the degree-corrected benchmark proposed by the paper. We will compare the Area Under the ROC Curve (AUC-ROC) scores obtained under both benchmark conditions. We expect to observe a significant drop in PA's relative performance under the degree-corrected benchmark, confirming the paper's findings. Experiments: Related Work: This project directly investigates the core claim of the paper 'Implicit degree bias in the link prediction task' regarding the effect of degree-corrected benchmarks. It involves implementing fundamental link prediction heuristics (Preferential Attachment, Common Neighbors) often discussed in network science literature. The novelty lies in the hands-on replication and comparative analysis of the benchmark's impact. Risk Factors & Limitations: Testable Hypothesis: Evaluating Preferential Attachment (PA) and Common Neighbors (CN) on a standard graph dataset using the degree-corrected benchmark from the paper will yield a significantly lower relative AUC-ROC score for PA compared to CN, than when evaluated using the standard benchmark with uniform negative sampling. | This project is suitable for an undergraduate student. Based on the provided debate results, a project involving implementation and comparison of standard algorithms on different benchmarks is preferred over more complex tasks like LLM fine-tuning or VAE implementation. This project focuses on replicating a core experiment from the paper: comparing the performance of simple, well-known link prediction algorithms (Preferential Attachment - PA, and Common Neighbors - CN) on the original vs. the degree-corrected benchmark. It requires basic Python programming skills (using standard libraries like NetworkX and possibly Scikit-learn for AUC), understanding of graph concepts (nodes, edges, degree), and the ability to interpret evaluation metrics (AUC-ROC). It's feasible within a semester, uses readily available datasets and algorithms, and directly addresses the paper's central claim in a practical way. The novelty for the student lies in implementing the methods and observing the effect of the benchmark correction firsthand. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | |
undergraduate | This paper was rejected | Title: Empirical Comparison of BEAR and PCSE Exploration Strategies for Inverse Constraint Reinforcement Learning in Gridworld Environments Abstract: The paper 'Strategic Exploration for Inverse Constraint Inference' proposes two novel algorithms, BEAR and PCSE, for efficiently learning hidden constraints from expert behavior in reinforcement learning settings. This project aims to implement these two algorithms and compare their empirical performance against standard baseline exploration methods (like epsilon-greedy) in benchmark gridworld environments. The focus will be on evaluating sample efficiency (how quickly they learn) and the accuracy of the inferred constraints (using metrics like WGIoU mentioned in the paper's experiments). This provides practical insight into the effectiveness of these theoretically-grounded exploration strategies. Experiments: Related Work: This project directly implements and evaluates the BEAR and PCSE algorithms introduced in the input paper ('Strategic Exploration for Inverse Constraint Inference'). It focuses on empirically verifying the paper's claims about the efficiency of these strategic exploration methods compared to simpler baselines within the context of Inverse Constraint Reinforcement Learning (ICRL). It relates to fundamental concepts in reinforcement learning exploration strategies but specifically tests the novel approaches proposed for the ICRL problem. Risk Factors & Limitations: Testable Hypothesis: In gridworld environments with hidden constraints, the PCSE exploration strategy will converge to an optimal policy satisfying inferred constraints using at least 20% fewer environment samples compared to the BEAR strategy and at least 50% fewer samples compared to an epsilon-greedy baseline, while achieving similar final constraint accuracy (WGIoU). | This project is well-suited for an undergraduate student based on the provided debate logs and typical curriculum. It involves implementing and comparing the core algorithms (BEAR and PCSE) from the paper in accessible gridworld environments. This requires basic Python programming, familiarity with RL concepts (value functions, policies), and use of standard libraries (NumPy, potentially OpenAI Gym). The scope is manageable within 1-2 semesters. Novelty comes from the implementation itself, the direct comparison, and potential analysis of hyperparameter sensitivity, rather than developing new theory. It directly addresses the paper's main contribution in a practical, hands-on manner. The debate logs confirmed this is a good balance of feasibility and learning. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | |
undergraduate | This paper was rejected | Title: Implementing and Analyzing Layer-wise Prediction Maps for Vision Transformer Explainability on CIFAR-10 Abstract: Vision Transformers (ViTs) are powerful but complex. This project explores the 'Prediction Map' technique from the source paper to understand how ViTs make class predictions. We will implement Prediction Maps for a pre-trained ViT (e.g., ViT-Tiny/Small) on the CIFAR-10 dataset. The core task involves applying the ViT's classification head to patch tokens from different layers to generate class-specific heatmaps. We will visualize these maps, analyze how they evolve across layers, and compare them qualitatively to standard attention maps to understand the complementary information they provide. Experiments: Related Work: This project implements the core concept of 'Prediction Maps' introduced in 'From Attention to Prediction Maps'. It focuses on applying the paper's method for generating class-specific, gradient-free explanations by reusing the classification head on patch tokens. Unlike the full paper which also introduces PredicAtt fusion, this project focuses solely on generating and analyzing the basic Prediction Maps across layers, comparing them to standard ViT attention maps (e.g., Abnar & Zuidema, 2020) on the CIFAR-10 dataset. The novelty for the student lies in the implementation and comparative layer-wise analysis. Risk Factors & Limitations: Testable Hypothesis: Implementing the Prediction Map technique for a ViT-Small model on CIFAR-10 will produce class-specific heatmaps where localization of the target object improves (becomes more focused and accurate) in deeper layers (e.g., layer 11 vs layer 3), and these maps will visually differ from standard last-layer attention maps for the same image. | This project is suitable for an undergraduate student with basic Python and ML framework (PyTorch/TensorFlow) knowledge. It directly implements the core 'Prediction Map' concept from the paper on a manageable scale (ViT-Tiny/Small on CIFAR-10/100). It avoids the complexity of PredicAtt fusion but focuses on the key novelty: generating and analyzing layer-wise, class-specific maps. The task involves coding, experimentation, and visualization, aligning well with undergraduate ML course projects and the successful outcome of the provided debate logs favoring this type of project for its feasibility and clear learning objectives. Improvements like comparing with attention maps are included. | 2180.54 | 148.06 | 0.06 | 7 | | Yes | |
undergraduate | Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | Title: Applying Supervised Feature Dictionaries to Evaluate Sparse Autoencoder Interpretability for Sentiment Analysis in GPT-2 Small Abstract: Makelov et al. (2024) introduced a method using supervised feature dictionaries to evaluate sparse autoencoders (SAEs) for interpretability in language models, demonstrating it on the IOI task. This project applies their core methodology to a different task: sentiment analysis in GPT-2 Small. We will identify key attributes related to sentiment (e.g., presence of positive/negative words, negation). Using a standard sentiment dataset (e.g., SST-2), we will compute supervised feature dictionaries based on these attributes from GPT-2 Small activations. We will then train a basic task-specific SAE and evaluate its ability to capture sentiment features using the F1 interpretability score proposed in the paper, comparing its latent features to the supervised dictionary. Experiments: Related Work: This project directly applies the supervised evaluation methodology from Makelov et al. (2024) to a new domain (sentiment analysis). It builds on their concept of using task-specific attributes and supervised dictionaries as a benchmark for SAE interpretability. While sentiment analysis is a common task, this project specifically uses the Makelov et al. framework to evaluate SAE feature quality beyond reconstruction, focusing on F1 interpretability scores. Risk Factors & Limitations: Testable Hypothesis: A vanilla SAE trained on GPT-2 Small activations from a sentiment analysis task will contain specific latents that achieve a high F1 score (>0.7) for simple sentiment attributes (e.g., 'contains_positive_word', 'contains_negative_word') defined a priori. | Undergraduates typically have foundational ML knowledge and coding skills (Python). This project involves applying the paper's core methodology to a new, relatively well-defined task: sentiment analysis. This requires students to understand the supervised dictionary concept, identify relevant attributes for sentiment, implement the dictionary computation (e.g., mean difference), train a simple SAE (perhaps using existing libraries), and apply the paper's evaluation metrics (F1 score, potentially sufficiency/necessity). This task is less complex than IOI but still requires careful attribute definition and experimentation. It provides hands-on experience with the paper's method in a new context, suitable for a UG project. | 2129.55 | 151.05 | 0.06 | 7 | | Yes | interpretability and explainable AI |
undergraduate | ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks | Title: Evaluating the Conflict-Free Inverse Gradients (ConFIG) Optimizer for Training Physics-Informed Neural Networks on the 1D Heat Equation Abstract: Training Physics-Informed Neural Networks (PINNs) is often difficult due to conflicting loss terms (PDE residual, boundary/initial conditions). The ConFIG paper proposes an algorithm to mitigate these conflicts. This project implements the ConFIG optimizer for a standard 1D time-dependent PDE (e.g., the 1D Heat Equation) using PyTorch. We will compare the performance of ConFIG against the standard Adam optimizer in terms of solution accuracy (compared to the analytical solution) and loss convergence speed. The goal is to gain hands-on experience with PINNs and advanced optimization techniques. Experiments: Related Work: Directly implements the ConFIG algorithm presented in the paper. Applies it to a standard benchmark problem (1D Heat Equation) often used for introducing PINNs (related to Raissi et al., 2019). The novelty for the student lies in implementing and comparing this specific advanced optimization method against the standard Adam baseline in the context of PINNs. Risk Factors & Limitations: Testable Hypothesis: Using the ConFIG optimizer to train a PINN for the 1D Heat Equation will result in a >30% lower L2 error in the final solution compared to using the standard Adam optimizer after the same number of training epochs, while demonstrating more uniform convergence across PDE, initial condition, and boundary condition losses. | This project is suitable for an undergraduate with basic calculus, differential equations, and Python/ML library (PyTorch/TensorFlow) experience. It involves implementing the core ConFIG algorithm (potentially the 2-loss version initially, or the full version using provided code) on a simple, well-understood 1D PDE (like the Heat Equation). The novelty lies in the implementation experience, the direct comparison with a standard baseline (Adam), and analyzing the results, reinforcing concepts from the paper in a manageable scope. This aligns with the debate feedback favoring ConFIG for undergraduates if well-scoped and leveraging existing code. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | applications to physical sciences (physics, chemistry, biology, etc.) |
undergraduate | UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models | Title: Developing a Python Toolkit for Analyzing Error Types, Effective Accuracy, and Reasoning Gaps of Large Language Models on the UGMathBench Benchmark Abstract: Evaluating Large Language Model (LLM) mathematical reasoning requires detailed analysis beyond simple accuracy scores. The UGMathBench paper identified various error types and metrics like Effective Accuracy (EAcc) and Reasoning Gap (Δ). This project involves developing a Python-based analysis tool to process LLM outputs generated on the UGMathBench benchmark. The tool will implement functions to parse model responses, automatically classify common error types (e.g., calculation vs. reasoning flaws based on heuristics or keywords), calculate EAcc and Δ, and generate visualizations (e.g., bar charts, scatter plots) comparing model performance and error distributions across different mathematical subjects or LLMs. Experiments: Related Work: This project directly builds upon the UGMathBench paper (Xu et al., ICLR 2025) by creating a tool to operationalize its evaluation metrics (EAcc, Δ) and error analysis concepts. It relates to the broader field of LLM evaluation and interpretation tools. Unlike the original paper which presented the benchmark and initial findings, this project focuses on developing a reusable software tool for deeper, comparative analysis of LLM mathematical reasoning errors and robustness. Risk Factors & Limitations: Testable Hypothesis: A Python tool can be developed to automatically calculate EAcc and Δ from UGMathBench results and classify errors with >70% agreement with manual classification on a sample set, revealing distinct error profiles between calculus and algebra problems for a specific LLM. | Based on the user-provided debate outcomes, an undergraduate project should be feasible, testable, and provide concrete skills, avoiding excessive complexity or resource needs. The favored idea was developing a tool to analyze LLM errors on UGMathBench. This project aligns with that, tasking the student with creating a Python tool to process LLM outputs for UGMathBench problems, classify errors based on the paper's discussion (calculation, flawed reasoning, etc.), and visualize performance metrics (EAcc, Δ, error distributions) across subjects or models. This builds practical Python, data analysis, and visualization skills, is directly tied to the paper, has clear deliverables, and allows for focused, novel analysis (e.g., comparing specific models or error types) as suggested in the debate improvements. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | datasets and benchmarks |
undergraduate | Provably Accurate Shapley Value Estimation via Leverage Score Sampling | Title: Empirical Comparison of Leverage SHAP and Kernel SHAP for Efficient and Accurate Feature Importance Estimation Abstract: Shapley values help explain AI predictions, but standard methods like Kernel SHAP can be slow. The Leverage SHAP paper proposes a faster method using 'leverage score' sampling. This project involves implementing both Leverage SHAP and a baseline Kernel SHAP using Python. We will compare their performance (accuracy of Shapley value estimates vs. ground truth, computational time) on 2-3 standard benchmark datasets (e.g., Iris, Diabetes, California Housing from UCI repository). Additionally, we will analyze properties of these datasets (e.g., feature correlation, number of features) to understand when Leverage SHAP provides the most significant advantages. Experiments: Related Work: This project implements and empirically evaluates the Leverage SHAP algorithm introduced by Musco & Witter. It directly compares this method to the established Kernel SHAP algorithm (Lundberg & Lee, 2017), focusing on the practical trade-offs between speed and accuracy highlighted in the source paper. The project extends the paper's experiments by analyzing dataset properties that might influence the relative performance. Risk Factors & Limitations: Testable Hypothesis: Leverage SHAP will achieve a lower l2-norm error for a fixed computational budget (time limit) compared to Kernel SHAP when estimating Shapley values on the selected benchmark datasets (n < 20), particularly on datasets with higher feature correlation. | Based on the debates, this project is suitable for undergraduates as it involves implementing and comparing known algorithms (Kernel SHAP) with a recent, well-defined improvement (Leverage SHAP). It uses standard ML libraries and datasets. The novelty for the student lies in implementing Leverage SHAP, performing rigorous comparison, and potentially extending the analysis by investigating *why* one method outperforms the other on certain datasets, fulfilling the desire for a slightly deeper intellectual contribution identified in the debates. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | interpretability and explainable AI |
undergraduate | CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models | Title: Evaluating Stereotyping Biases in Recent Open-Source Language Models using the CEB Direct Evaluation Framework Abstract: The CEB paper introduced a benchmark for evaluating fairness in Large Language Models (LLMs). This project aims to apply a specific subset of the CEB evaluation methodology to assess potential biases in two recent, accessible open-source LLMs (e.g., Mistral-7b, Llama3-8b) that were not extensively covered in the original paper. We will focus on the 'Stereotyping' bias type using the 'Recognition' and 'Selection' direct evaluation tasks for 'Gender' and 'Race' social groups. Experiments will involve running the relevant CEB dataset prompts through the target LLMs and calculating Micro-F1 scores. Results will be compared to baselines reported in the CEB paper to understand how bias profiles may differ in newer models. Experiments: Related Work: This project directly applies the methodology and subsets of the datasets from the CEB paper (Wang et al., 2025). It extends the paper's analysis by evaluating newer open-source LLMs (e.g., Mistral-7b, Llama3-8b) not extensively studied in the original work. The novelty lies in providing updated benchmark results for these specific models, contributing to the ongoing evaluation landscape described in surveys like Gallegos et al. (2023). Risk Factors & Limitations: Testable Hypothesis: Evaluating two specific open-source LLMs (e.g., Mistral-7b, Llama3-8b) using the CEB direct evaluation tasks (Recognition, Selection) for Gender and Race stereotypes will reveal statistically significant differences (p < 0.05) in Micro-F1 scores compared to at least one baseline model reported in the CEB paper (e.g., Llama2-7b). | Based on the debates, undergraduates need a well-scoped project with clear steps, using existing frameworks. Replicating a subset of the CEB evaluation on newer, accessible open-source models fits this well. It leverages the paper's main contribution (the benchmark structure and methodology) and allows students to practice evaluation techniques. The scope is narrowed to specific bias types, tasks, and models to ensure feasibility. This involves using Python, potentially LLM APIs or small local models, and statistical comparison, suitable for undergrad skills. The novelty comes from evaluating models not included in the original paper. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | datasets and benchmarks |
undergraduate | Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? | Title: Investigating the Impact of Activation Functions on Circuit Multiplicity in MLPs Trained for XOR Abstract: The source paper demonstrates that multiple 'circuits' (subnetworks) within a small MLP can perfectly compute the XOR function, illustrating the non-identifiability of mechanistic explanations. This project aims to replicate and extend this finding by investigating how the choice of activation function influences this circuit multiplicity. We will train several small MLPs (e.g., 2 hidden layers of size 3) to compute XOR, each using a different activation function (e.g., ReLU, Sigmoid, Tanh, GeLU). We will then implement a simplified version of the paper's 'where-then-what' circuit search strategy to enumerate circuits with perfect accuracy (zero circuit error) for each trained MLP. The project will compare the number and basic properties (e.g., size/sparsity) of perfect circuits found across the different activation functions. Experiments: Related Work: This project directly replicates the experimental setup (XOR function, small MLP) and methodology ('where-then-what' circuit search based on circuit error) from the source paper ('Everything, Everywhere...'). It extends the paper's investigation by systematically varying the activation function, a factor known to influence network behavior but not explicitly studied in the context of circuit multiplicity in the paper's experiments, thus probing the robustness of the non-identifiability finding. Risk Factors & Limitations: Testable Hypothesis: Training identical MLP architectures on the XOR task using different activation functions (e.g., Sigmoid vs. ReLU vs. Tanh) will result in a statistically significant difference in the number of subnetworks ('perfect circuits') capable of perfectly replicating the XOR function. | This project is suitable for an undergraduate student with basic ML/Python knowledge. It involves replicating a core experiment from the paper (circuit search on XOR MLPs) and performing a systematic, controlled extension (varying activation functions) – a common structure for undergraduate research. It directly engages with the paper's methodology ('where-then-what') and findings (multiplicity of explanations). The scope is manageable, using small MLPs computationally feasible on standard hardware. The novelty lies in the systematic comparison across activation functions, which the paper acknowledges as influential but doesn't explore in this specific context. The debate feedback strongly supports this direction, highlighting its feasibility, novelty, and learning potential. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | interpretability and explainable AI |
undergraduate | This paper was rejected | Title: Replicating and Extending the Analysis of Positional Bias in Text Embeddings using Scientific Abstracts and Comparing APE, RoPE, and ALiBi Models Abstract: The 'Bias Learning' paper found that text embedding models often overweight the beginning of text. This project aims to replicate and extend this finding by investigating positional bias in the domain of scientific abstracts. We will collect abstracts from a specific field (e.g., Computer Science from arXiv), generate sentence embeddings using pre-trained models with different positional encoding techniques (APE, RoPE, ALiBi), and perform regression analysis as described in the paper to quantify sentence importance based on position. We will compare the extent of positional bias across the different models and analyze if the structure of scientific abstracts influences the bias patterns. This provides practical experience with embedding models and regression analysis. Experiments: Related Work: This project directly replicates the regression analysis methodology from the 'Bias Learning' paper to investigate positional bias. It extends the paper's work by applying the analysis to a new domain (scientific abstracts) and explicitly comparing models with different positional encoding strategies (APE, RoPE, ALiBi) on this specific data type. Risk Factors & Limitations: Testable Hypothesis: Applying the regression analysis from the 'Bias Learning' paper to scientific abstracts will reveal a statistically significant negative correlation between sentence position and its regression coefficient (importance weight) for APE and RoPE models, while the ALiBi model will exhibit significantly less positional bias. | This project is suitable for an undergraduate student as it involves replicating and slightly extending a core analysis from the paper using standard tools and accessible datasets. It requires understanding embeddings, regression, and data analysis but doesn't demand novel algorithm development or large-scale training. The novelty comes from applying the established methodology to a new domain (scientific abstracts) and comparing different model types (APE, RoPE, ALiBi). This aligns well with the user-provided debate context which strongly favored this type of replication/extension project for undergraduates. | 2067.99 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | Language Representations Can be What Recommenders Need: Findings and Potentials | Title: Evaluating Linearly Mapped Language Model Embeddings for Movie Recommendation Against a Collaborative Filtering Baseline Abstract: Language models (LMs) encode rich information. The source paper suggests this includes user preference signals useful for recommendations. This project investigates this by implementing a simple recommendation model using pre-trained LM embeddings of movie titles. We will obtain embeddings for MovieLens-100k titles using a standard LM (e.g., DistilBERT via sentence-transformers). A linear mapping layer will be trained to project these embeddings into a 'behavior space' predicting user ratings or interactions. Performance will be compared against a standard collaborative filtering baseline (e.g., KNNBasic from Surprise library) using metrics like RMSE and NDCG@10. This explores the feasibility of leveraging LM semantics for recommendation as highlighted in the paper. Experiments: Related Work: This project implements a simplified version of the 'Linear Mapping' approach explored in Section 3 and Figure 1 of the source paper. It tests the paper's hypothesis that LM representations can be linearly projected into an effective behavior space for recommendation, comparing it against traditional collaborative filtering (CF) methods like Matrix Factorization or KNN (baseline mentioned in paper Figure 1b). It uses standard tools like sentence-transformers and the Surprise library for accessibility. Risk Factors & Limitations: Testable Hypothesis: A recommendation model using item title embeddings from a pre-trained language model (all-MiniLM-L6-v2) projected via a trained linear layer will achieve a statistically significant improvement in NDCG@10 compared to a KNNBasic collaborative filtering baseline on the MovieLens-100k dataset. | This project is suitable for an undergraduate student with basic Python and ML knowledge (as confirmed by the debate analysis favoring this idea). It involves implementing a simplified version of the paper's core concept: using pre-trained LM embeddings and a linear map for recommendations. It uses standard libraries (HuggingFace Transformers/sentence-transformers, Scikit-learn, Pandas, Surprise) and a common dataset (MovieLens). The scope is manageable for a semester project, focusing on implementation, comparison with a known baseline (CF), and analysis of results. The debate feedback suggested improvements like exploring smaller LMs and detailed metrics, which are incorporated. The primary risk is debugging the implementation, which is typical for undergraduate projects. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | other topics in machine learning (i.e., none of the above) |
undergraduate | Approximation algorithms for combinatorial optimization with predictions | Title: Empirical Evaluation of a Learning-Augmented Greedy Algorithm for the Set Cover Problem Abstract: Fast approximation algorithms often sacrifice solution quality. Antoniadis et al. (2024) proposed using predictions to improve these algorithms. This project implements and evaluates their general black-box method (Theorem 1) for the classic NP-hard Set Cover problem. We will implement the standard greedy approximation algorithm for Set Cover and the proposed learning-augmented modification. Using benchmark Set Cover instances, we will generate simulated predictions of varying quality (controlling false positives `η+` and false negatives `η-`). Experiments will compare the solution quality (size of the set cover) and potentially runtime of the prediction-augmented algorithm against the baseline greedy algorithm, analyzing how performance degrades with increasing prediction error. The expected outcome is an empirical validation of the learning-augmented approach for Set Cover. Experiments: Related Work: This project directly implements the general framework for minimization problems (Theorem 1) from Antoniadis et al. (2024), applying it specifically to the Set Cover problem using the standard greedy algorithm (which has approximation ratio `ρ ≈ ln |Universe|`). It relates to classic work on Set Cover approximations. The novelty lies in the empirical implementation and evaluation of the *learning-augmented* version for Set Cover, analyzing its performance sensitivity to controlled prediction errors, which is not explicitly detailed for Set Cover in the paper. Risk Factors & Limitations: Testable Hypothesis: Applying the prediction-augmented framework (Theorem 1) to the greedy Set Cover algorithm on benchmark instances, using simulated predictions where the total error weight (`η+ + (ρ-1)η-`) is less than 20% of the baseline solution cost, will yield set covers that are consistently >10% smaller than those found by the baseline greedy algorithm. | Based on the debate outcomes, this project focuses on implementing the general black-box framework (Theorem 1) for the Set Cover problem. Undergraduates should be able to implement the greedy Set Cover algorithm and the weight modification logic. Using simulated predictions makes the project self-contained and allows systematic study of performance vs. prediction error. The novelty lies in the empirical evaluation and analysis for Set Cover, validating the paper's general claims in a specific context. This project provides practical experience with algorithm implementation, empirical analysis, and understanding approximation trade-offs. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | optimization |
undergraduate | This paper was rejected | Title: A Comparative Study of Gymnasium's Object-Oriented vs. Functional APIs for Implementing Policy Gradient Algorithms Abstract: The Gymnasium paper introduces both a standard object-oriented API (`Env`) and a newer functional API (`FuncEnv`). This project aims to compare these two interfaces in practice. We will implement two standard policy gradient algorithms, REINFORCE and Actor-Critic (A2C), using both the `Env` and `FuncEnv` APIs. These implementations will be tested on classic control environments (CartPole-v1, Acrobot-v1). The comparison will focus on qualitative aspects (ease of implementation, code structure) and quantitative metrics (runtime performance, learning curves). This project provides hands-on experience with different Gymnasium APIs and evaluates their practical trade-offs for implementing common RL algorithms. Experiments: Related Work: This project directly engages with the Gymnasium paper by comparing two key API structures it provides (`Env`, `FuncEnv`). It relates to standard implementations of REINFORCE and Actor-Critic algorithms, often used as baselines in RL research. The novelty lies not in the algorithms themselves, but in the direct comparison of implementing them using Gymnasium's different interfaces, providing practical insights into the library's design choices. Risk Factors & Limitations: Testable Hypothesis: Implementing REINFORCE and A2C using Gymnasium's `FuncEnv` API for CartPole-v1 will result in code that is qualitatively different (e.g., more functional style) but achieves statistically similar learning performance (final average reward) and comparable wall-clock training time compared to implementations using the standard `Env` API on consumer hardware. | This project is suitable for an undergraduate student with introductory AI/ML coursework and Python programming skills. Based on the debate feedback favoring feasibility and clear learning objectives, this project focuses on comparing Gymnasium's standard object-oriented `Env` API with the newer `FuncEnv` API. It involves implementing well-known, relatively simple RL algorithms (REINFORCE, A2C) using both APIs for standard benchmark environments (CartPole, Acrobot). This provides practical experience with the library's core features, reinforces understanding of RL algorithms, and involves direct comparison based on implementation effort and runtime performance metrics. The scope is manageable, resources (Gymnasium, standard libraries) are readily available, and the comparison aspect adds a layer of analysis beyond simple implementation. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | This paper was rejected | Title: Re-Examining Natural Memorization Dynamics in Natural Language Processing Models Abstract: The paper 'Back to Fundamentals' challenges previous findings by showing that for 'natural' data, increasing model size (over-parameterization) and training time can actually *decrease* memorization in deep learning models, using computer vision benchmarks. This project aims to test the generality of these findings in the Natural Language Processing (NLP) domain. We will replicate the core experiments from the paper, measuring natural memorization scores, but using standard NLP datasets (e.g., IMDb sentiment classification) and relevant architectures (e.g., LSTM, a simple Transformer). We will investigate how memorization changes with varying model complexity (e.g., small vs. larger LSTM/Transformer) and across training epochs. The expected outcome is to confirm or contrast the paper's findings in the NLP domain. Experiments: Related Work: This project directly replicates the experimental setup of 'Back to Fundamentals: Re-Examining Memorization...' but applies it to the NLP domain. The paper primarily used image datasets (CIFAR, Tiny ImageNet) and vision architectures (VGG, ResNet, ViT). This project uses text data and sequence models (LSTM, Transformer) to test if the core findings—that natural memorization decreases with over-parameterization and extended training—hold true across different data modalities and model families, providing evidence for the generality of the paper's claims. Risk Factors & Limitations: Testable Hypothesis: Applying the natural memorization analysis from 'Back to Fundamentals' to standard NLP classification tasks will show that (1) larger NLP models (more parameters) exhibit less natural memorization than smaller ones, and (2) natural memorization rates decrease after an initial peak as training progresses, consistent with the paper's findings in computer vision. | This project is suitable for an undergraduate student as it involves replicating and extending key findings from the paper, which is a standard practice in undergraduate research and aligns with the debate outcomes. It leverages existing programming skills (Python, ML libraries) and knowledge from coursework (deep learning concepts, model training). The chosen extension—testing the paper's findings on a different data modality (NLP) and architectures (LSTM/Transformer)—provides clear novelty without requiring the student to invent entirely new methods. The scope is manageable: train a few standard models on accessible datasets (e.g., IMDb sentiment). Feasibility is high given standard university computing resources (cloud credits or local GPUs might be needed for Transformers, but smaller versions can be used). It directly addresses the paper's core claims about over-parameterization and training iterations. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | This paper was rejected | Title: Evaluating the Sensitivity of Initialization Completeness Effects to Hyperparameters in the Multi-Digit XOR Task Abstract: The paper 'Interpretable Patterns in Random Initialization...' suggests that the way a neural network is randomly initialized influences the final solution it learns, proposing a 'completeness' metric for tasks like multi-digit XOR. This project aims to replicate this finding for the 3-bit XOR task and extend it by systematically investigating how sensitive this relationship is to common training hyperparameters. We will implement the XOR task, calculate the initial 'completeness' signal `g(a)` for various initializations, train MLP models, and analyze if the initial signal predicts the learned representation. We will then repeat this analysis while varying weight decay and learning rate to test the robustness of the paper's findings. Experiments: Related Work: This project directly builds upon the methodology and findings presented in Section 3.2 (Multi-digit XOR) of the paper 'Interpretable Patterns in Random Initialization...'. It aims to replicate Figure 2 and test the robustness of the 'completeness' hypothesis to hyperparameter variations, a standard practice in empirical ML validation. It relates to broader concepts of hyperparameter tuning and sensitivity analysis in deep learning. The novelty lies in the systematic evaluation of hyperparameter effects on the specific initialization-representation link identified in the paper. Risk Factors & Limitations: Testable Hypothesis: The high correlation (>80th percentile rank on average) between the initial 'completeness' signal `g(a)` and the learned representation for the 3-bit XOR task, as reported in the paper, will persist across variations in weight decay (0.1-10) and learning rate (0.001-0.1), demonstrating the robustness of the finding, although the strength of the correlation may vary. | Based on the paper and the debate outcomes, a replication and extension project is ideal for undergraduates. The XOR task is simple enough to implement, the 'completeness' metric `g(a)` is defined, and hyperparameter sensitivity analysis provides a clear path for novel investigation using standard ML experimental practices. This project aligns with typical undergraduate skills (Python, basic ML libraries), is feasible within a semester or two, and offers valuable learning in interpretability and experimental design. The suggestions from the debates (using standard libraries, specific hyperparameters, visualization, multiple seeds) are incorporated. | 2052.00 | 148.06 | 0.06 | 6 | | Yes | |
undergraduate | This paper was rejected | Title: Evaluating Predefined FuRud-Inspired Fuzzy Corrections for Debiasing In-Context Learning on a Novel Text Classification Task Abstract: In-Context Learning (ICL) with Large Language Models (LLMs) can produce biased results in classification tasks. The FuRud paper proposed using fuzzy rules to correct output probabilities interpretably. This project implements the *correction* mechanism inspired by FuRud, applying predefined triangular membership functions (from the paper) to adjust probabilities generated by a standard LLM API for a new text classification dataset (e.g., SST-2 sentiment analysis, or a different news categorization task). We will measure the impact of these corrections on overall accuracy and pairwise class accuracy bias (COBias), comparing results before and after applying the FuRud-inspired transformations. Experiments: Related Work: Directly applies the core correction mechanism (fuzzy membership functions) proposed in the FuRud paper. Unlike the original paper, it uses predefined functions instead of optimizing them. Relates to practical ML evaluation, bias measurement in NLP, and working with LLM APIs. Novelty lies in applying the technique to a new dataset and analyzing its effectiveness in that specific context. Risk Factors & Limitations: Testable Hypothesis: Applying predefined FuRud-inspired fuzzy rule transformations (e.g., amplifying low probabilities for underperforming classes, reducing high probabilities for overperforming ones) to the 1-shot ICL output probabilities from LLM L on dataset D will reduce the COBias metric by at least 5% compared to the baseline probabilities, while maintaining overall accuracy within +/- 2%. | This project is suitable for an undergraduate student with basic Python programming skills and introductory ML knowledge. It leverages the FuRud paper's core idea but simplifies it by using *predefined* fuzzy membership functions (as shown in Fig 2) instead of requiring the student to implement the complex multi-objective optimization (SA). The student focuses on applying these functions to transform probabilities from a standard LLM API for a *new* dataset, calculating standard metrics (Accuracy, COBias), and analyzing the results. This provides practical experience with bias metrics, LLM APIs, and implementing paper concepts without excessive complexity. This aligns with the debate outcome favoring feasibility and clear learning objectives. | 2045.23 | 148.33 | 0.06 | 6 | | Yes | |
undergraduate | Improving Unsupervised Constituency Parsing via Maximizing Semantic Information | Title: Implementing and Evaluating the SemInfo Metric for Ranking Parse Tree Quality on Synthetic Data Abstract: Why do some ways of structuring a sentence feel more 'meaningful' than others? The SemInfo paper suggests a method to quantify this 'semantic information'. This project implements the core SemInfo calculation based on analyzing substrings in pre-generated paraphrases. We will use a provided synthetic dataset with sentences having known 'good' and 'bad' parse structures. We will calculate SemInfo scores for both parse types and compare if SemInfo consistently assigns higher scores to the good parses. We will also calculate basic Log-Likelihood scores from a simple PCFG trained on the synthetic data and compare its ability to rank parses against SemInfo's. This provides hands-on experience with semantic analysis concepts and evaluating parsing objectives. Experiments: Related Work: This project implements and analyzes the core SemInfo calculation proposed in 'Improving Unsupervised Constituency Parsing via Maximizing Semantic Information'. It focuses on validating the paper's claim that SemInfo better reflects parse quality than Log-Likelihood (LL), using a simplified setting with synthetic data and pre-defined parses. It relates to basic concepts in parsing evaluation and probabilistic models (PCFGs). The novelty for the student lies in implementing the SemInfo metric and performing a direct comparison against the traditional LL objective. Risk Factors & Limitations: Testable Hypothesis: The calculated SemInfo score will correctly rank the predefined 'good' parse structure higher than the 'bad' parse structure for >85% of sentences in the provided synthetic dataset, demonstrating significantly better ranking performance than Log-Likelihood scores derived from a baseline PCFG trained on the same data. | Based on the debates, undergraduates need a feasible project with clear learning outcomes, manageable scope, and reduced risk. Implementing the full SemInfo training loop is too complex. This project focuses on the core SemInfo *calculation* and its ability to distinguish good parses from bad ones, compared to LL, using controlled synthetic data. This avoids complex PCFG training and LLM paraphrasing costs/noise while reinforcing key concepts. It directly addresses the feasibility and scope concerns raised in the debates. | 2031.51 | 148.06 | 0.06 | 6 | | Yes | applications to computer vision, audio, language, and other modalities |
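As an illustration of the kind of comparison the last row proposes (SemInfo vs. log-likelihood for ranking known-good against known-bad parses), here is a minimal sketch of a ranking-accuracy harness. It assumes the two scorers (`seminfo_score`, `pcfg_loglikelihood`) and the synthetic dataset are implemented separately; those names are hypothetical placeholders, not artifacts provided by the SemInfo paper.

```python
# Hypothetical harness for the parse-ranking comparison in the SemInfo project row above.
# The scorers are assumed placeholders: callables taking (sentence, parse) -> float,
# to be implemented following the SemInfo paper and a baseline PCFG, respectively.
from typing import Callable, Sequence


def ranking_accuracy(
    sentences: Sequence[str],
    good_parses: Sequence[str],
    bad_parses: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of sentences for which score_fn ranks the 'good' parse above the 'bad' one."""
    correct = sum(
        score_fn(sent, good) > score_fn(sent, bad)
        for sent, good, bad in zip(sentences, good_parses, bad_parses)
    )
    return correct / len(sentences)


# Usage sketch (assumes seminfo_score, pcfg_loglikelihood, and the synthetic data exist):
# acc_seminfo = ranking_accuracy(sents, good, bad, seminfo_score)
# acc_ll = ranking_accuracy(sents, good, bad, pcfg_loglikelihood)
# print(f"SemInfo: {acc_seminfo:.1%}  Log-likelihood: {acc_ll:.1%}")
```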