ICLR 2025 Paper Ranking | OpenReview Analysis

Title	Score	Keywords	Primary Area	Rating 1-10	Confidence 1-5	Soundness 1-4	Presentation 1-4	Contribution 1-4
Scaling In-the-Wild Training for Diffusion-based Illumination Harmonization and Editing by Imposing Consistent Light Transport	71.19	diffusion model, illumination editing, image editing	applications to computer vision, audio, language, and other modalities	10,10,10,10	3,4,5,5	3,4,4,4	3,3,4,4	4,4,4,4
TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation	65.22	co-speech video generation, cross-modal retrieval, audio repsentation learning, motion repsentation learning, video frame interpolation	applications to computer vision, audio, language, and other modalities	8,8,8,10	4,5,5,5	3,4,4,4	2,4,4,4	3,4,4,4
SAM 2: Segment Anything in Images and Videos	60.67	computer vision, video segmentation, image segmentation	applications to computer vision, audio, language, and other modalities	8,8,10,10	4,4,4,5	3,4,4,4	3,3,4,4	3,3,4,4
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows	60.17	LLM Benchmark, Data Science and Engineering, Code Generation, Text-to-SQL, LLM Agent	datasets and benchmarks	8,8,8,8	4,4,5,5	3,3,4,4	3,4,4,4	3,4,4,4
OLMoE: Open Mixture-of-Experts Language Models	59.88	large language models, mixture-of-experts, open-source	foundation or frontier models, including LLMs	8,8,10	2,3,5	4,4,4	4,4,4	3,4,4
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions	58.50	Code Generation, Tool Use, Instruction Following, Benchmark	datasets and benchmarks	8,8,10,10	3,4,4,4	3,3,4,4	3,4,4,4	3,4,4,4
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency	58.43	Diffusion Model, Avatar, Portrait Animation, Audio-Condition Video Generation	applications to computer vision, audio, language, and other modalities	8,8,8,8	4,4,4,5	4,4,4,4	3,3,4,4	3,3,4,4
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models	57.75	Mechanistic Interpretability, Hallucinations, Language Models	interpretability and explainable AI	8,8,10,10	4,4,4,5	3,3,4,4	3,4,4,4	3,3,3,4
Latent Bayesian Optimization via Autoregressive Normalizing Flows	55.69	Bayesian optimization, normalizing flow	probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)	8,8,8,8	3,4,5,5	3,3,4,4	3,3,4,4	3,4,4,4
Simplifying, Stabilizing and Scaling Continuous-time Consistency Models	55.64	continuous-time consistency models, diffusion models, fast sampling	generative models	8,8,10,10,10	3,4,4,4,5	3,3,3,4,4	3,3,3,4,4	3,3,4,4,4
ShEPhERD: Diffusing shape, electrostatics, and pharmacophores for bioisosteric drug design	54.18	3D molecular generation, drug design, molecules	applications to physical sciences (physics, chemistry, biology, etc.)	6,8,10	4,5,5	2,3,4	3,4,4	3,4,4
Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues	53.37	State Tracking, state space, mamba, Linear RNN, Linear Attention, GLA, DeltaNet, Formal Languages	other topics in machine learning (i.e., none of the above)	8,8,8	3,4,4	3,4,4	3,4,4	3,4,4
Scaling and evaluating sparse autoencoders	53.31	interpretability, sparse autoencoders, superposition, scaling laws	interpretability and explainable AI	3,8,10,10,10	4,4,4,4,4	3,3,4,4,4	3,3,3,4,4	2,4,4,4,4
$\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics	52.76	audio classification, multi-label, dataset collection, bioacoustics	datasets and benchmarks	6,8,8,8	4,4,5,5	2,4,4,4	3,4,4,4	2,3,4,4
On Scaling Up 3D Gaussian Splatting Training	52.61	Gaussian Splatting, Machine Learning System, Distributed Training	infrastructure, software libraries, hardware, systems, etc.	8,8,8,8	4,4,5,5	3,3,3,4	2,3,4,4	3,4,4,4
Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?	52.59	llms, translation, low-resource, grammar, long-context, linguistics	foundation or frontier models, including LLMs	6,8,8	4,4,4	4,4,4	4,4,4	2,3,4
When Attention Sink Emerges in Language Models: An Empirical View	52.59	Attention Sink, Language Models, Empirical Study	foundation or frontier models, including LLMs	6,8,8	4,4,5	4,4,4	3,3,4	3,3,4
MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions	52.54	LLM evaluation, multi-table question answering; multi-hop question answering	datasets and benchmarks	8,8,8	4,4,5	3,4,4	3,3,4	3,3,4
LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization	50.55	optimization, LoRA	optimization	8,8,10	3,3,4	4,4,4	3,3,4	3,3,4
Transformers Provably Solve Parity Efficiently with Chain of Thought	50.41	transformers, chain of thought, parity, self-consistency	learning theory	8,8,10	3,3,4	3,4,4	3,4,4	3,3,4

Scaling In-the-Wild Training for Diffusion-based Illumination Harmonization and Editing by Imposing Consistent Light Transport

71.19

diffusion model, illumination editing, image editing

applications to computer vision, audio, language, and other modalities

10,10,10,10

3,4,5,5

3,4,4,4

3,3,4,4

4,4,4,4

TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation

65.22

co-speech video generation, cross-modal retrieval, audio repsentation learning, motion repsentation learning, video frame interpolation

applications to computer vision, audio, language, and other modalities

8,8,8,10

4,5,5,5

3,4,4,4

2,4,4,4

3,4,4,4

SAM 2: Segment Anything in Images and Videos

60.67

computer vision, video segmentation, image segmentation

applications to computer vision, audio, language, and other modalities

8,8,10,10

4,4,4,5

3,4,4,4

3,3,4,4

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

60.17

LLM Benchmark, Data Science and Engineering, Code Generation, Text-to-SQL, LLM Agent

datasets and benchmarks

8,8,8,8

4,4,5,5

3,3,4,4

3,4,4,4

OLMoE: Open Mixture-of-Experts Language Models

59.88

large language models, mixture-of-experts, open-source

foundation or frontier models, including LLMs

8,8,10

2,3,5

4,4,4

3,4,4

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

58.50

Code Generation, Tool Use, Instruction Following, Benchmark

datasets and benchmarks

8,8,10,10

3,4,4,4

3,3,4,4

3,4,4,4

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

58.43

Diffusion Model, Avatar, Portrait Animation, Audio-Condition Video Generation

applications to computer vision, audio, language, and other modalities

8,8,8,8

4,4,4,5

4,4,4,4

3,3,4,4

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

57.75

Mechanistic Interpretability, Hallucinations, Language Models

interpretability and explainable AI

8,8,10,10

4,4,4,5

3,3,4,4

3,4,4,4

3,3,3,4

Latent Bayesian Optimization via Autoregressive Normalizing Flows

55.69

Bayesian optimization, normalizing flow

probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)

8,8,8,8

3,4,5,5

3,3,4,4

3,4,4,4

Simplifying, Stabilizing and Scaling Continuous-time Consistency Models

55.64

continuous-time consistency models, diffusion models, fast sampling

generative models

8,8,10,10,10

3,4,4,4,5

3,3,3,4,4

3,3,4,4,4

ShEPhERD: Diffusing shape, electrostatics, and pharmacophores for bioisosteric drug design

54.18

3D molecular generation, drug design, molecules

applications to physical sciences (physics, chemistry, biology, etc.)

6,8,10

4,5,5

2,3,4

3,4,4

Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues

53.37

State Tracking, state space, mamba, Linear RNN, Linear Attention, GLA, DeltaNet, Formal Languages

other topics in machine learning (i.e., none of the above)

8,8,8

3,4,4

Scaling and evaluating sparse autoencoders

53.31

interpretability, sparse autoencoders, superposition, scaling laws

interpretability and explainable AI

3,8,10,10,10

4,4,4,4,4

3,3,4,4,4

3,3,3,4,4

2,4,4,4,4

$\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics

52.76

audio classification, multi-label, dataset collection, bioacoustics

datasets and benchmarks

6,8,8,8

4,4,5,5

2,4,4,4

3,4,4,4

2,3,4,4

Title: Feasibility Study for an Automated Acoustic Monitoring Service for Avian Pest Detection in Agriculture using BirdSet-related Models

Abstract:

Avian pests cause significant agricultural damage. This project proposes developing a Minimum Viable Product (MVP) for an automated acoustic monitoring service to detect key pest bird species in agricultural settings, leveraging models potentially trainable on BirdSet data. We will identify target pests, simulate monitoring using relevant BirdSet subsets and/or field recordings, and evaluate detection accuracy using a benchmarked model. The goal is to demonstrate technical feasibility and potential value for precision agriculture.

Experiments:

Identify 3-5 key avian pest species for a specific agricultural region (e.g., vineyards) and curate relevant vocalization samples (potentially using XC subset from BirdSet).
Adapt/fine-tune a pre-trained model benchmarked in the BirdSet paper (e.g., EfficientNet) on the curated pest species data, potentially mixing with representative background noise.
Evaluate the MVP's detection accuracy (precision, recall) on a small, annotated test set simulating the target agricultural environment (using held-out XC data or field recordings if available).

Related Work:

This project applies the large-scale audio classification concepts and models benchmarked in the BirdSet paper to a specific commercial application: agricultural pest monitoring. While BirdSet provides general avian classification benchmarks, and works like [specific pest bird study] identify problem species, this project focuses on the feasibility of an automated acoustic detection service using similar deep learning techniques for targeted agricultural value.

Risks/Limits:

Difficulty obtaining representative, annotated recordings from specific agricultural environments for fine-tuning and testing.
Model generalization performance may vary significantly depending on background noise and acoustic conditions specific to farms.
Market acceptance and cost-effectiveness compared to traditional pest monitoring methods.
Requires integrating model output into a user-friendly reporting service (outside initial MVP scope).

Hypothesis:

An acoustic monitoring MVP, using a model architecture similar to those benchmarked on BirdSet and fine-tuned on relevant pest species, can achieve >75% precision and recall for detecting target avian pests in simulated agricultural audio environments.

Title: Comparing Avian Species Richness and Annotation Density Across Selected BirdSet Soundscape Datasets

Abstract:

Biodiversity varies significantly across different locations. The BirdSet paper introduces several soundscape datasets from diverse locations. This project aims to compare the documented avian species richness and recording density across different BirdSet soundscape test datasets. We will analyze the metadata provided with BirdSet to extract species counts and annotation numbers for datasets like NES, HSN, and SSW. The goal is to visualize and compare these metrics to understand how biodiversity representation differs across these locations within the benchmark.

Experiments:

Extract the list of unique bird species (classes) documented for the NES, HSN, and SSW test datasets from the BirdSet paper (Table 2) or metadata.
Extract the total number of annotations provided for the NES, HSN, and SSW test datasets from the BirdSet paper (Table 2) or metadata.
Create bar charts comparing the number of unique species and total annotations across the three selected datasets.

Related Work:

This project directly explores the data presented in the BirdSet paper, which provides a large-scale dataset for avian bioacoustics. While the paper focuses on ML benchmarking, this project uses the dataset's metadata to investigate basic ecological concepts like species richness comparison across different geographical locations represented (e.g., Costa Rica/Colombia (NES) vs. Sierra Nevada (HSN) vs. Northeast US (SSW)), similar to introductory biodiversity studies.

Risks/Limits:

Accessing and correctly interpreting specific numbers from the paper's tables or metadata might require guidance.
Analysis is based solely on the documented species and annotations, not actual audio content analysis.
Limited to the specific datasets selected for comparison within BirdSet.
Does not account for potential sampling bias in the original data collection.

Hypothesis:

The species richness (number of unique bird species documented) in the Neotropical dataset (NES) is significantly higher than in the Sierra Nevada dataset (HSN) as represented in the BirdSet benchmark collection.

Title: Ethical Challenges and Responsible Use Guidelines for Large-Scale Avian Bioacoustic Monitoring Systems

Abstract:

Large-scale bioacoustic datasets like BirdSet enable powerful environmental monitoring but also raise ethical considerations regarding data ownership, privacy (incidental human sounds), and potential misuse. This project will investigate the perceived ethical challenges and responsible use guidelines for deploying automated acoustic monitoring systems trained on such datasets. We will conduct a literature review of bioacoustic ethics and potentially survey conservation practitioners or citizen scientists. The goal is to outline key ethical concerns and propose best practices for responsible deployment.

Experiments:

Conduct a literature review on ethical considerations in bioacoustics, passive acoustic monitoring, and large ecological datasets.
Identify and categorize potential ethical issues related to BirdSet-like datasets and their applications (e.g., data sovereignty, incidental recordings, accessibility).
Develop a short concept survey or interview guide to gauge awareness and concern regarding these ethical issues among potential users (e.g., researchers, conservationists).

Related Work:

This project examines the ethical dimensions inspired by the large-scale data collection and potential applications discussed in the BirdSet paper. While BirdSet focuses on the technical aspects of the dataset and benchmarks, this project relates to broader discussions on data ethics (e.g., Gebru et al.'s Datasheets for Datasets) and specific concerns in conservation technology ethics, exploring the societal implications of deploying powerful monitoring tools.

Risks/Limits:

Difficulty in accessing a diverse group of stakeholders (researchers, citizen scientists, policymakers) for surveys or interviews.
Ethical perceptions can be subjective and vary culturally.
Limited scope, focusing on identifying issues rather than solving them.
Relies on secondary sources and potential user opinions, not technical system audits.

Hypothesis:

Conservation practitioners perceive data sovereignty and the potential for misuse of species location data derived from BirdSet-like systems as more significant ethical concerns than incidental recording of human activity.

Title: Linking Avian Acoustic Activity Patterns from the BirdSet Benchmark to Ecological Habitat Variables

Abstract:

Passive acoustic monitoring offers a non-invasive way to study avian populations and their response to environmental changes. This project investigates the relationship between acoustic activity patterns of specific ecological indicator bird species and habitat characteristics within one of the BirdSet soundscape locations (e.g., NES coffee farms). Using BirdSet data and potentially pre-trained models, we will estimate species-specific vocalization rates and correlate them with external data on habitat structure or land management intensity. The aim is to assess the utility of BirdSet-derived acoustic indices for ecological monitoring.

Experiments:

Select 3-5 ecological indicator species relevant to the chosen BirdSet test location (e.g., NES) and obtain corresponding habitat/management data for recording sites (requires external data sourcing).
Apply a suitable pre-trained model (e.g., Perch, or a model trained on BirdSet's XCL) to the chosen test dataset (e.g., NES) to generate estimates of vocalization presence probability or frequency for the indicator species per segment/site.
Perform statistical analysis (e.g., correlation, regression) to investigate the relationship between the derived acoustic activity indices and the sourced habitat/management variables.

Related Work:

This project leverages the BirdSet dataset and benchmarked models as tools for ecological research, linking bioacoustics to conservation biology. It builds on the BirdSet paper's provision of annotated soundscapes and relates to ecological studies using indicator species and acoustic indices to assess habitat quality or environmental change, providing a novel application of the large-scale acoustic resource.

Risks/Limits:

Availability, quality, and spatial resolution of external habitat/management data corresponding to BirdSet recording locations.
Accuracy limitations of the acoustic models in estimating true vocalization rates, especially for rare or quiet species.
Confounding factors (e.g., weather, time of day) affecting acoustic activity need to be considered or controlled for in the analysis.
Focuses on correlation, establishing causality requires further ecological study.
Requires expertise in both ecology and data analysis.

Hypothesis:

The estimated acoustic activity level (e.g., average detection probability per hour) of forest-dependent indicator species within the BirdSet NES dataset is positively correlated (R > 0.3) with increased shade tree cover documented at the recording sites.

Title: Improving Cross-Domain Generalization in Avian Bioacoustics using Domain Adversarial Training on the BirdSet Benchmark

Abstract:

Models trained on curated focal recordings (like BirdSet's XC training data) often suffer performance drops when deployed on real-world soundscapes due to covariate and domain shifts. This project investigates the effectiveness of Domain Adversarial Training (DAT) in mitigating this domain shift for avian bioacoustic classification within the BirdSet benchmark. We will implement a DAT framework, training a model to learn features invariant between the XC focal domain and a target PAM soundscape domain (e.g., SSW). We hypothesize that DAT will significantly improve cross-domain generalization compared to standard supervised training.

Experiments:

Implement a Domain Adversarial Training framework using a standard audio model architecture (e.g., ConvNext from the paper) with a gradient reversal layer, suitable for the BirdSet multi-label task.
Train the model adversarially using BirdSet's XCL as the source domain and unlabeled segments from a target soundscape dataset's training split (e.g., SSW training split) as the target domain.
Compare the model's performance (cmAP, AUROC) on the target soundscape test set (SSW) against a baseline model trained only on XCL using standard supervised learning.

Related Work:

This project directly addresses the domain shift challenge discussed in the BirdSet paper (Sec 2.3). It builds upon the BirdSet benchmark by applying domain adaptation techniques, specifically Domain Adversarial Neural Networks (DANN), which have shown success in other domains like computer vision but are less explored in large-scale bioacoustics benchmarks like BirdSet. The novelty lies in rigorously evaluating DAT's effectiveness for bridging the focal-to-soundscape gap in this context.

Risks/Limits:

Domain Adversarial Training can be unstable and sensitive to hyperparameter tuning (e.g., adaptation weight).
Defining appropriate target domain data for unsupervised adaptation (requires careful selection from soundscape datasets).
Performance improvements might be marginal if domain features are highly entangled with class features.
Requires significant computational resources for training multiple iterations of large models.

Hypothesis:

Applying Domain Adversarial Training between BirdSet's XCL (source) and SSW (target) domains will yield a >10% relative improvement in cmAP on the SSW test set compared to a identically architected model trained only supervisedly on XCL.

Title: Evaluating the Impact of Audio Data Augmentation on Avian Bioacoustic Classification using the BirdSet Benchmark

Abstract:

Data augmentation is crucial for improving model robustness, especially when training and testing distributions differ, as noted in the BirdSet paper (focal vs. soundscape). This project investigates the impact of common audio data augmentation techniques on avian bioacoustic classification performance using the BirdSet benchmark. We will train a standard Convolutional Neural Network (CNN) on a subset of BirdSet's training data (XCM) with and without specific augmentations (e.g., noise mixing, pitch shifting). Performance will be evaluated on a soundscape test set (e.g., NES) to quantify the effect of augmentation.

Experiments:

Implement two audio augmentation techniques suitable for bird sounds (e.g., mixing with background noise from BirdSet's VOX subset, applying pitch shifts within typical avian frequency ranges) using Python libraries like `torchaudio` or `audiomentations`.
Train a baseline CNN model (e.g., EfficientNet-B0 or a similar standard architecture) on the XCM subset of BirdSet without augmentations.
Train the same CNN architecture on the XCM subset, applying the implemented augmentations during training.
Compare the multi-label classification performance (AUROC, cmAP) of the baseline and augmented models on the NES test dataset.

Related Work:

This project builds directly on the BirdSet paper by using its datasets (XCM, NES, VOX) and addressing the challenge of covariate shift between focal training data and soundscape test data. While the paper benchmarks models, it doesn't systematically evaluate various data augmentation strategies. This project fills that gap by applying standard augmentation techniques, discussed in general audio ML literature, specifically to the BirdSet context to improve model generalization.

Risks/Limits:

Requires access to a GPU for feasible training times, even on the XCM subset.
Choice of augmentation parameters (e.g., noise level, pitch shift range) can significantly impact results.
Limited to evaluating only a few augmentation types and one model architecture.
Performance improvements might be modest depending on the chosen augmentations and subset.

Hypothesis:

Applying a combination of background noise mixing (using BirdSet's VOX data) and moderate pitch shifting during training will increase the AUROC score by at least 3% on the NES test set compared to a baseline CNN model trained without these augmentations on the BirdSet XCM subset.

On Scaling Up 3D Gaussian Splatting Training

52.61

Gaussian Splatting, Machine Learning System, Distributed Training

infrastructure, software libraries, hardware, systems, etc.

8,8,8,8

4,4,5,5

3,3,3,4

2,3,4,4

3,4,4,4

Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?

52.59

llms, translation, low-resource, grammar, long-context, linguistics

foundation or frontier models, including LLMs

6,8,8

4,4,4

2,3,4

When Attention Sink Emerges in Language Models: An Empirical View

52.59

Attention Sink, Language Models, Empirical Study

foundation or frontier models, including LLMs

6,8,8

4,4,5

4,4,4

3,3,4

MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions

52.54

LLM evaluation, multi-table question answering; multi-hop question answering

datasets and benchmarks

8,8,8

4,4,5

3,4,4

3,3,4

LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization

50.55

optimization, LoRA

optimization

8,8,10

3,3,4

4,4,4

3,3,4

Transformers Provably Solve Parity Efficiently with Chain of Thought

50.41

transformers, chain of thought, parity, self-consistency

learning theory

8,8,10

3,3,4

3,4,4

3,3,4