Research

Projects

Danish Foundation Models

A project spanning multiple institutions and Danish universities, dedicated to developing state-of-the-art Danish language technology.

foundationmodels.dk

Lex LLM

We are working with the Danish National Lexicon to develop a chatbot that will help users effectively navigate the vast amounts of curated knowledge on its website.

lex.dk

European City\(^2\)

AarhusNLP is funded by the Horizon Europe project EuropeanCity2, where it leads work package 5. The project explores agent-based models of democracy in order to offer viable alternatives in response to the challenges that European democracies are facing. AarhusNLP creates synthetic preference data and uses LLMs to improve simulation realism and complexity.

eurocity2.eu

TEXT

We are part of TEXT, a Centre of Excellence funded by the Danish National Research Foundation that investigates text culture. AarhusNLP studies how variation in preferences and cultural context introduces hard problems for artificial intelligence.

arts.au.dk/en/text

Publication Highlights

MMTEB - Massive Multilingual Text Embedding Benchmark

Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M. ...

A large-scale multilingual expansion of MTEB, driven mainly by highly curated community contributions covering 250+ languages.

Leaderboard
Code
Paper

The Scandinavian Embedding Benchmarks: Evaluating Multilingual and Monolingual Text Embedding for Scandinavian languages

Enevoldsen, K., Kardos, M., Muennighoff, N. & Nielbo, K. L.

SEB is a comprehensive framework that enables text embedding evaluation for Scandinavian languages across 24 tasks, 10 subtasks, and 4 task categories.

Leaderboard
Code
Paper

\(S^3\) - Semantic Signal Separation

Kardos, M., Kostkan, J., Vermillet, A.-Q., Nielbo, K., Enevoldsen, K. & Rocca, R.

A theory-driven topic modeling approach in neural embedding spaces, which conceptualizes topics as independent axes of semantic space.

Code
Paper

topicwizard - a Modern, Model-agnostic Framework for Topic Model Visualization and Interpretation

Kardos, M., Enevoldsen, K. & Nielbo, K.

A framework for model-agnostic topic model interpretation that provides intuitive, interactive tools to help users examine the complex semantic relations between documents, words, and topics learned by topic models.

Code
Paper

All Publications

Kardos, M., Kostkan, J., Vermillet, A.-Q., Nielbo, K., Enevoldsen, K., & Rocca, R. (2025). \(S^3\) - Semantic signal separation. Retrieved from https://arxiv.org/abs/2406.09556
Kardos, M., Enevoldsen, K. C., & Nielbo, K. L. (2025). Topicwizard - a modern, model-agnostic framework for topic model visualization and interpretation. Retrieved from https://arxiv.org/abs/2505.13034
Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., Mathur, A., Stap, D., ... Muennighoff, N. (2025). MMTEB: Massive multilingual text embedding benchmark. The Thirteenth International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=zl3pfz4VCV
Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian embedding benchmarks: Comprehensive assessment of multilingual and monolingual text embedding. The Thirty-Eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Retrieved from https://openreview.net/forum?id=2WbuKAfOxP
Kristensen-McLachlan, R. D., Canavan, M., Kardos, M., Jacobsen, M., & Aarøe, L. (2025). Are chatbots reliable text annotators? Sometimes. PNAS Nexus, 4(4), pgaf069. https://doi.org/10.1093/pnasnexus/pgaf069
Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024). Context is key(NMF): Modelling topical information dynamics in Chinese diaspora media. In W. Haverals, M. Koolen, & L. Thompson (Eds.), Proceedings of the computational humanities research conference 2024 (pp. 829–847). Germany: CEUR-WS.
Xiao, C., Chung, I., Kerboua, I., Stirling, J., Zhang, X., Kardos, M., ... Muennighoff, N. (2025). MIEB: Massive image embedding benchmark. Retrieved from https://arxiv.org/abs/2504.10471
Kostkan, J., Kardos, M., Mortensen, J. P. B., & Nielbo, K. L. (2023). OdyCy - a general-purpose NLP pipeline for Ancient Greek. In S. Degaetano-Ortlieb, A. Kazantseva, N. Reiter, & S. Szpakowicz (Eds.), Proceedings of the 7th joint SIGHUM workshop on computational linguistics for cultural heritage, social sciences, humanities and literature (pp. 128–134). Dubrovnik, Croatia: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.latechclfl-1.14
Feldkamp, P., Lassche, A., Kostkan, J., Kardos, M., Enevoldsen, K., Baunvig, K., & Nielbo, K. (2024). Canonical status and literary influence: A comparative study of Danish novels from the modern breakthrough (1870–1900). In M. Hämäläinen, E. Öhman, S. Miyagawa, K. Alnajjar, & Y. Bizzoni (Eds.), Proceedings of the 4th international conference on natural language processing for digital humanities (pp. 140–155). Miami, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.nlp4dh-1.14