Research

Projects

  • Danish Foundation Models


    A project spanning multiple institutions and Danish universities, dedicated to developing state-of-the-art Danish language technology.

    foundationmodels.dk

  • Lex LLM


    We are working with the Danish National Lexicon to develop a chatbot that will help users effectively navigate the vast amount of curated knowledge on their website.

    lex.dk

  • European City\(^2\)


    AarhusNLP is funded by the Horizon Europe project EuropeanCity2, in which it leads work package 5. The project explores agent-based models of democracy in order to provide viable alternatives in response to the challenges that European democracies are facing. AarhusNLP creates synthetic preference data and uses LLMs to improve simulation realism and complexity.

    eurocity2.eu

  • TEXT


    We are part of TEXT, a Centre of Excellence funded by the Danish National Research Foundation that investigates text culture. AarhusNLP studies how variation in preferences and cultural context introduces hard problems for artificial intelligence.

    arts.au.dk/en/text

Publication Highlights

  • MMTEB - Massive Multilingual Text Embedding Benchmark

    Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M. ...


    A large-scale multilingual expansion of MTEB, driven mainly by highly curated community contributions covering 250+ languages.

    Leaderboard
    Code
    Paper

  • The Scandinavian Embedding Benchmarks: Evaluating Multilingual and Monolingual Text Embedding for Scandinavian languages

    Enevoldsen, K., Kardos, M., Muennighoff, N. & Nielbo, K. L.


    SEB is a comprehensive framework that enables text embedding evaluation for Scandinavian languages across 24 tasks, 10 subtasks, and 4 task categories.

    Leaderboard
    Code
    Paper

  • \(S^3\) - Semantic Signal Separation

    Kardos, M., Kostkan, J., Vermillet, A.-Q., Nielbo, K., Enevoldsen, K. & Rocca, R.


    A theory-driven topic modeling approach in neural embedding spaces, which conceptualizes topics as independent axes of semantic space.

    Code
    Paper

  • topicwizard - a Modern, Model-agnostic Framework for Topic Model Visualization and Interpretation

    Kardos, M., Enevoldsen, K. & Nielbo, K.


    A framework for model-agnostic topic model interpretation that provides intuitive, interactive tools to help users examine the complex semantic relations between documents, words and topics learned by topic models.

    Code
    Paper

All Publications


  1. Kardos, M., Kostkan, J., Vermillet, A.-Q., Nielbo, K., Enevoldsen, K., & Rocca, R. (2025). \(S^3\) - semantic signal separation. Retrieved from https://arxiv.org/abs/2406.09556

  2. Kardos, M., Enevoldsen, K. C., & Nielbo, K. L. (2025). Topicwizard - a modern, model-agnostic framework for topic model visualization and interpretation. Retrieved from https://arxiv.org/abs/2505.13034

  3. Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., Mathur, A., Stap, D., ... Muennighoff, N. (2025). MMTEB: Massive multilingual text embedding benchmark. The Thirteenth International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=zl3pfz4VCV 

  4. Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian embedding benchmarks: Comprehensive assessment of multilingual and monolingual text embedding. The Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Retrieved from https://openreview.net/forum?id=2WbuKAfOxP

  5. Kristensen-McLachlan, R. D., Canavan, M., Kardos, M., Jacobsen, M., & Aarøe, L. (2025). Are chatbots reliable text annotators? Sometimes. PNAS Nexus, 4(4), pgaf069. https://doi.org/10.1093/pnasnexus/pgaf069

  6. Kristensen-McLachlan, R. D., Hicke, R. M. M., Kardos, M., & Thunø, M. (2024). Context is key(NMF): Modelling topical information dynamics in Chinese diaspora media. In W. Haverals, M. Koolen, & L. Thompson (Eds.), Proceedings of the computational humanities research conference 2024 (pp. 829–847). Germany: CEUR-WS.

  7. Xiao, C., Chung, I., Kerboua, I., Stirling, J., Zhang, X., Kardos, M., ... Muennighoff, N. (2025). MIEB: Massive image embedding benchmark. Retrieved from https://arxiv.org/abs/2504.10471 

  8. Kostkan, J., Kardos, M., Mortensen, J. P. B., & Nielbo, K. L. (2023). OdyCy - a general-purpose NLP pipeline for Ancient Greek. In S. Degaetano-Ortlieb, A. Kazantseva, N. Reiter, & S. Szpakowicz (Eds.), Proceedings of the 7th joint SIGHUM workshop on computational linguistics for cultural heritage, social sciences, humanities and literature (pp. 128–134). Dubrovnik, Croatia: Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.latechclfl-1.14

  9. Feldkamp, P., Lassche, A., Kostkan, J., Kardos, M., Enevoldsen, K., Baunvig, K., & Nielbo, K. (2024). Canonical status and literary influence: A comparative study of Danish novels from the modern breakthrough (1870–1900). In M. Hämäläinen, E. Öhman, S. Miyagawa, K. Alnajjar, & Y. Bizzoni (Eds.), Proceedings of the 4th international conference on natural language processing for digital humanities (pp. 140–155). Miami, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.nlp4dh-1.14