
LLM Researcher
- San Sebastián, Guipúzcoa
- Permanent
- Full-time
With 180+ employees and growing, our team is fully multicultural and international. We deliver hyper-efficient software for companies seeking a competitive edge through quantum computing and artificial intelligence.
Our flagship products, CompactifAI and Singularity, address critical needs across various industries:
- CompactifAI is a groundbreaking compression tool for foundational AI models based on Tensor Networks. It enables the compression of large AI systems—such as language models—to make them significantly more efficient and portable.
- Singularity is a quantum- and quantum-inspired optimization platform used by blue-chip companies to solve complex problems in finance, energy, manufacturing, and beyond. It integrates seamlessly with existing systems and delivers immediate performance gains on classical and quantum hardware.
We’re committed to building a truly inclusive culture. Come and join us.
As a Senior LLM Researcher, you will:
- Design and implement strategies for creating, sourcing, and augmenting datasets tailored for LLM training and fine-tuning.
- Develop scalable pipelines to collect, clean, filter, annotate, and validate large volumes of text data.
- Conduct data audits to ensure quality, diversity, ethical compliance, and bias mitigation.
- Collaborate with ML engineers and researchers to align datasets with training objectives and model evaluation needs.
- Use techniques such as active learning, synthetic data generation, and self-supervised learning to maximize dataset efficiency.
- Leverage human-in-the-loop (HITL) workflows for data labeling and validation where necessary.
- Contribute to building data documentation and metadata standards (e.g., Datasheets for Datasets).
- Keep up to date with research trends in dataset curation, LLM pretraining data, and benchmarking.
What we are looking for:
- Bachelor’s, Master’s, or Ph.D. in Computer Science, AI, Data Science, or a related field.
- 3+ years of experience in data science, machine learning, or related roles, with demonstrated experience in dataset creation for NLP or LLMs.
- In-depth knowledge of the LLM lifecycle: pretraining, fine-tuning, alignment, and evaluation.
- Proficiency in Python and data tooling ecosystems (Pandas, NumPy, spaCy, Hugging Face Datasets & Transformers).
- Hands-on experience with text data collection from diverse sources: web scraping, APIs, proprietary corpora, etc.
- Strong understanding of data quality metrics including bias detection, toxicity, and readability.
- Experience working with annotation tools (e.g., Prodigy, Label Studio) and managing annotation teams or workflows.
- Experience building or contributing to datasets used in LLM pretraining or supervised fine-tuning.
- Familiarity with RLHF workflows and alignment techniques (e.g., preference modeling, reward modeling).
- Exposure to multilingual and low-resource language datasets.
- Contributions to open-source datasets, tools, or publications in dataset-centric research.
- Knowledge of ethical AI, data governance, privacy laws (e.g., GDPR), and responsible data use.
What we offer:
- Indefinite contract.
- Equal pay guaranteed.
- Variable performance bonus.
- Signing bonus.
- Work visa sponsorship (if applicable).
- Relocation package (if applicable).
- Private health insurance.
- Eligibility for educational budget according to internal policy.
- Hybrid working option.
- Flexible working hours.
- Language classes and discounted lunch options.
- A fast-paced environment working with cutting-edge technologies.
- Career plan. Opportunity to learn and teach.
- Progressive company. Happy people culture.