Datasets and Resources
Topics for iKAT Year 3 (2025)
| File | Description | 
|---|---|
| 2025_test_topics.json | Test topics in JSON format. | 
Additional Data
We also provide additional data from iKAT Years 1 and 2 and from TREC CAsT 2019-2022.
TREC iKAT ClueWeb22-B Passage Collection
We provide a segmented version of the TREC iKAT ClueWeb22-B document collection, available from CMU, in two formats: JSONL and TrecWeb.
If you have segmented the document collection yourself, you can check whether your segments match ours using the provided TSV file of passage hashes.
- Passage collection in JSONL format.
  - These files contain passages in the form `{"id": "[passage id]", "contents": "[passage text]", "url": "[ClueWeb22 document URL]"}`
  - Passage IDs are structured as `doc_id:passage_number`.
  - Total download size is approximately 31 GB.
  - Total number of passages is 116,838,987.
- Passage collection in TrecWeb format.
  - Format as shown on the website.
  - Total download size is approximately 31 GB.
- Passage hashes.
  - A `.tsv` file containing MD5 hashes of passage texts.
  - Each line has the format `doc_id passage_number passage_MD5`.
  - Total download size is 2.2 GB.
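If you segmented the collection yourself, the hash comparison described above can be sketched as follows. This is a minimal sketch rather than official code: the file paths are placeholders, and it assumes whitespace-separated `doc_id passage_number passage_MD5` columns in the hash file, as described above.

```python
import hashlib
import json


def passage_md5(text: str) -> str:
    """MD5 hex digest of a passage's text, matching the hashes in the TSV."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def check_segments(jsonl_path: str, hashes_path: str) -> list:
    """Return IDs of passages whose text hash disagrees with the reference TSV."""
    # Load the reference hashes: doc_id, passage_number, passage_MD5 per line.
    reference = {}
    with open(hashes_path) as f:
        for line in f:
            doc_id, passage_number, md5 = line.split()
            reference[f"{doc_id}:{passage_number}"] = md5

    # Compare each of our passages against the reference hash for its ID.
    mismatches = []
    with open(jsonl_path) as f:
        for line in f:
            passage = json.loads(line)
            expected = reference.get(passage["id"])
            if expected is not None and passage_md5(passage["contents"]) != expected:
                mismatches.append(passage["id"])
    return mismatches
```

Note that holding all 116M reference hashes in one dict needs substantial RAM; for the full 2.2 GB file, streaming both inputs sorted by passage ID would scale better.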
Pre-built Indices
We provide the following indices to help participants get started.
| File | Description | Size (Approximate) | 
|---|---|---|
| Lucene passage index (`ikat_2023_passage_index.tar.bz2`) | Sparse Lucene index generated from the JSONL passage files above using Pyserini. Due to its overall size, the index is a single `.tar.bz2` archive split into sections for simpler downloading. Once downloaded, combine the sections in name order back into a single file before extracting: `cat ikat_2023_passage_index.tar.bz2.part* > ikat_2023_passage_index.tar.bz2`. | 150 GB |
| DeepCT index (`pt_deepct.tar.bz2`) | DeepCT index built using PyTerrier. | 70 GB |
| SPLADE index (`pt_splade.tar.bz2`) | SPLADE++ index built using PyTerrier. | 138 GB |
| SPLADE index (`splade_index.tar.bz2`) | SPLADE++ index built using the official SPLADE code. | 97 GB |
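Recombining the split archive sections is just byte-level concatenation in name order (the `cat` command shown in the table). If you would rather do it from Python, for example on a platform without `cat`, a sketch:

```python
import glob
import shutil


def recombine(parts_glob: str, output_path: str) -> None:
    """Concatenate split archive sections in name order, equivalent to
    `cat <prefix>.part* > <output>`."""
    parts = sorted(glob.glob(parts_glob))  # name order, matching the part suffixes
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream each section onto the end
```

For example, `recombine("ikat_2023_passage_index.tar.bz2.part*", "ikat_2023_passage_index.tar.bz2")`, then extract the result with `tar -xjf`.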
Code to build and search indices
We have open-sourced the code used to build the indices above; you can find the code here.
How do I access these resources?
Each team should use the URL https://ikattrecweb.grill.science/<team_name> to access the files. The page will ask for a user ID and password; enter the login details you obtained from the iKAT organizers. You should then see a page listing each type of data, with links to the individual files described above along with their checksum files.
NOTE: Please do not share IPs in the 10.x.x.x range, which is reserved for private networks. We need a suitable public IP so that we can configure the above download link to work for you.
iKAT Searcher
iKAT Searcher is a simple tool developed to help with creating the topics for iKAT. It allows topic developers to visually assess the behaviour of a retrieval system, making it easier to develop challenging but interesting topics for the Track. You can interact with the system here. See the GitHub repository.
Run Validation
We provide code for run validation in our GitHub repository; see the associated README file for detailed instructions on how to run it. It is crucial to validate your submission files before submitting them: run files that fail validation will be discarded. We advise you to familiarize yourselves with the validation script as soon as possible, and to let us know if you have any questions or encounter any problems working with it.
Note: you need the MD5 hash file of the collection's passages to run the validation code; this file is available for download above.
Below is a summary of the checks that the script performs on a run file.
- Can the file be parsed as valid JSON?
- Can the JSON be parsed into the protocol buffer format defined in `protocol_buffers/run.proto`?
- Does the run file have at least one turn?
- Is the `run_name` field non-empty?
- Is the `run_type` field non-empty and set to `automatic` or `manual`?
- Does the number of turns in the run match the number in the test topics file?
- Does the number of turns in each topic in the run match the corresponding number in the test topics file?
- For each turn in the run:
  - Is the turn ID valid, and does it match an entry in the topics file?
  - Is any turn ID higher than expected for the selected topic (e.g. turns 1-5 in the topic, but a turn has ID 6 in the run file)?
  - (optional, enabled by default) Do all the `passage_provenance` passage IDs appear in the collection?
  - For each response in the turn:
    - Does it have a rank > 0?
    - Do the ranks increase with successive responses?
    - Does the response have a non-empty `text` field?
    - For each `passage_provenance` entry:
      - Does it have a score less than the previous entry's?
      - Does it have a passage ID containing a single colon and beginning with 'clueweb22-'?
    - Are there fewer than 1000 `passage_provenance` entries listed for the response?
    - Is there at least one `passage_provenance` entry with its `used` field set to True in the response?
    - Does the response have at least one `passage_provenance` entry?
    - Does the response have at least one `ptkb_provenance` entry?
    - For each `ptkb_provenance` entry:
      - Does it have a non-empty ID?
      - Does the ID appear in the `ptkb` field of the topic data?
      - Does the text given match that in the topic data?
      - Does it have a score less than the previous entry's?
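To make the per-response rules concrete, here is a rough sketch of a few of them in Python. This is not the official validator: the dictionary layout and field names (`rank`, `text`, `passage_provenance`, `id`, `score`, `used`, `ptkb_provenance`) are assumptions mirroring the protocol buffer fields named above.

```python
import re

# A valid passage ID: the 'clueweb22-' prefix, exactly one colon,
# and a numeric passage number (doc_id:passage_number).
PASSAGE_ID_RE = re.compile(r"clueweb22-[^:]+:\d+")


def check_response(response: dict) -> list:
    """Apply an illustrative subset of the per-response checks; return a
    list of human-readable error messages (empty if all checks pass)."""
    errors = []
    if response.get("rank", 0) <= 0:
        errors.append("rank must be > 0")
    if not response.get("text"):
        errors.append("text field is empty")

    provenance = response.get("passage_provenance", [])
    if not provenance:
        errors.append("no passage_provenance entries")
    if len(provenance) >= 1000:
        errors.append("1000 or more passage_provenance entries")
    if provenance and not any(p.get("used") for p in provenance):
        errors.append("no passage_provenance entry with used set to True")

    prev_score = None
    for p in provenance:
        if not PASSAGE_ID_RE.fullmatch(p.get("id", "")):
            errors.append("malformed passage ID: %r" % p.get("id"))
        if prev_score is not None and p["score"] >= prev_score:
            errors.append("passage_provenance scores must strictly decrease")
        prev_score = p["score"]

    if not response.get("ptkb_provenance"):
        errors.append("no ptkb_provenance entries")
    return errors
```

The real script additionally checks turn-level and file-level properties (turn counts, `run_name`, `run_type`, collection membership of passage IDs), so always run the official validator before submitting.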