Search Engines:
11-442 / 11-642

HW4: Reranking with BERT Design Issues

Software Architecture
helloBert
Document Representation
Score Aggregation
Development Sequence

Software Architecture

Reranking with BERT requires developing another reranker. The BERT reranker is used in longer pipelines than your LTR reranker (e.g., reranker_1, reranker_2, reranker_3), but that should not cause problems if your software just loops over a list of rerankers.

The BERT reranker architecture is approximately as follows.

  # initialization
  read parameters
  initialize the BERT tokenizer and BERT classifier

  # reranking for query q
  for each document d:
      fetch the body string from the index
      if required, fetch the title string from the index
      build a list of passages for d (see below)
      for each passage p:
          use BERT to encode the (q, p) pair
          use BERT to calculate a score for the (q, p) pair
      aggregate passage scores into a document score (see below)
  rerank the top n documents by the new scores

HW2 only tested your software with a single reranker, which may have permitted it to make assumptions that cause problems in HW4. HW4 will test your system with longer pipelines, e.g.,

BM25 → LTR → BertRr
BM25 → BertRr → LTR
BM25 → BertRr → BertRr

This shouldn't be a problem for most people. Your software should just loop over n rerankers, passing the ranking along to the next ranker in the pipeline. Don't make assumptions about the number of rerankers.
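The loop described above can be sketched as follows. This is a minimal illustration, not the reference design: the `Ranking` and `Reranker` types and the `run_pipeline` function are hypothetical names chosen for this example, and your own reranker interface may look different.

```python
from typing import Callable, List, Tuple

# A ranking is a list of (external docid, score) pairs, best first.
Ranking = List[Tuple[str, float]]

# Hypothetical reranker interface: takes a query and a ranking,
# returns a new ranking.
Reranker = Callable[[str, Ranking], Ranking]


def run_pipeline(query: str, initial_ranking: Ranking,
                 rerankers: List[Reranker]) -> Ranking:
    """Pass the ranking through each reranker in order.

    Makes no assumption about how many rerankers there are;
    an empty list simply returns the initial ranking.
    """
    ranking = initial_ranking
    for reranker in rerankers:
        ranking = reranker(query, ranking)
    return ranking
```

Because the loop only depends on the list of rerankers, BM25 → LTR → BertRr and BM25 → BertRr → BertRr are handled by the same code path.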

helloBert

helloBert is a simple example program that shows how to get a document's title or body string from the index, how to encode the (query, document) pair, and how to generate a BERT score for the pair. It demonstrates how to interact with the transformers library and BERT models used in this assignment. Your software will be a little more complex than helloBert.

helloBert has hardcoded parameters for the locations of the ClueWeb09 index and BERT model. They probably match your configuration; if they do not, they are easy to change. You should be able to run helloBert using your existing Anaconda environment.

Document Representation

BERT has its own lexical processing, so document text is represented as strings, not index terms. index-cw09 contains string versions of each document field. For example:
         Idx.getAttribute("title-string", 369903)

BERT transforms an input string into a sequence of WordPiece tokens. Often, a text token is transformed into more than one WordPiece token. For example, "surfboarding" produces ['surf', '##board', '##ing'].

BERT has high computational costs, so the length of input sequences is controlled. There are two limits on sequence length. The first limits the number of WordPiece tokens in a sequence to control computational costs; we call this BERT's maximum sequence length. The second limits the number of text tokens in a passage; we call this the passage length.

Note: In your experiments, BERT's maximum sequence length is always 512 WordPiece tokens. The passage length will vary in different experiments. If a passage of length n text tokens produces more than 512 WordPiece tokens, BERT truncates it to 512 WordPiece tokens. In this corpus, a single text token produces about 1.5 WordPiece tokens, on average. Thus, sequences of more than about 340 text tokens are likely to be truncated by BERT.
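The arithmetic behind the "about 340" figure can be checked with a few lines of Python. This is only a rough estimate under the stated average; the function name is made up for this example, and the estimate ignores the query tokens and BERT's special tokens, which also count toward the 512-token limit.

```python
MAX_SEQ_LEN = 512            # BERT's maximum sequence length (WordPiece tokens)
WORDPIECES_PER_TOKEN = 1.5   # corpus average stated above


def max_untruncated_text_tokens(max_seq_len: int = MAX_SEQ_LEN,
                                ratio: float = WORDPIECES_PER_TOKEN) -> int:
    """Estimate how many text tokens fit before truncation is likely."""
    return int(max_seq_len / ratio)


print(max_untruncated_text_tokens())   # 341
```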

The reference code uses a simple approach to tokenizing the text.
         token_list = some_text.split()

If a text is longer than the passage length, your search engine must divide it into passages. To avoid splitting a relevant chunk of text between two passages, overlapping passages are created. The distance between the start of passage_i and passage_i+1 is called the stride. For example, given an original text, a passage length of 5, and a stride of 3:
         Original text: a b c d e f g h i j k
         Passage₁: a b c d e
         Passage₂: d e f g h
         Passage₃: g h i j k
Edge case alert: Note that a Passage₄ consisting of "j k" was not created, even though it would begin stride tokens after the start of Passage₃. A new passage is only created if its contents would not be covered by the previous passage. "j k" is fully covered by Passage₃, so a Passage₄ is not required.

The bertrr:psgLen and bertrr:psgStride parameters indicate how to form passages. The bertrr:psgCnt parameter indicates the maximum number of passages formed for a document. The search engine may truncate the document to control computational costs.
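One way to form overlapping passages, including the edge case above, is sketched below. The function and argument names are hypothetical (they stand in for the bertrr:psgLen, bertrr:psgStride, and bertrr:psgCnt parameters), and this is not the reference implementation; compare your output against the .topPsg files.

```python
from typing import List, Optional


def make_passages(tokens: List[str], psg_len: int, stride: int,
                  max_passages: Optional[int] = None) -> List[List[str]]:
    """Split a token list into overlapping passages.

    A new passage is started only if the previous passage did not
    already reach the end of the text.
    """
    if psg_len <= 0 or stride <= 0:
        raise ValueError("psg_len and stride must be positive")

    passages: List[List[str]] = []
    start = 0
    while start < len(tokens):
        # Stop early if the passage limit has been reached.
        if max_passages is not None and len(passages) >= max_passages:
            break
        passages.append(tokens[start:start + psg_len])
        # If this passage reached the end of the text, any later
        # passage would be fully covered by it, so stop.
        if start + psg_len >= len(tokens):
            break
        start += stride
    return passages


tokens = "a b c d e f g h i j k".split()
for p in make_passages(tokens, psg_len=5, stride=3):
    print(" ".join(p))
# a b c d e
# d e f g h
# g h i j k
```

If the text had one more token ("l"), the loop would produce a fourth passage "j k l", because "l" is not covered by the third passage.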

Passages may be formed from the body field alone, or by concatenating the title with text from the body field (with a space between). The bertrr:maxTitleLength parameter indicates whether and how to use the title field. If the value is 0, don't use the title. Otherwise, the value indicates the maximum number of title tokens. Note: The bertrr:psgLen parameter applies (only) to text from the body field (i.e., it does not include the title).

Although your search engine does passage formation at the token level, the input to the BERT tokenizer (AutoTokenizer) is a passage string, not a list of passage tokens. Your software must combine the passage tokens into a whitespace-separated string before passing it to BERT.
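The sketch below shows one way to turn body passages (lists of tokens) into the passage strings that are passed to the tokenizer. The function name is hypothetical. It assumes that the title is truncated to its first max_title_len tokens and placed before the body text of every passage; that detail is an assumption of this example, so check it against the .topPsg files.

```python
from typing import List


def build_passage_strings(title: str, body_passages: List[List[str]],
                          max_title_len: int) -> List[str]:
    """Join passage tokens into whitespace-separated strings.

    max_title_len == 0 means the title is not used.
    Assumption: the (truncated) title is prepended to each passage.
    """
    title_tokens = title.split()[:max_title_len] if max_title_len > 0 else []
    return [" ".join(title_tokens + psg) for psg in body_passages]


body = [["a", "b", "c"], ["c", "d", "e"]]
print(build_passage_strings("A Long Page Title", body, max_title_len=2))
# ['A Long a b c', 'A Long c d e']
print(build_passage_strings("A Long Page Title", body, max_title_len=0))
# ['a b c', 'c d e']
```

Note that the title tokens are added after the body passages are formed, so the passage length still applies only to body text.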

The training cases all have .topPsg files that show you the passages produced by the reference system. Each line of the file contains information about one passage. The line begins with the passage id (external_docid.passage_id) followed by the sequence of text tokens. For example, the first line of HW4-Train-0.topPsg begins as shown below.
         clueweb09-en0006-02-32959.0 Compare Airline Fees & Flight Prices. Find ...
Within a document, passages are numbered sequentially beginning at 0.
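When debugging passage formation, it can help to load a .topPsg file and compare it with your own passages. The parser below is a small sketch with a hypothetical function name; it assumes that the passage id is separated from the tokens by whitespace and that the passage number follows the last "." in the id, as in the example line above.

```python
from typing import List, Tuple


def parse_top_psg_line(line: str) -> Tuple[str, int, List[str]]:
    """Split one .topPsg line into (external docid, passage number, tokens)."""
    parts = line.strip().split(None, 1)        # id, then the rest of the line
    psg_id = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    # The passage number follows the last '.' of the passage id.
    ext_docid, _, psg_num = psg_id.rpartition(".")
    return ext_docid, int(psg_num), text.split()


docid, num, toks = parse_top_psg_line(
    "clueweb09-en0006-02-32959.0 Compare Airline Fees & Flight Prices.")
print(docid, num, toks[:3])
# clueweb09-en0006-02-32959 0 ['Compare', 'Airline', 'Fees']
```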

Score Aggregation

If a document is divided into passages, a score aggregation strategy determines how to convert passage scores into document scores. Your HW4 system must support three score aggregation strategies: {firstp, avgp, maxp}, which use the {first, average, maximum} of the passage scores, respectively. The bertrr:scoreAggregation parameter indicates which aggregation method to use.
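The three strategies can be sketched in a few lines. The function name is hypothetical, and the error handling for an empty score list is a choice made for this example rather than required behavior.

```python
from typing import List


def aggregate_scores(passage_scores: List[float], method: str) -> float:
    """Combine passage scores into one document score.

    method is one of 'firstp', 'avgp', or 'maxp'.
    """
    if not passage_scores:
        raise ValueError("a document must have at least one passage score")
    if method == "firstp":
        return passage_scores[0]
    if method == "avgp":
        return sum(passage_scores) / len(passage_scores)
    if method == "maxp":
        return max(passage_scores)
    raise ValueError("unknown score aggregation method: " + method)


scores = [0.2, 0.9, 0.4]
print(aggregate_scores(scores, "firstp"))   # 0.2
print(aggregate_scores(scores, "maxp"))     # 0.9
```

With firstp, only the first passage affects the document score, which is why the firstp test cases are the easiest to debug first.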

Development Sequence

Start by examining and running helloBert.py. This step confirms that your environment and BERT models work properly.

Next, develop the configurations that use .inRank files, to avoid problems with your initial ranker. When the .inRank configurations work properly, focus on the test cases that use your initial ranking. If your HW1 system worked properly, there shouldn't be much work here.

Start with the simplest BERT reranking configuration. Focus on test cases that use firstp score aggregation and only the body field, because this avoids problems in score aggregation and some problems in passage formation. Then, perhaps do firstp with title+body. After that, maxp. Then, avgp.

Start with only one reranker (reranker_1). Once that works, test your system on pipelines that have two or more rerankers.

FAQ

If you have questions not answered here, see the HW4 FAQ and the Homework Testing FAQ.

Jamie Callan