Search Engines:
11-442 / 11-642 / 11-742
 
 

HW5: Reranking with BERT Design Issues

Software Architecture
helloBert
Document Representation
Score Aggregation
Development Sequence
FAQ

 

Software Architecture

Reranking with BERT requires developing yet another reranker. You are getting good at these by now. The BERT reranker will be used in longer pipelines than your LTR and Diversity rerankers (e.g., reranker_1, reranker_2, reranker_3), but that should not cause problems if your software just loops over a list of rerankers.

The BERT reranker architecture is approximately as follows.

  # initialization
  read parameters
  initialize the BERT tokenizer and BERT classifier

  # reranking for query q
  for each document d:
    fetch the body string from the index
    if required, fetch the title string from the index
    build a list of passages (see below)
    for each passage p:
      use BERT to encode the (q, p) pair
      use BERT to calculate a score for the (q, p) pair
    aggregate passage scores into a document score (see below)
  rerank the top n documents by the new scores

HW3 only tested your software with a single reranker, which may have allowed it to make assumptions that will cause problems in HW5. HW5 tests your system with longer pipelines, e.g., pipelines that chain several rerankers (reranker_1, reranker_2, reranker_3).

This shouldn't be a problem for most people. Your software should just loop over n rerankers, passing each ranking along to the next reranker in the pipeline. Don't make assumptions about the number of rerankers.
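In sketch form, the pipeline loop might look like this (the rerank() method and the ranking representation are illustrative assumptions, not the reference API):

  def run_pipeline(rerankers, query, ranking):
      # Apply each reranker in order; each stage consumes the
      # previous stage's ranking. Illustrative sketch only.
      for reranker in rerankers:
          ranking = reranker.rerank(query, ranking)
      return ranking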

 

helloBert

helloBert is a simple example program that shows how to get a document's title or body string from the index, how to encode the (query, document) pair, and how to generate a BERT score for the pair. It demonstrates how to interact with the transformers library and BERT models used in this assignment. Your software will be a little more complex than helloBert.

helloBert has hardcoded parameters for the locations of the ClueWeb09 index and BERT model. They probably match your configuration; if not, they are easy to change. You should be able to run helloBert using your existing Anaconda environment.
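The core of the interaction looks approximately like the sketch below (the model path is an illustrative assumption; use the locations and the scoring convention shown in helloBert):

  import torch
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  # Sketch of scoring one (query, passage) pair.
  # The model path is an assumption; helloBert has the real locations.
  model_path = './bertModel'
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  model = AutoModelForSequenceClassification.from_pretrained(model_path)
  model.eval()

  query = 'low cost surfboarding lessons'
  passage = 'Our school offers beginner surfboarding lessons ...'

  # Encode the (query, passage) pair; the tokenizer adds [CLS]/[SEP]
  # and truncates to BERT's maximum sequence length (512).
  inputs = tokenizer(query, passage, truncation=True, max_length=512,
                     return_tensors='pt')
  with torch.no_grad():
      logits = model(**inputs).logits
  # Which logit is the relevance score depends on the model head;
  # follow the convention shown in helloBert.
  score = logits[0, 0].item()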

 

Document Representation

BERT has its own lexical processing, so document text is represented as strings, not index terms. index-cw09 contains string versions of each document field. For example:
         Idx.getAttribute("title-string", 369903)

BERT transforms an input string into a sequence of WordPiece tokens. Often, a text token is transformed into more than one WordPiece token. For example, "surfboarding" produces ['surf', '##board', '##ing'].

BERT has high computational costs, so the length of input sequences is controlled. There are two limits on sequence length. The first limits the number of WordPiece tokens in a sequence to control computational costs; we call this BERT's maximum sequence length. The second limits the number of text tokens in a passage; we call this the passage length.

Note: In your experiments, BERT's maximum sequence length is always 512 WordPiece tokens. The passage length will vary in different experiments. If a passage of length n text tokens produces more than 512 WordPiece tokens, BERT truncates it to 512 WordPiece tokens. In this corpus, a single text token produces about 1.5 WordPiece tokens, on average. Thus, sequences of more than about 340 text tokens are likely to be truncated by BERT.
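A small demonstration of both behaviors (the model path is an assumption; load the tokenizer as helloBert does):

  from transformers import AutoTokenizer

  # Illustrative sketch; the model path is an assumption.
  tokenizer = AutoTokenizer.from_pretrained('./bertModel')

  # A single text token may map to several WordPiece tokens.
  print(tokenizer.tokenize('surfboarding'))
  # expected: ['surf', '##board', '##ing']

  # truncation=True enforces BERT's maximum sequence length.
  encoding = tokenizer(' '.join(['token'] * 600), truncation=True,
                       max_length=512)
  print(len(encoding['input_ids']))   # at most 512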

The reference code uses a simple approach to tokenizing the text.
     token_list = some_text.split()

If a text is longer than the passage length, the search engine divides it into passages. To avoid splitting a relevant chunk of text between two passages, overlapping passages are created; the distance between the start of passage i and the start of passage i+1 is called the stride. For example, given an original text, a passage length of 5, and a stride of 3:
         Original text: a b c d e f g h i j
         Passage1: a b c d e
         Passage2: d e f g h
         Passage3: g h i j
Edge case alert: Note that a Passage4 consisting of "j" was not created even though it is stride tokens beyond the start of Passage3. A new passage is only created if its contents would not be covered by the previous passage. "j" is fully covered by Passage3, so a Passage4 is not required.

Often, the stride is half of the passage length, but that is not required. The bertrr:psgLen and bertrr:psgStride parameters indicate how to form passages. The bertrr:psgCnt parameter indicates the maximum number of passages formed for a document (i.e., the search engine may truncate the document to control computational costs).

Passages may be formed from the body field alone, or by concatenating the title with text from the body field (with a space between). The bertrr:maxTitleLength parameter indicates whether and how to use the title field. If the value is 0, don't use the title. Otherwise, the value indicates the maximum number of title tokens. Note: The bertrr:psgLen parameter applies (only) to text from the body field (i.e., it does not include the title).

Although passage formation is done at the token level, the input to the BERT tokenizer (AutoTokenizer) is a passage string, not a passage token list. Your software must combine the passage tokens into a whitespace-separated string before passing it to BERT.
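Putting these rules together, passage formation might look like the following sketch (function and variable names are illustrative; the parameter semantics follow the description above):

  def make_passages(body_tokens, title_tokens, psg_len, psg_stride,
                    psg_cnt, max_title_len):
      # Illustrative sketch, not the reference implementation.
      # max_title_len == 0 disables the title prefix.
      prefix = ' '.join(title_tokens[:max_title_len])
      passages = []
      start = 0
      while len(passages) < psg_cnt:
          chunk = body_tokens[start:start + psg_len]
          if not chunk:
              break
          text = ' '.join(chunk)
          passages.append(prefix + ' ' + text if prefix else text)
          # Stop once a passage reaches the end of the body; anything
          # later would be covered by this passage (the "j" edge case).
          if start + psg_len >= len(body_tokens):
              break
          start += psg_stride
      return passages

  # The example above: passage length 5, stride 3.
  print(make_passages('a b c d e f g h i j'.split(), [], 5, 3, 10, 0))
  # ['a b c d e', 'd e f g h', 'g h i j']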

 

Score Aggregation

If a document is divided into passages, a score aggregation strategy determines how to convert passage scores into document scores. Your HW5 system must support three score aggregation strategies: {firstp, avgp, maxp}, which use the {first, average, maximum} of the passage scores, respectively. The bertrr:scoreAggregation parameter indicates which aggregation method to use.
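In sketch form (illustrative):

  def aggregate(passage_scores, method):
      # Illustrative sketch of the three aggregation strategies.
      if method == 'firstp':
          return passage_scores[0]
      if method == 'avgp':
          return sum(passage_scores) / len(passage_scores)
      if method == 'maxp':
          return max(passage_scores)
      raise ValueError('unknown bertrr:scoreAggregation: ' + method)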

 

Development Sequence

Start by examining and running helloBert.py. This step confirms that your environment and BERT models work properly.

Next, develop the configurations that use .inRank files, to avoid problems with your initial ranker. When the .inRank configurations work properly, focus on the test cases that use your initial ranking. If your HW2 system worked properly, there shouldn't be much work here.

Start with the simplest BERT reranking configuration. Focus on test cases that use firstp score aggregation and only the body field, because this avoids problems in score aggregation and some problems in passage formation. Then do firstp with title+body, then maxp, then avgp.

Start with only one reranker (reranker_1). Once that works, test your system on pipelines that have two or more rerankers.

 

FAQ

If you have questions not answered here, see the HW5 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on April 02, 2024

Jamie Callan