Question: Which base is used for the BM25 log function?
Answer: Jamie's software uses Python's math.log function,
which returns the natural logarithm.
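A quick way to confirm this for yourself: with a single argument, math.log is the natural logarithm, and other bases must be requested explicitly. The input values below are illustrative only.

```python
import math

# One argument: natural logarithm (base e).
natural = math.log(math.e)

# Other bases need a dedicated function or an explicit base argument.
base_two = math.log2(8)
base_ten = math.log10(1000)

print(natural, base_two, base_ten)
```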
Question: I am having trouble understanding the relationship
between implementing the #SUM operator and BM25. Should the formula
shown in the slides that takes in K_1, K_3, constants (N, avg_doclen),
etc. be done in the #SUM operator or the #SCORE operator?
If it is done in the #SCORE operator, what does this leave for the
#SUM operator to do?
Answer:
Most of the calculation of the BM25 score for a single query
term is done in the #SCORE operator: specifically, the RSJ weight
and the tf weight.
The qtf portion of the calculation is more like the weight in the
Indri #WSUM operator. When qtf=1 for all query terms, the qtf portion
of the BM25 formula is a constant, and thus irrelevant. You can put it
anywhere, or omit it. It doesn't change the ranking.
When qtf can vary, it is more natural to treat BM25 as having a
#WSUM operator. The query parser passes qtf (or any arbitrary weight)
to the #WSUM operator. The #WSUM operator then uses that weight in
the qtf portion of the BM25 calculation.
And, of course, the #SUM (or #WSUM) operator adds up the scores of
the individual query terms to produce a score for the query.
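The division of labor described above can be sketched roughly as follows. This is a minimal illustration, not Jamie's implementation: it assumes the common Okapi form of BM25, the function names are hypothetical, and the b parameter and the clamp at zero on the RSJ weight are assumptions that you should check against the slides.

```python
import math

def rsj_weight(N, df):
    # Depends only on the query term (assumed form; clamped at zero).
    return max(0.0, math.log((N - df + 0.5) / (df + 0.5)))

def tf_weight(tf, doclen, avg_doclen, k_1, b):
    # Depends on the (query term, document) pair.
    return tf / (tf + k_1 * ((1 - b) + b * doclen / avg_doclen))

def user_weight(qtf, k_3):
    # The qtf portion; equals 1 whenever qtf == 1.
    return (k_3 + 1) * qtf / (k_3 + qtf)

def score_term(tf, doclen, N, df, avg_doclen, k_1, b):
    # The part a #SCORE operator would compute for one document.
    return rsj_weight(N, df) * tf_weight(tf, doclen, avg_doclen, k_1, b)

def wsum(term_scores, qtfs, k_3):
    # The part a #WSUM operator would do: weight by qtf, then add up.
    return sum(user_weight(q, k_3) * s for s, q in zip(term_scores, qtfs))

def plain_sum(term_scores):
    # The part a #SUM operator would do when every qtf is 1.
    return sum(term_scores)
```

With qtf=1 for every term, wsum and plain_sum return the same value, which is why the qtf portion can be omitted in that case.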
Question:
The getDefaultScore method in the QrySopScore class doesn't take
the term as input. How can that function get the statistics it
needs to compute the default score?
Answer:
First, the argument to a #SCORE operator may be a term, or it may
be some other query argument that produces an inverted list.
It is a mistake to think about terms. Think about the inverted
list that the #SCORE operator will use to create a score list.
When the #SCORE operator is initialized, its (only) argument is evaluated, and it produces an inverted list that contains ctf. Store it so that you have it later, when QrySopScore.getDefaultScore needs it.
Efficiency hint: You may wish to also store |C|, instead of looking it up repeatedly. This will speed up your code a little.
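One way to picture this is the sketch below. The class and method names are hypothetical stand-ins, not the actual QrySopScore interface, and the default score assumes Indri-style two-stage smoothing with tf=0 (parameters mu and lam), which you should check against the lecture notes.

```python
class InvListSketch:
    """Hypothetical stand-in for an inverted list; only ctf matters here."""

    def __init__(self, ctf):
        self.ctf = ctf


class ScoreOpSketch:
    """Hypothetical stand-in for a #SCORE operator (not the course API)."""

    def initialize(self, inv_list, collection_length):
        # Done once: remember the statistics the default score will need.
        self.ctf = inv_list.ctf
        self.collection_length = collection_length  # |C|
        self.p_mle = self.ctf / self.collection_length

    def default_score(self, doclen, mu, lam):
        # Assumed smoothing form with tf = 0; no term argument is required
        # because the needed statistics were stored at initialization.
        return (1 - lam) * (mu * self.p_mle) / (doclen + mu) + lam * self.p_mle
```

Because ctf and |C| are stored at initialization, the default score only needs the document length and the smoothing parameters.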
Question:
What is the corpus frequency of #NEAR?
Answer:
A #NEAR operator produces an inverted list. Inverted lists contain
ctf.
Question:
What is the default score of a #NEAR operator?
Answer:
A #NEAR operator produces an inverted list. Inverted lists don't
have default scores. Perhaps you are wondering about the default
score of the #SCORE operator that encloses the #NEAR operator.
Question:
When evaluating query operators like #AND and #OR, if the score
list for term A contains document 21, but the score list for term B
does not have document 21, how should I calculate the scores?
Answer:
See the notes from the second Best-Match lecture.
Question:
Is the document length used in default score calculations field dependent?
Answer:
Yes, different fields have different document lengths. If your query were
#AND(a.body b.title), a.body and b.title would use the document lengths
for the body field and the title field, respectively. Currently, this
information is only available in the #TERM operator inverted list; you should
change your code so that it stores this information for the #SCORE operator.
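As a rough sketch of that change, the operator can copy the field name from its argument's inverted list when it is initialized. All names and the length table below are hypothetical and exist only for illustration; your index API will differ.

```python
# Hypothetical length table: (field, docid) -> document length.
DOC_LENGTHS = {("body", 21): 850, ("title", 21): 7}


class InvListSketch:
    """Hypothetical inverted list that knows which field it came from."""

    def __init__(self, field):
        self.field = field


class ScoreOpSketch:
    """Hypothetical #SCORE operator that remembers its argument's field."""

    def initialize(self, inv_list):
        # Copy the field once so later length lookups use the right field.
        self.field = inv_list.field

    def doc_length(self, docid):
        return DOC_LENGTHS[(self.field, docid)]


body_op = ScoreOpSketch()
body_op.initialize(InvListSketch("body"))
title_op = ScoreOpSketch()
title_op.initialize(InvListSketch("title"))
```

The same document then has different lengths depending on which operator asks.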
Question:
My software is much slower than your software. What might be
the cause?
Answer: The most common problem is doing some calculations
repeatedly. For example, the RSJ (idf) weight and some of the Indri
smoothing probabilities only need to be computed once for
each query term, but a straightforward implementation of the
formulas causes them to be calculated for each document.
Whenever you implement a new calculation, think about whether it
needs to be done just once (i.e., it is some kind of constant), or
whether it is (query-term, document-id) specific. When you compute
constants and where you cache them is up to you. One choice for
the RSJ and smoothing weights is to compute and store them when the
QrySopScore operator is initialized. There are other equally good
choices, so do what best fits your software design.
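The difference can be seen in a small sketch. The function names are hypothetical, the RSJ form is an assumption (check the slides), and the list of tf values stands in for whatever document-specific work your code does.

```python
import math

calls = {"rsj": 0}  # counts how often the constant is recomputed


def rsj_weight(N, df):
    calls["rsj"] += 1
    return max(0.0, math.log((N - df + 0.5) / (df + 0.5)))


def score_per_document(tfs, N, df):
    # Straightforward version: the constant is recomputed for every document.
    return [rsj_weight(N, df) * tf for tf in tfs]


def score_hoisted(tfs, N, df):
    # The weight depends only on the query term, so compute it once.
    rsj = rsj_weight(N, df)
    return [rsj * tf for tf in tfs]
```

Both functions return the same scores, but the second calls rsj_weight once per query term instead of once per document.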
Question:
The software test web service is not printing any output!
Answer:
The most common problem is that your code took too long, and thus
was terminated. There should be an error message that explains this.
If you really had a submission that just stopped printing output and
never completed, send Jamie an email with your submission timestamp
so that he can take a look. That may mean that your software died or
did something else in an unexpected way.
Question:
Is there a way to view the document text?
Answer:
You can use InspectIndex. Your index contains the clean (non-HTML)
text for each field. For example:
python InspectIndex.py -index INPUT_DIR/index-cw09 -list-attribute 25 title-string
If the FAQ hasn't answered your question, please search the Piazza forum to see if someone has already answered your question before you ask it.
Copyright 2024, Carnegie Mellon University.
Updated on January 31, 2024