Apr 18, 2010

Muddiest Point - Final Week

Once again, I don't have any real muddy points from this last week.

Thanks!

Apr 4, 2010

Muddiest Point - Week ?

Now, this isn't a muddiest point, but I was filling out my taxes using Turbo Tax and became fascinated by the IR system they use for Q&A. The system appears to tailor search results based on the portion of the tax form you are working on, and after you select a link for an answer there is a relevance feedback element for the search result. I feel a little embarrassed by how fascinated I am by the Turbo Tax website!
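
Just to make the idea concrete for myself, here is a rough Python sketch of what I imagine is going on under the hood: results tagged with a form section get a score boost when you are on that section, and the feedback click nudges that boost. All of the names, tags, and numbers here are my own guesses for illustration, not anything from the actual product.

    # Toy sketch of context-biased ranking with a click-feedback nudge.
    # Documents, section tags, and boost values are all invented.
    BASE_SCORES = {
        "doc_mortgage_interest": 0.62,
        "doc_charitable_gifts": 0.58,
        "doc_w2_income": 0.55,
    }
    SECTION_TAGS = {
        "doc_mortgage_interest": "deductions",
        "doc_charitable_gifts": "deductions",
        "doc_w2_income": "income",
    }
    section_boost = {"deductions": 0.2, "income": 0.2}  # tuned or learned in a real system

    def rank(current_section):
        scored = []
        for doc, base in BASE_SCORES.items():
            in_context = SECTION_TAGS[doc] == current_section
            boost = section_boost[SECTION_TAGS[doc]] if in_context else 0.0
            scored.append((base + boost, doc))
        return sorted(scored, reverse=True)

    def record_feedback(doc, helpful):
        # Nudge the boost for the clicked doc's section up or down.
        section_boost[SECTION_TAGS[doc]] += 0.05 if helpful else -0.05

    print(rank("deductions"))
    record_feedback("doc_charitable_gifts", helpful=True)
    print(rank("deductions"))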

Mar 21, 2010

Readings - Web search and link analysis

The web search basics chapter (19 in IIR) was pretty much review at this point, so I don't think there is really much to say about it. Not that review isn't useful; it certainly helps refresh the memory.

As for the other readings, they are rather heady and I suspect that Dr. He's lecture will do much to clarify the details.

Muddiest Point - Unit 8

Apparently I read the right chapter from the wrong book last week, whoops! Anyway, no muddiest points for me this week.

Mar 14, 2010

Readings - XML Retrieval

The IIR chapter on XML retrieval covered a number of basics about XML and XML parsing tools as an introduction, then launched into an extensive discussion of XML retrieval methods.

One element of confusion I had was with the Structured Document Retrieval Principle. The principle itself is not at all confusing, but the example given in the text seems to be the opposite of the ideals of the principle. If the idea is to return the most specific element vis-à-vis the query, why would a query for Macbeth return the title Macbeth rather than the scene Macbeth's Castle, which is a more specific element (i.e., further down the element tree)?
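
To keep the element tree straight in my own head, here is a tiny Python sketch using my own simplified stand-in for the play markup (not the actual figure from IIR) that prints the path to each element whose text contains "Macbeth":

    import xml.etree.ElementTree as ET

    # A stripped-down, invented stand-in for the play markup in the IIR example.
    doc = ET.fromstring("""
    <play>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail ...</verse>
        </scene>
      </act>
    </play>""")

    def paths(elem, prefix=""):
        here = prefix + "/" + elem.tag
        yield here, (elem.text or "").strip()
        for child in elem:
            yield from paths(child, here)

    for path, text in paths(doc):
        if "Macbeth" in text:
            print(path, "->", text)
    # /play/title           -> Macbeth
    # /play/act/scene/title -> Macbeth's castle

The scene title really does sit deeper in the tree than the play title, which is exactly why the example puzzles me.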

The discussion of the vector space model of XML retrieval is a little confusing, but I suspect this will be clarified in the lecture this week.

It would also be nice to have a little more discussion of data-centric XML retrieval. The chapter basically blows this off as something that is best not handled in XML retrieval, but maybe we could talk about it a little bit. I am curious, even, what a data-centric XML file would look like, given that most data is tabular and linked across fields.
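
For my own curiosity, here is what I picture a data-centric XML file looking like: essentially a relational table serialized as elements, with fields linked by keys and no free-running text. This is just my guess at a typical example, not something from the chapter.

    import xml.etree.ElementTree as ET

    # My guess at "data-centric" XML: a table in angle brackets, no mixed content.
    xml_data = """
    <orders>
      <order id="1001" customer_ref="C42">
        <date>2010-03-01</date>
        <total currency="USD">59.90</total>
      </order>
      <order id="1002" customer_ref="C17">
        <date>2010-03-02</date>
        <total currency="USD">12.50</total>
      </order>
    </orders>"""

    # Flattening it back into rows shows why a database, rather than a ranked
    # retrieval engine, is probably the right tool for this kind of content.
    rows = [
        (o.get("id"), o.get("customer_ref"), o.findtext("date"), o.findtext("total"))
        for o in ET.fromstring(xml_data)
    ]
    print(rows)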

Muddiest Point - Unit 7

The only issue I have at this point is with the homework. Given that all of the assignments are, to a degree, dependent upon successful completion of the previous assignment, I think it would be helpful to get feedback (grades, comments) relatively soon.

Thanks!

Feb 21, 2010

Readings - Relevance Feedback and Query Expansion

I found the readings this week to be informative, as I had never considered the details of how user feedback is or could be incorporated into IR. I specifically found the discussion of pseudo and implicit relevance feedback to be interesting. I wonder about the tradeoffs between query efficiency and retrieval success in pseudo relevance feedback, given that one must, presumably, run two queries to get one result. Is this inefficiency not really an issue in practice?
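
To convince myself of that two-pass cost, here is a bare-bones sketch of pseudo relevance feedback in the Rocchio style: run the query, assume the top k results are relevant, fold their centroid into the query vector, then run the expanded query again. The tiny corpus and the values of k, alpha, and beta are all made up for illustration.

    from collections import Counter

    DOCS = {
        "d1": "tax deduction for mortgage interest",
        "d2": "mortgage rates and interest payments",
        "d3": "recipes for apple pie",
    }

    def vectorize(text):
        return Counter(text.split())

    def score(query_vec, doc_vec):
        # Dot product of raw term counts; a real system would use tf-idf weights.
        return sum(query_vec[t] * doc_vec[t] for t in query_vec)

    def search(query_vec):
        return sorted(DOCS, key=lambda d: score(query_vec, vectorize(DOCS[d])), reverse=True)

    def pseudo_feedback_search(query, k=1, alpha=1.0, beta=0.5):
        q0 = Counter({t: alpha for t in query.split()})
        first_pass = search(q0)                  # pass 1
        q1 = Counter(q0)
        for d in first_pass[:k]:                 # pretend the top-k docs are relevant
            for t, c in vectorize(DOCS[d]).items():
                q1[t] += beta * c / k            # Rocchio-style centroid contribution
        return search(q1)                        # pass 2: the extra cost I am asking about

    print(pseudo_feedback_search("mortgage deduction"))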

I also found the discussion of thesaurus-based query expansion to be interesting. I have seen some of this in my work with bibliographic databases, but might look into it a little more now that I understand how it works.
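
And a trivially small sketch of the thesaurus-based expansion idea, along the lines of what I have seen in bibliographic databases (the synonym table is obviously invented):

    # Toy thesaurus-based query expansion: each term is OR'd with its listed synonyms.
    THESAURUS = {
        "car": ["automobile", "vehicle"],
        "cancer": ["neoplasm", "carcinoma"],
    }

    def expand(query):
        terms = []
        for t in query.lower().split():
            synonyms = THESAURUS.get(t, [])
            terms.append("(" + " OR ".join([t] + synonyms) + ")" if synonyms else t)
        return " AND ".join(terms)

    print(expand("car insurance"))
    # (car OR automobile OR vehicle) AND insurance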

Feb 19, 2010

Muddiest Point - Unit 6

I am still interested in hearing why, when comparing results of IR systems using MAP and other averaging statistics, the IR community does not also look at variability about the mean. Simple statistics can tell us a great deal about whether two systems are truly different and how significant those differences are. Is this truly not an issue that is discussed in this field?
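
To make the question concrete, this is the kind of check I have in mind, with made-up per-query average precision scores for two systems: report the spread around MAP as well as the mean, and run a simple paired test on the per-query differences.

    import statistics

    # Made-up per-query average precision for two systems over the same 8 queries.
    ap_a = [0.42, 0.55, 0.31, 0.78, 0.64, 0.20, 0.51, 0.47]
    ap_b = [0.45, 0.50, 0.40, 0.74, 0.70, 0.33, 0.49, 0.52]

    map_a, map_b = statistics.mean(ap_a), statistics.mean(ap_b)
    sd_a, sd_b = statistics.stdev(ap_a), statistics.stdev(ap_b)
    print(f"System A: MAP={map_a:.3f} (sd={sd_a:.3f})   System B: MAP={map_b:.3f} (sd={sd_b:.3f})")

    # Paired t statistic on the per-query differences, computed by hand.
    diffs = [b - a for a, b in zip(ap_a, ap_b)]
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / len(diffs) ** 0.5)
    print(f"paired t = {t:.2f} on {len(diffs) - 1} degrees of freedom")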

Feb 14, 2010

Readings - Evaluation


The readings on evaluation this week were interesting, if a little too focused, I think, on objective evaluation. In IIR the discussion covered standard test collections, precision and recall statistics, and a number of other measures such as the precision-recall graph and mean average precision. This was followed by a discussion of relevance and the problems associated with human-based evaluation. A few things in this reading came to mind that I wouldn't mind seeing discussed in class. First, given that most of the test collections discussed in the reading come from news wires or other news sources, is there any concern, and has there been any study, about potential bias of evaluation statistics based on the content of these documents? Second, the reading seemed to downplay pretty heavily the utility of human-based evaluation. Isn't there a place for this subjective evaluation, given that this is how these systems are evaluated in the end-game (by human users)?
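
As a note to myself on the mechanics, here is how I understand average precision to be computed from a single ranked list (the ranking and the relevance judgments are invented):

    # Average precision for one query: precision at each rank where a relevant doc
    # appears, averaged over the total number of relevant docs for the query.
    ranking  = ["d3", "d1", "d7", "d2", "d9"]
    relevant = {"d1", "d2", "d5"}          # d5 is relevant but never retrieved

    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this cutoff

    average_precision = sum(precisions) / len(relevant)
    print(precisions)          # [0.5, 0.5]  -> precision at ranks 2 and 4
    print(average_precision)   # 0.333...    (the never-retrieved d5 drags it down)

MAP is then just this number averaged over all queries in the test collection.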


Muddiest Points - Unit 5

No muddiest points this week. The lectures have been doing a great job of clarifying the readings.

Jan 31, 2010

Muddiest Points - Unit 4

I rather enjoyed class this week and any questions I had about the details of the reading were cleared up in discussion. No muddiest points this time!

Jan 24, 2010

Readings - Unit 3

The readings for this week, while a little complex, were fairly interesting. It is nice to see metadata fields and other structured fields brought into the discussion, as they seemed a blatant omission up to this point. Weighting the zones of a document also seemed like an obvious improvement on the simple models we have discussed until now. I found the description of 'machine-learned relevance' to be overly complicated - is this just a simple calibration process (i.e., build the weights based on expert opinion for a known document set, then calibrate the parameters for the calculation of weights to match the expert set)?
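
For what it's worth, my mental model of weighted zone scoring is just this (the zones and weights are invented; as I read it, the "machine-learned" part would be fitting the g values to judged examples rather than picking them by hand):

    # Weighted zone scoring: score 1 for a zone that contains the query term,
    # weighted by that zone's g value; zone weights sum to 1 and are hand-picked here.
    g = {"title": 0.5, "abstract": 0.3, "body": 0.2}

    doc = {
        "title": "information retrieval evaluation",
        "abstract": "we study test collections",
        "body": "precision and recall are reported for each system",
    }

    def weighted_zone_score(query_term, document):
        return sum(weight for zone, weight in g.items()
                   if query_term in document[zone].split())

    print(weighted_zone_score("retrieval", doc))   # 0.5 (title only)
    print(weighted_zone_score("precision", doc))   # 0.2 (body only)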

The discussion of the vector space model was slightly perplexing, though I think I understand the basics. I hope we can go over it in slightly simpler terms in class.
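
For my own notes, the basics as I understand them, in code: represent the query and each document as tf-idf vectors and rank by cosine similarity (tiny invented corpus, raw term frequencies).

    import math
    from collections import Counter

    DOCS = {
        "d1": "the cat sat on the mat",
        "d2": "the dog sat on the log",
        "d3": "cats and dogs",
    }

    N = len(DOCS)
    df = Counter(t for text in DOCS.values() for t in set(text.split()))

    def tfidf(text):
        tf = Counter(text.split())
        # idf is log(N/df); terms appearing in every doc would get weight 0, so skip them
        return {t: tf[t] * math.log10(N / df[t]) for t in tf if df[t] < N}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    query_vec = tfidf("cat on mat")
    for d, text in DOCS.items():
        print(d, round(cosine(query_vec, tfidf(text)), 3))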

Muddiest Points - Unit 3

The lecture and readings for this last week were all fairly clear to me. While I understand Heaps' Law, I am still not clear on the practical use of the law. Is it used merely for estimating the vocabulary size of the collection in order to help determine memory and storage requirements for the eventual postings list and dictionary?
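
As a sanity check on my own question, here is the back-of-the-envelope use I can imagine, with k and b roughly the values the book fits for Reuters-RCV1 (I may be misremembering the exact constants, so treat them as ballpark):

    # Heaps' law: vocabulary size M ~= k * T**b, where T is the number of tokens.
    k, b = 44, 0.49   # approximate RCV1 fit from IIR, from memory

    for tokens in (1_000_000, 100_000_000, 10_000_000_000):
        vocab = k * tokens ** b
        # e.g. use this to size the dictionary / term table before building the index
        print(f"{tokens:>14,} tokens -> ~{vocab:,.0f} distinct terms")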

Jan 17, 2010

Readings - Unit 2

The chapter on indexing methods was interesting. I liked reading about the different methods and about the hardware constraints on indexing. I am not clear, though, on what the practical implications of these hardware constraints are. Clearly we are talking about fractions of a second of difference among indexing methods, and these add up to be significant issues, but what types of real-world differences are we talking about?
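
To partly answer my own question, here is the rough arithmetic I think the chapter is driving at, using ballpark hardware numbers of the sort IIR quotes (I am quoting them from memory, so they are approximate):

    # Approximate hardware constants, from memory of the IIR figures.
    seek_time     = 5e-3     # seconds per disk seek
    disk_per_byte = 2e-8     # seconds to transfer one byte sequentially from disk
    mem_per_byte  = 5e-9     # seconds for a memory access

    postings = 100_000_000   # postings to process, say 8 bytes each
    bytes_total = postings * 8

    sequential = bytes_total * disk_per_byte   # one long sequential read
    seeky      = postings * seek_time          # worst case: one seek per posting
    in_memory  = bytes_total * mem_per_byte

    print(f"sequential disk read: {sequential:,.0f} s")
    print(f"one seek per posting: {seeky:,.0f} s ({seeky / 86400:.0f} days!)")
    print(f"in memory:            {in_memory:,.1f} s")

Seconds versus days is, I think, the real-world difference the chapter has in mind, and why the indexing algorithms go to such lengths to avoid random disk access.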

As for the compression chapter, there is not much to say here.

I am enjoying this class greatly!

Muddiest Points - Unit 2

I don't have any muddiest points this week. Maybe next week though!

Jan 10, 2010

IR: Reading Notes Unit 2

Reading through the assignment for Introduction to Information Retrieval, I was struck by how much of the basic material on indexing, postings lists, and tolerant retrieval I had covered in my previous programming work. Given that the details we have been exposed to in this first reading assignment are likely quite elementary, I am curious to see how complicated these IR systems can become.

One point that was a touch confusing to me in the reading had to do with the permuterm index. I am not entirely clear on how rotating the term in question, as mentioned on pages 53 and 54 of IIR, helps resolve the wildcard query problem. Any chance we could have this explained a little more clearly in class this week?
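
Writing it out helped a little. Here is my own sketch of the rotation trick, not the book's code: index every rotation of term+"$", and then a wildcard query X*Y becomes Y$X*, which is just a prefix lookup.

    # Permuterm index sketch: the rotation turns a wildcard query into a prefix query.
    def rotations(term):
        t = term + "$"
        return [t[i:] + t[:i] for i in range(len(t))]

    permuterm = {}                       # rotation -> original term
    for term in ["hello", "help", "hollow"]:
        for rot in rotations(term):
            permuterm[rot] = term

    def wildcard(query):                 # handles a single '*', e.g. "hel*o"
        x, y = query.split("*")
        prefix = y + "$" + x             # rotate so the '*' ends up at the end
        return sorted({t for rot, t in permuterm.items() if rot.startswith(prefix)})

    print(wildcard("hel*o"))   # ['hello']
    print(wildcard("h*"))      # ['hello', 'help', 'hollow']

A real implementation would do the prefix lookup in a B-tree over the rotations rather than scanning a dictionary, but the rotation step is the part I was unsure about.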