Comparing ViNTs with MDR
Below is the test report for testing ViNTs and MDR on data set 3. SRR stands for search result record throughout this page.
MDR extracts from a single page at a time. To be comparable with MDR, we configured ViNTs to run on single pages, i.e. building a wrapper from a single page and then applying that wrapper to extract results from the same page. ViNTs returns only the SRRs in the major category of a web page, while MDR reports all identified categories. When a page contains multiple categories of SRRs, we count only the major category. Table 1 below shows the results of running ViNTs and MDR on data set 3. The columns labeled SRRs list the actual numbers of SRRs in the web pages. The other labels mean: VWE, the number of SRRs extracted by ViNTs; VWC, the number of SRRs correctly extracted by ViNTs; MDE, the number of SRRs extracted by MDR; and MDC, the number of SRRs correctly extracted by MDR.
Table 1. Comparison results with MDR
| Web site | SRRs | VWE | VWC | MDE | MDC | Web site | SRRs | VWE | VWC | MDE | MDC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Agents | 38 | 38 | 38 | * | * | gamelan | 10 | 10 | 10 | 10 | 10 |
| Alphabet | 10 | 10 | 10 | 10 | 10 | Google+ | 10 | 8 | 8 | 0 | 0 |
| Alphaworks | 9 | 9 | 9 | * | * | goto | 40 | 40 | 40 | * | * |
| Amazold+ | 6 | 6 | 6 | 0 | 0 | Hotbot+ | 10 | 14 | 9 | 0 | 0 |
| Amazon | 50 | 50 | 50 | 50 | 0 # | ibm | 4 | 4 | 4 | 4 | 4 |
| aw | 10 | 10 | 10 | 10 | 10 | Infoseek+ | 10 | 10 | 10 | 0 | 0 |
| barnes | 19 | 19 | 19 | 16 | 16 | Itn+ | 10 | 9 | 9 | 0 | 0 |
| bookbuyer | 28 | 28 | 28 | * | * | King+ | 19 | 19 | 19 | 19 | 19 |
| bookpool | 25 | 25 | 25 | 25 | 25 | lc | 20 | 19 | 19 | * | * |
| Borders+ | 50 | 50 | 50 | 0 | 0 | Lycos+ | 10 | 10 | 10 | 0 | 0 |
| Canoe | 20 | 20 | 20 | 20 | 20 | MagazineOutlet | 12 | 12 | 12 | 3 | 3 |
| canoe2+ | 8 | 8 | 8 | 0 | 0 | msn | 50 | 51 | 50 | 50 | 50 |
| cbcconsumer | 7 | 7 | 7 | 7 | 7 | Powells+ | 8 | 8 | 8 | 0 | 0 |
| Chapters+ | 20 | 20 | 20 | 0 | 0 | quote | 10 | 10 | 10 | 10 | 10 |
| cnet2+ | 15 | 15 | 15 | 0 | 0 | rubylane | 25 | 24 | 24 | 25 | 25 |
| cnetGames | 3 | 5 | 2 | 4 | 3 | signpost | 12 | 11 | 11 | * | * |
| CnetTech+ | 15 | 15 | 15 | 0 | 0 | thestar | 50 | 49 | 49 | 50 | 50 |
| Cody | 20 | 20 | 20 | 19 | 19 | vancouversun | 4 | 5 | 4 | 4 | 4 |
| Dwjava | 14 | 14 | 14 | 14 | 14 | vunet | 10 | 10 | 10 | * | * |
| Dwxml | 16 | 16 | 16 | 16 | 16 | wine | 10 | 10 | 10 | 10 | 10 |
| Ebay | 3 | 3 | 3 | 3 | 3 | Yahoo+ | 15 | 15 | 15 | 0 | 0 |
| Etoys | 9 | 7 | 7 | 7 | 7 | Yahoo2+ | 20 | 20 | 20 | 0 | 0 |
| Excite | 10 | 10 | 10 | * | * | yahooAction | 50 | 50 | 50 | 50 | 50 |
| Fatbrain+ | 25 | 24 | 24 | 0 | 0 | Zbooks | 28 | 28 | 28 | 3 | 0 ^ |
| gameCenter | 35 | 35 | 35 | 35 | 35 | zshop | 50 | 50 | 50 | 5 | 0 ^ |
Table 2. Summary
| | ViNTs | MDR |
|---|---|---|
| #SRRs | 795 | 795 |
| #Extracted SRRs | 795 | 479 |
| #Correct SRRs | 785 | 420 |
| Recall | 98.7% | 52.8% |
| Precision | 98.7% | 87.7% |
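The recall and precision rows can be recomputed from the three count rows. The sketch below assumes the usual definitions (recall is correct SRRs divided by actual SRRs, precision is correct SRRs divided by extracted SRRs), which reproduce the percentages in Table 2; the function and variable names are illustrative only and are not part of ViNTs or MDR.

```python
def recall(correct: int, actual: int) -> float:
    """Fraction of the actual SRRs that were extracted correctly."""
    return correct / actual if actual else 0.0


def precision(correct: int, extracted: int) -> float:
    """Fraction of the extracted SRRs that are correct."""
    return correct / extracted if extracted else 0.0


# Counts copied from Table 2.
SUMMARY = {
    "ViNTs": {"actual": 795, "extracted": 795, "correct": 785},
    "MDR": {"actual": 795, "extracted": 479, "correct": 420},
}


def report(summary):
    """Return one formatted line per system."""
    lines = []
    for system, c in summary.items():
        r = recall(c["correct"], c["actual"])
        p = precision(c["correct"], c["extracted"])
        lines.append(f"{system}: recall={r:.1%} precision={p:.1%}")
    return lines


for line in report(SUMMARY):
    print(line)
# ViNTs: recall=98.7% precision=98.7%
# MDR: recall=52.8% precision=87.7%
```

Guarding against a zero denominator matters for per-page use, since MDR extracts nothing from many individual pages.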
By analyzing the underlying HTML structures that enwrap the major SRRs in the 42 web pages for which MDR produced results, we found that 26 pages use HTML table- and form-related tags to enwrap the SRRs, while 16 pages (marked by a "+" after their web site name) use other HTML tags such as P, LI, etc. Table 3 shows the test results of ViNTs and MDR on the two types of web pages respectively. The columns labeled TF list the results for the 26 table- and form-enwrapped pages, while the columns labeled NTF list the results for the 16 pages that use other types of tags to enwrap data.
Table 3. Categorized summary
| | ViNTs TF | ViNTs NTF | MDR TF | MDR NTF |
|---|---|---|---|---|
| #SRRs | 544 | 251 | 544 | 251 |
| #Extracted SRRs | 544 | 251 | 460 | 19 |
| #Correct SRRs | 539 | 246 | 401 | 19 |
| Recall | 99.1% | 98.0% | 73.7% | 7.7% |
| Precision | 99.1% | 98.0% | 87.2% | 100% |
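The per-category counts in Table 3 can be derived from the per-site rows of Table 1. The sketch below is one way to do that, under two assumptions: a trailing "+" on the site name marks an NTF page, and rows with "*" in the MDR columns are the pages left out of the summaries (the 42 remaining rows match the 42 pages for which MDR produced results). The "#" and "^" annotations on some MDC cells are dropped. `ROWS` and `summarize` are hypothetical names used only for this illustration.

```python
from collections import defaultdict

# Rows transcribed from Table 1: site, SRRs, VWE, VWC, MDE, MDC.
ROWS = """
Agents 38 38 38 * *
Alphabet 10 10 10 10 10
Alphaworks 9 9 9 * *
Amazold+ 6 6 6 0 0
Amazon 50 50 50 50 0
aw 10 10 10 10 10
barnes 19 19 19 16 16
bookbuyer 28 28 28 * *
bookpool 25 25 25 25 25
Borders+ 50 50 50 0 0
Canoe 20 20 20 20 20
canoe2+ 8 8 8 0 0
cbcconsumer 7 7 7 7 7
Chapters+ 20 20 20 0 0
cnet2+ 15 15 15 0 0
cnetGames 3 5 2 4 3
CnetTech+ 15 15 15 0 0
Cody 20 20 20 19 19
Dwjava 14 14 14 14 14
Dwxml 16 16 16 16 16
Ebay 3 3 3 3 3
Etoys 9 7 7 7 7
Excite 10 10 10 * *
Fatbrain+ 25 24 24 0 0
gameCenter 35 35 35 35 35
gamelan 10 10 10 10 10
Google+ 10 8 8 0 0
goto 40 40 40 * *
Hotbot+ 10 14 9 0 0
ibm 4 4 4 4 4
Infoseek+ 10 10 10 0 0
Itn+ 10 9 9 0 0
King+ 19 19 19 19 19
lc 20 19 19 * *
Lycos+ 10 10 10 0 0
MagazineOutlet 12 12 12 3 3
msn 50 51 50 50 50
Powells+ 8 8 8 0 0
quote 10 10 10 10 10
rubylane 25 24 24 25 25
signpost 12 11 11 * *
thestar 50 49 49 50 50
vancouversun 4 5 4 4 4
vunet 10 10 10 * *
wine 10 10 10 10 10
Yahoo+ 15 15 15 0 0
Yahoo2+ 20 20 20 0 0
yahooAction 50 50 50 50 50
Zbooks 28 28 28 3 0
zshop 50 50 50 5 0
"""


def summarize(rows_text):
    """Sum [SRRs, VWE, VWC, MDE, MDC] separately for TF and NTF pages."""
    totals = defaultdict(lambda: [0] * 5)
    for line in rows_text.strip().splitlines():
        site, *cells = line.split()
        if "*" in cells:
            continue  # assumed: page not covered by the summary tables
        category = "NTF" if site.endswith("+") else "TF"
        for i, cell in enumerate(cells):
            totals[category][i] += int(cell)
    return dict(totals)


for category, counts in summarize(ROWS).items():
    print(category, counts)
# TF [544, 544, 539, 460, 401]
# NTF [251, 251, 246, 19, 19]
```

These totals agree with the count rows of Table 3. Dividing them out also reproduces the percentages, except that MDR recall on NTF pages comes to 19/251, about 7.6%, marginally below the 7.7% printed in the table.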