Comparing ViNTs with MDR

 

Below is the test report for ViNTs and MDR on data set 3. SRR stands for search result record throughout this page.

 

MDR extracts from a single page at a time. To make the results comparable with MDR, we configured ViNTs to run on single pages, i.e., it builds a wrapper from a single page and then applies that wrapper to extract the results from the same page. ViNTs returns only the SRRs in the major category of a web page, while MDR reports all identified categories. When a page contains multiple categories of SRRs, we count only the major category. Table 1 below shows the results of running ViNTs and MDR on data set 3. The columns labeled SRRs list the actual number of SRRs in each web page. The other labels mean: VWE, the number of SRRs extracted by ViNTs; VWC, the number of SRRs correctly extracted by ViNTs; MDE, the number of SRRs extracted by MDR; and MDC, the number of SRRs correctly extracted by MDR.

Table 1. Comparison results with MDR

Web site       SRRs  VWE  VWC  MDE  MDC  | Web site       SRRs  VWE  VWC  MDE  MDC
Agents         38    38   38   *    *    | gamelan        10    10   10   10   10
Alphabet       10    10   10   10   10   | Google+        10    8    8    0    0
Alphaworks     9     9    9    *    *    | goto           40    40   40   *    *
Amazold+       6     6    6    0    0    | Hotbot+        10    14   9    0    0
Amazon         50    50   50   50   0 #  | ibm            4     4    4    4    4
aw             10    10   10   10   10   | Infoseek+      10    10   10   0    0
barnes         19    19   19   16   16   | Itn+           10    9    9    0    0
bookbuyer      28    28   28   *    *    | King+          19    19   19   19   19
bookpool       25    25   25   25   25   | lc             20    19   19   *    *
Borders+       50    50   50   0    0    | Lycos+         10    10   10   0    0
Canoe          20    20   20   20   20   | MagazineOutlet 12    12   12   3    3
canoe2+        8     8    8    0    0    | msn            50    51   50   50   50
cbcconsumer    7     7    7    7    7    | Powells+       8     8    8    0    0
Chapters+      20    20   20   0    0    | quote          10    10   10   10   10
cnet2+         15    15   15   0    0    | rubylane       25    24   24   25   25
cnetGames      3     5    2    4    3    | signpost       12    11   11   *    *
CnetTech+      15    15   15   0    0    | thestar        50    49   49   50   50
Cody           20    20   20   19   19   | vancouversun   4     5    4    4    4
Dwjava         14    14   14   14   14   | vunet          10    10   10   *    *
Dwxml          16    16   16   16   16   | wine           10    10   10   10   10
Ebay           3     3    3    3    3    | Yahoo+         15    15   15   0    0
Etoys          9     7    7    7    7    | Yahoo2+        20    20   20   0    0
Excite         10    10   10   *    *    | yahooAction    50    50   50   50   50
Fatbrain+      25    24   24   0    0    | Zbooks         28    28   28   3    0 ^
gameCenter     35    35   35   35   35   | zshop          50    50   50   5    0 ^

MDR failed to produce any output on 8 web pages due to program exceptions; we mark the corresponding cells with a “*”. ViNTs worked on all 50 pages. MDR misaligned the SRRs in amazon (marked with a “#”), so that each extracted SRR consisted of part of an actual SRR and part of the next one. In zbooks and zshop (marked with a “^”), it combined multiple SRRs into one extracted SRR. Table 2 summarizes the results for the 42 web pages on which MDR produced results.

Table 2. Summary

                  ViNTs   MDR
#SRRs             795     795
#Extracted SRRs   795     479
#Correct SRRs     785     420
Recall            98.7%   52.8%
Precision         98.7%   87.7%
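
The recall and precision figures in Table 2 are consistent with the usual definitions, recall = correct / actual and precision = correct / extracted. The report does not state these formulas explicitly, so the sketch below assumes them; the function name `recall_precision` is illustrative only.

```python
def recall_precision(actual, extracted, correct):
    """Return (recall, precision) as percentages rounded to one decimal.

    Assumed definitions (not stated in the report):
      recall    = correct / actual
      precision = correct / extracted
    """
    recall = round(100.0 * correct / actual, 1)
    precision = round(100.0 * correct / extracted, 1)
    return recall, precision


# Counts taken from Table 2 (the 42 pages on which MDR produced results).
vints = recall_precision(actual=795, extracted=795, correct=785)
mdr = recall_precision(actual=795, extracted=479, correct=420)

print("ViNTs (recall, precision):", vints)
print("MDR   (recall, precision):", mdr)
```

Under these assumed definitions, the computed values agree with the Recall and Precision rows of Table 2.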

 

By analyzing the underlying HTML structures that wrap the major SRRs in the 42 web pages for which MDR produced results, we found that 26 pages use HTML table- and form-related tags to wrap the SRRs, while 16 pages (marked with a “+” after their web site name) use other HTML tags such as P, LI, etc. Table 3 shows the test results of ViNTs and MDR on the two types of web pages. The columns labeled TF list the results for the 26 pages wrapped with table and form tags, while the columns labeled NTF list the results for the 16 pages that use other types of tags to wrap the data.

Table 3. Categorized summary

                  ViNTs           MDR
                  TF      NTF     TF      NTF
#SRRs             544     251     544     251
#Extracted SRRs   544     251     460     19
#Correct SRRs     539     246     401     19
Recall            99.1%   98.0%   73.7%   7.7%
Precision         99.1%   98.0%   87.2%   100%
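
The count rows of Table 3 can be reproduced from Table 1 by grouping the 42 pages on the “+” marker and summing each column. The following is a minimal sketch of that aggregation, not the procedure the authors actually used; the names `ROWS` and `summarize` are illustrative, and the “#” and “^” annotations are dropped from the data.

```python
# Rows from Table 1 for the 42 pages on which MDR produced output:
# (site, SRRs, VWE, VWC, MDE, MDC). A trailing "+" marks a page whose
# SRRs are wrapped in tags other than table/form tags (NTF).
ROWS = [
    ("Alphabet", 10, 10, 10, 10, 10),
    ("Amazold+", 6, 6, 6, 0, 0),
    ("Amazon", 50, 50, 50, 50, 0),
    ("aw", 10, 10, 10, 10, 10),
    ("barnes", 19, 19, 19, 16, 16),
    ("bookpool", 25, 25, 25, 25, 25),
    ("Borders+", 50, 50, 50, 0, 0),
    ("Canoe", 20, 20, 20, 20, 20),
    ("canoe2+", 8, 8, 8, 0, 0),
    ("cbcconsumer", 7, 7, 7, 7, 7),
    ("Chapters+", 20, 20, 20, 0, 0),
    ("cnet2+", 15, 15, 15, 0, 0),
    ("cnetGames", 3, 5, 2, 4, 3),
    ("CnetTech+", 15, 15, 15, 0, 0),
    ("Cody", 20, 20, 20, 19, 19),
    ("Dwjava", 14, 14, 14, 14, 14),
    ("Dwxml", 16, 16, 16, 16, 16),
    ("Ebay", 3, 3, 3, 3, 3),
    ("Etoys", 9, 7, 7, 7, 7),
    ("Fatbrain+", 25, 24, 24, 0, 0),
    ("gameCenter", 35, 35, 35, 35, 35),
    ("gamelan", 10, 10, 10, 10, 10),
    ("Google+", 10, 8, 8, 0, 0),
    ("Hotbot+", 10, 14, 9, 0, 0),
    ("ibm", 4, 4, 4, 4, 4),
    ("Infoseek+", 10, 10, 10, 0, 0),
    ("Itn+", 10, 9, 9, 0, 0),
    ("King+", 19, 19, 19, 19, 19),
    ("Lycos+", 10, 10, 10, 0, 0),
    ("MagazineOutlet", 12, 12, 12, 3, 3),
    ("msn", 50, 51, 50, 50, 50),
    ("Powells+", 8, 8, 8, 0, 0),
    ("quote", 10, 10, 10, 10, 10),
    ("rubylane", 25, 24, 24, 25, 25),
    ("thestar", 50, 49, 49, 50, 50),
    ("vancouversun", 4, 5, 4, 4, 4),
    ("wine", 10, 10, 10, 10, 10),
    ("Yahoo+", 15, 15, 15, 0, 0),
    ("Yahoo2+", 20, 20, 20, 0, 0),
    ("yahooAction", 50, 50, 50, 50, 50),
    ("Zbooks", 28, 28, 28, 3, 0),
    ("zshop", 50, 50, 50, 5, 0),
]


def summarize(rows):
    """Sum the five count columns separately for TF and NTF pages."""
    totals = {"TF": [0] * 5, "NTF": [0] * 5}
    for site, *counts in rows:
        key = "NTF" if site.endswith("+") else "TF"
        totals[key] = [t + c for t, c in zip(totals[key], counts)]
    return totals


totals = summarize(ROWS)
for key, (srrs, vwe, vwc, mde, mdc) in totals.items():
    print(f"{key}: SRRs={srrs} VWE={vwe} VWC={vwc} MDE={mde} MDC={mdc}")
```

The per-category sums match the #SRRs, #Extracted SRRs and #Correct SRRs rows of Table 3 for both systems.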