WWW::Search and AutoSearch and WebSearch
========================================
WHAT IS NEW IN WWW::Search 2.09? (2000-02-08)
----------------------------------------------
overview:
* new module WWW::Search::Test
* new methods maintainer(), gui_query()
* bug fixes for several backends (as usual)
* some backends removed to their own modules
For details, see the file ChangeLog and/or the pod of each module.
WHAT IS WWW::Search?
--------------------
WWW::Search is a collection of Perl modules which provide an API to
WWW (and similar) search engines. Currently WWW::Search includes
backends for variations of AltaVista, Dejanews, Infoseek, Lycos,
Magellan, and WebCrawler, among others. Backends for some engines can
be obtained separately, such as Excite, HotBot, and Yahoo. This
distribution includes two applications built from this library:
AutoSearch (a program to automate tracking of search results over
time), and WebSearch, a small demonstration program to drive the
library.
WWW::Search does NOT try to emulate the default search that you would
get with each search engine's GUI. WWW::Search performs the search in
a way that is efficient and convenient for text processing. This
might include getting "text-only" pages, making "OR" the default query
term operator instead of "AND", ungrouping same-site results, making
sure descriptions are turned on, and increasing the number of hits per
page, among other tricks. Hopefully, the documentation for each
backend will tell you about the more important tricks being used.
Because WWW::Search depends on parsing the HTML output of web search
engines, it will fail if the search engine operators change their
format (an unfortunately frequent occurrence). WWW::Search includes a
test suite for many backends which verifies that they are functioning
correctly. The test suite can be run by typing 'make test_parsing';
see under INSTALLATION below for details. As of the day of the
release the current backend status is:

  AltaVista                       working
  AltaVista::AdvancedNews         not working
  AltaVista::AdvancedWeb          not working
  AltaVista::Careers              working? (not in test suite)
  AltaVista::Intranet             working
  AltaVista::News                 not working
  AltaVista::Web                  working
  AOL::Classifieds::Employment    working (not in test suite)
  Crawler                         partially working?
  Deja/Dejanews                   working
  Dice                            working? (not in test suite)
  Euroseek                        working? (not in test suite)
  Excite::News                    working
  ExciteForWebServers             not working
  Fireball                        working? (not in test suite)
  FolioViews                      working
  Google                          working
  Gopher                          not working? (not in test suite)
  GoTo                            working
  HeadHunter                      working? (not in test suite)
  HotFiles                        working
  Infoseek                        working
  Infoseek::Companies             working
  Infoseek::Email                 not working
  Infoseek::News                  working
  Infoseek::Web                   working
  Livelink                        not working? (not in test suite)
  LookSmart                       working
  Lycos                           working
  Lycos::Pages                    defunct
  Lycos::Sites                    defunct
  Magellan                        working
  MetaCrawler                     working
  Metapedia                       working? (not in test suite)
  Monster                         working? (not in test suite)
  MSIndexServer                   not working?
  NetFind                         working
  NorthernLight                   working
  Null                            working
  OpenDirectory                   working
  PLweb                           not working
  Profusion                       defunct
  Search97                        not working
  SFgate                          working
  Simple                          not working? (not in test suite)
  Snap                            working
  Verity                          not working (not in test suite)
  WebCrawler                      working
  Yahoo::Classifieds::Employment  working? (not in test suite)
  ZDNet                           working

"Partially working" indicates that some tests passed and some failed.
The following backends are now registered at CPAN independently (not
included with the WWW::Search release):
Excite
HotBot
Yahoo
WHAT IS AutoSearch?
-------------------
WWW::Search's primary client is AutoSearch. AutoSearch performs a
web-based search and puts the results set in a web page. It
periodically updates this web page, indicating how the search changes
over time. Sample output from AutoSearch can be found at
. Output format is
configurable.
See the man page for AutoSearch details, or the DEMONSTRATION section
below for quick-start instructions.
REQUIREMENTS
------------
WWW::Search requires Perl5, the libwww-perl module suite, the URI
module, and the HTML::Parser module. Some of the "not working"
modules require the HTML::TreeBuilder module (so you can ignore
warnings about TreeBuilder during the build). For information on
Perl5, see . For all the modules, see
to find a CPAN site near you.
At the time of this release, the primary WWW::Search development and
testing is under perl version 5.005_03 on Sun UltraSparc Solaris 7 and
under ActiveState perl build 522 on Windows NT 4.0 service pack 6.
WWW::Search has also been built and tested successfully on Win98J
(that's Japanese) with ActiveState perl build 517.
If you have successfully built and tested WWW::Search on any other
(obscure) platform / version combination, please let me know!
MartinThurn@iname.com
AVAILABILITY
------------
The latest version of WWW::Search should always be available on CPAN.
Here is the best URL for finding it:
http://www.perl.com/CPAN-local/modules/by-module/WWW
INSTALLATION
------------
In order to use this package you will need Perl version 5.002 or
later.

It is highly recommended that you use CPAN.pm to install WWW::Search.
It will automatically install all the prerequisite modules and put
everything in the right places. On a Unix system, just type

    perl -MCPAN -e 'install WWW::Search'

Otherwise, you can install WWW::Search as you would any Perl module
library, by running these commands in the WWW-Search-x.xx directory
after unpacking the archive (and after installing all the prerequisite
modules):

    perl Makefile.PL
    make test
    make install

On Win32, maintenance and testing is done with Microsoft's nmake.exe;
use 'nmake' instead of 'make' in the above sequence of commands.
When you do `perl makefile.pl` on Win32, you might get warnings that a
whole bunch of 'zero*.out' files are missing. This seems to be a bug
in some versions of WinZip which refuse to extract empty files from
the archive. You can ignore these warnings.
If you want to install a private copy of WWW::Search in your home
directory, then you should do the installation with something like
these commands:

    perl Makefile.PL INSTALLDIRS=perl PREFIX=/my/perl/lib
    make test
    make pure_perl_install UNINST=1

Don't forget to add /my/perl/lib to your PERL5LIB environment variable
(or use lib '/my/perl/lib'; or unshift @INC, '/my/perl/lib')!
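
As a minimal sketch of the PERL5LIB approach, assuming a Bourne-style
shell and the example prefix /my/perl/lib used above (the variable name
PRIVATE_LIB is only illustrative), the following prepends the private
directory and then lists perl's search path so you can check that the
directory appears in it:

```shell
# Hypothetical private library directory; substitute the PREFIX you used.
PRIVATE_LIB=/my/perl/lib

# Prepend it to PERL5LIB, keeping any value that is already set.
PERL5LIB="$PRIVATE_LIB${PERL5LIB:+:$PERL5LIB}"
export PERL5LIB

# List perl's module search path and show the entries under the private directory.
perl -e 'print "$_\n" for @INC' | grep -F "$PRIVATE_LIB"
```

To make the setting permanent, the two PERL5LIB lines would go in your
shell startup file.
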
TESTING
-------
The "make test_parsing" command compares expected output from
WWW::Search with actual output. You can give arguments to the
test_parsing program by using the TEST_ARGS macro. For example, the
following command only runs the external queries for Dejanews:

    make test_parsing TEST_ARGS='-e Dejanews -x'

To see all the available options, do this:

    make test_parsing TEST_ARGS='-help'

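
If you want to check several backends one at a time, the Dejanews
command above can be generalized. The following is only a sketch: the
function name external_test_cmd is hypothetical, it reuses the same
'-e ... -x' arguments shown above, and it prints the commands (a dry
run) rather than running them:

```shell
# Print the test_parsing command that runs the external queries
# for one backend (same arguments as the Dejanews example).
external_test_cmd() {
    printf "make test_parsing TEST_ARGS='-e %s -x'\n" "$1"
}

# Dry run: show the commands for a few backends without running them.
for backend in AltaVista Dejanews Infoseek; do
    external_test_cmd "$backend"
done
```

To actually run one of the printed commands, pipe it to sh from inside
the unpacked distribution directory, for example
external_test_cmd Lycos | sh
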
The "test_parsing" utility detects two kinds of errors:

- internal parsing:
  First, it checks to make sure that your system computes
  the same results as my system based on some saved
  Web queries. This test should always pass for working
  backends; if it doesn't, send me mail.

- external queries:
  Second, it makes real queries against the search engines
  and compares them with some saved results.

External queries can fail for several reasons:

- new pages have been added which match the test queries, or matching
  pages have been deleted, causing the page count to go too far out of
  whack from the expected number (not necessarily a bad thing)

- changes in the web search engine output which break WWW::Search's
  parsers, usually resulting in no URLs being returned (a bad thing)

If the external tests fail, please either investigate the error or
send a description of the problem, your operating system and perl
version numbers, and the relevant output of "make test_parsing" to
the maintainer of the backend for the search engine that fails.
DISCUSSION, BUG REPORTS, AND IMPROVEMENTS
-----------------------------------------
Feedback about WWW::Search is encouraged. If you're using it for a
neat application, please let us know. If you'd like to implement and
publish a new backend for WWW::Search (or already have), let us
know so we don't duplicate work.
Feedback, bug reports, fixes, and new backends should be sent to
Martin Thurn. When sending e-mail, please put [WWW::Search] at the
beginning of the subject line (or risk me losing the message in the
pile).
There is a mailing list for WWW::Search discussion. To subscribe,
send "subscribe info-www-search" as the body of a message to
. If you use WWW::Search at all, you
should subscribe to the mailing list. Bug fixes are posted there as
soon as they are available.
Back-end-related bug reports ("search engine ABC doesn't work") should
be sent to the author of the backend (backend authors are identified
in the corresponding man page and in the output of `make
test_parsing`). General bugs should be reported to
.
When submitting a bug report or request for help, please remember to
include:
- your operating system name and version
- your version of perl
- your version of WWW::Search
- your version of the backend
- the code you ran to produce the error (PLEASE cut-and-paste, do not just summarize!)
- sample output showing the error (PLEASE cut-and-paste, do not just summarize!)
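
The version details in the list above can be collected with a few
commands. This is a rough sketch, assuming a Unix-like shell with uname
available; the function name report_env is hypothetical, and it assumes
the backend module sets a $VERSION (not all may). The code you ran and
its output still have to be pasted in by hand:

```shell
# Print the version details requested in a bug report.
# Usage: report_env [BackendName]   (defaults to AltaVista)
report_env() {
    backend="${1:-AltaVista}"
    os=$(uname -sr)
    perl_version=$(perl -e 'print $]' 2>/dev/null) || perl_version="not found"
    search_version=$(perl -MWWW::Search -e 'print WWW::Search->VERSION' 2>/dev/null) \
        || search_version="not installed"
    backend_version=$(perl -M"WWW::Search::$backend" \
        -e "print WWW::Search::$backend->VERSION" 2>/dev/null) \
        || backend_version="not installed"
    echo "OS: $os"
    echo "perl: $perl_version"
    echo "WWW::Search: $search_version"
    echo "backend $backend: $backend_version"
}

report_env
```

Pass the backend that is failing as the argument, for example
report_env Infoseek
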
DEMONSTRATION
-------------
After installing the distribution, try:

    WebSearch '"Your Name Here"'

or, if you are on Win32:

    WebSearch "\"Your Name Here\""

to see who's talking about you on the web. Then (in your web page
directory), try:

    cd /path/to/your/web/pages
    AutoSearch -n me_on_the_web -s '"Your Name Here"' me

or, if you are on Win32:

    cd /path/to/your/web/pages
    AutoSearch -n me_on_the_web -s "\"Your Name Here\"" me

and the web page /path/to/your/web/pages/me/index.html will be created
summarizing this information. If you are on UNIX you can add

    0 3 * * 1 AutoSearch /path/to/your/web/pages/me

to your crontab to update this search every week at 3:00 Monday
morning, for example.
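
The crontab step can be sketched as follows, assuming a Unix system
with the standard crontab command. The variable names search_dir and
cron_entry are illustrative, and the install line is left commented out
so that nothing changes until you have reviewed the entry:

```shell
# Directory that AutoSearch created in the step above (example path).
search_dir=/path/to/your/web/pages/me

# Fields: minute 0, hour 3, any day of month, any month, weekday 1 (Monday).
cron_entry="0 3 * * 1 AutoSearch $search_dir"

# Print the entry for review.
echo "$cron_entry"

# To install it, append it to your current crontab (uncomment to run):
# (crontab -l 2>/dev/null; echo "$cron_entry") | crontab -
```

cron usually runs jobs with a minimal PATH, so you may need to give the
full path to AutoSearch in the entry.
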
DOCUMENTATION
-------------
See `perldoc WWW::Search` after installation for an overview of the
library. POD-style documentation is also included in all modules and
programs, so you can do `perldoc WebSearch` and `perldoc AutoSearch`
and `perldoc WWW::Search::AltaVista` after installation.
FUTURE PLANS
------------
Some ideas:
- use LWP::ParallelUA to speed up multiple backend search requests
(I'm trying to decide what the API interface will look like; please
send suggestions). Contact
- updates to each backend that will force WWW::Search to perform the
same search as the engine's web GUI (I'm looking for contributions of
the precise arguments that will produce such a search for each engine;
i.e. the hash that should be passed as the second argument to
native_query). Contact
- application-level proxy support (I'm looking for a contribution
here from someone who uses/needs proxy support and can test it).
Contact
- more widespread use of result tags description, date, size,
etc. across all backends
- a freeze/restore interface to suspend and resume in-progress queries
- more backends
Contributions are always welcome. Send me e-mail if you plan a new
backend or want to discuss architectural changes (to avoid duplicating
work). Contact
SUPPORT AND CREDITS
-------------------
The WWW::Search architecture is by John Heidemann with feedback from
the other contributors. NOTE: This list is no longer updated; consult
the on-line documentation and/or the output of `make test` to find out
who is currently maintaining each component.

PLATFORM SUPPORT:
  Unix                  John Heidemann
  Windows               Jim Smyser
                        (see )

APPLICATIONS:
  WebSearch             John Heidemann
  AutoSearch            William Scheding

BACK-ENDS:
  AltaVista             John Heidemann
  Dejanews              Cesare Feroldi de Rosa
                        and Martin Thurn
  Crawler               Andreas Borchert
  Excite                GLen Pringle
                        and Martin Thurn
  ExciteForWebServers   Paul Lindner
  Fireball              Andreas Borchert
  FolioViews            Paul Lindner
  Gopher                Paul Lindner
  HotBot                William Scheding and Martin Thurn
  HotFiles              Jim Smyser
  Infoseek              Cesare Feroldi de Rosa and Martin Thurn
  Livelink              Paul Lindner
  Lycos                 William Scheding and John Heidemann,
                        Martin Thurn
  Magellan              Martin Thurn
  MSIndexServer         Paul Lindner
  NorthernLight         Jim Smyser
  Null                  Paul Lindner
  OpenDirectory         Jim Smyser
  PLWeb                 Paul Lindner
  Profusion             Jim Smyser
  Search97              Paul Lindner
  SFgate                Paul Lindner
  Simple                Paul Lindner
  Snap                  Jim Smyser
  Verity                Paul Lindner
  WebCrawler            Martin Thurn
  Yahoo                 William Scheding and Martin Thurn
  ZDNet                 Jim Smyser

AutoSearch is based on an earlier implementation by Kedar Jog
with advice from Joe Touch.

Bugs and extensions (to the software and documentation) have been
identified by William Scheding, T. V. Raman (proxy support),
C. Feroldi, Larry Virden, Paul Lindner, Guy Decoux,
R Chandrasekar (Mickey), Martin Thurn, Chris Nandor,
Martin Valldeby, Jim Smyser, Darren Stalder, Neil Bowers,
Ave Wrigley, and Andreas Borchert.

Bugs have been reported by Joseph McDonald, Juan Jose Amor,
Bowen Dwelle, Vassilis Papadimos, Vidyut Luther, and
Chris P. Acantilado.

COPYRIGHT
---------
Copyright (c) 1996 University of Southern California.
All rights reserved.
Redistribution and use in source and binary forms are permitted
provided that the above copyright notice and this paragraph are
duplicated in all such forms and that any documentation, advertising
materials, and other materials related to such distribution and use
acknowledge that the software was developed by the University of
Southern California, Information Sciences Institute. The name of the
University may not be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Portions of this README are derived from the README for libwww-perl.