我正在考虑使用Erlang,Mnesia和Yaws开发一个网络搜索引擎.是否有可能使用这些软件制作功能强大且速度最快的网络搜索引擎?它需要做什么以及我该如何开始?
Erlang今天可以成为最强大的网络爬虫.让我带您浏览我的简单抓取工具.
步骤1.我创建一个简单的并行模块,我称之为mapreduce
-module(mapreduce). -export([compute/2]). %%===================================================================== %% usage example %% Module = string %% Function = tokens %% List_of_arg_lists = [["file\r\nfile","\r\n"],["muzaaya_joshua","_"]] %% Ans = [["file","file"],["muzaaya","joshua"]] %% Job being done by two processes %% i.e no. of processes spawned = length(List_of_arg_lists) compute({Module,Function},List_of_arg_lists)-> S = self(), Ref = erlang:make_ref(), PJob = fun(Arg_list) -> erlang:apply(Module,Function,Arg_list) end, Spawn_job = fun(Arg_list) -> spawn(fun() -> execute(S,Ref,PJob,Arg_list) end) end, lists:foreach(Spawn_job,List_of_arg_lists), gather(length(List_of_arg_lists),Ref,[]).
gather(0, _, L) -> L; gather(N, Ref, L) -> receive {Ref,{'EXIT',_}} -> gather(N-1,Ref,L); {Ref, Result} -> gather(N-1, Ref, [Result|L]) end.
execute(Parent,Ref,Fun,Arg)-> Parent ! {Ref,(catch Fun(Arg))}.
步骤2. HTTP Client
One通常使用inets httpc module
内置于erlang或ibrowse
.但是,对于内存管理和速度(尽可能降低内存占用量),一个优秀的erlang程序员会选择使用curl
.通过应用os:cmd/1
获取该curl命令行的那个,可以将输出直接输入到erlang调用函数中.然而,更好的是,使curl将其输出转换为文件然后我们的应用程序有另一个线程(进程)读取和解析这些文件
Command: curl "http://www.erlang.org" -o "/downloaded_sites/erlang/file1.html"所以你可以产生很多进程.您记得在执行该命令时转义URL以及输出文件路径.另一方面,有一个过程,其工作是观察下载页面的目录.它读取并解析它们的这些页面,然后可以在解析后删除或保存在不同的位置甚至更好,使用
In Erlang
os:cmd("curl \"http://www.erlang.org\" -o \"/downloaded_sites/erlang/file1.html\"").
zip module
folder_check()-> spawn(fun() -> check_and_report() end), ok. -define(CHECK_INTERVAL,5). check_and_report()-> %% avoid using %% filelib:list_dir/1 %% if files are many, memory !!! case os:cmd("ls | wc -l") of "0\n" -> ok; "0" -> ok; _ -> ?MODULE:new_files_found() end, sleep(timer:seconds(?CHECK_INTERVAL)), %% keep checking check_and_report(). new_files_found()-> %% inform our parser to pick files %% once it parses a file, it has to %% delete it or save it some %% where else gen_server:cast(?MODULE,files_detected).
第3步.html解析器.
更好地利用它mochiweb's html parser and XPATH
.这将帮助您解析并获取所有您喜欢的HTML标记,提取内容然后再好.下面的例子,我把重点放在只Keywords
,description
并title
在标记
shell中的模块测试......非常棒的结果!!!
2> spider_bot:parse_url("http://erlang.org"). [[[],[], {"keywords", "erlang, functional, programming, fault-tolerant, distributed, multi-platform, portable, software, multi-core, smp, concurrency "}, {"description","open-source erlang official website"}], {title,"erlang programming language, official website"}]
3> spider_bot:parse_url("http://facebook.com"). [[{"description", " facebook is a social utility that connects people with friends and others who work, study and live around them. people use facebook to keep up with friends, upload an unlimited number of photos, post links and videos, and learn more about the people they meet."}, {"robots","noodp,noydir"}, [],[],[],[]], {title,"incompatible browser | facebook"}]
4> spider_bot:parse_url("http://python.org"). [[{"description", " home page for python, an interpreted, interactive, object-oriented, extensible\n programming language. it provides an extraordinary combination of clarity and\n versatility, and is free and comprehensively ported."}, {"keywords", "python programming language object oriented web free source"}, []], {title,"python programming language – official website"}]
5> spider_bot:parse_url("http://www.house.gov/"). [[[],[],[], {"description", "home page of the united states house of representatives"}, {"description", "home page of the united states house of representatives"}, [],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[], [],[],[]|...], {title,"united states house of representatives, 111th congress, 2nd session"}]
您现在可以意识到,我们可以根据关键字对页面编制索引,并提供良好的页面修复计划.另一个挑战是如何制作一个爬虫(一种可以在整个网络中移动的东西,从一个域到另一个域),但这个很容易.它可以通过解析href标签的Html文件来实现.使HTML Parser提取所有href标记,然后您可能需要一些正则表达式来获取给定域下的链接.
运行爬虫
7> spider_connect:conn2("http://erlang.org"). Links: ["http://www.erlang.org/index.html", "http://www.erlang.org/rss.xml", "http://erlang.org/index.html","http://erlang.org/about.html", "http://erlang.org/download.html", "http://erlang.org/links.html","http://erlang.org/faq.html", "http://erlang.org/eep.html", "http://erlang.org/starting.html", "http://erlang.org/doc.html", "http://erlang.org/examples.html", "http://erlang.org/user.html", "http://erlang.org/mirrors.html", "http://www.pragprog.com/titles/jaerlang/programming-erlang", "http://oreilly.com/catalog/9780596518189", "http://erlang.org/download.html", "http://www.erlang-factory.com/conference/ErlangUserConference2010/speakers", "http://erlang.org/download/otp_src_R14B.readme", "http://erlang.org/download.html", "https://www.erlang-factory.com/conference/ErlangUserConference2010/register", "http://www.erlang-factory.com/conference/ErlangUserConference2010/submit_talk", "http://www.erlang.org/workshop/2010/", "http://erlangcamp.com","http://manning.com/logan", "http://erlangcamp.com","http://twitter.com/erlangcamp", "http://www.erlang-factory.com/conference/London2010/speakers/joearmstrong/", "http://www.erlang-factory.com/conference/London2010/speakers/RobertVirding/", "http://www.erlang-factory.com/conference/London2010/speakers/MartinOdersky/", "http://www.erlang-factory.com/", "http://erlang.org/download/otp_src_R14A.readme", "http://erlang.org/download.html", "http://www.erlang-factory.com/conference/London2010", "http://github.com/erlang/otp", "http://erlang.org/download.html", "http://erlang.org/doc/man/erl_nif.html", "http://github.com/erlang/otp", "http://erlang.org/download.html", "http://www.erlang-factory.com/conference/ErlangUserConference2009", "http://erlang.org/doc/efficiency_guide/drivers.html", "http://erlang.org/download.html", "http://erlang.org/workshop/2009/index.html", "http://groups.google.com/group/erlang-programming", "http://www.erlang.org/eeps/eep-0010.html", "http://erlang.org/download/otp_src_R13B.readme", "http://erlang.org/download.html", "http://oreilly.com/catalog/9780596518189", "http://www.erlang-factory.com", "http://www.manning.com/logan", "http://www.erlang.se/euc/08/index.html", "http://erlang.org/download/otp_src_R12B-5.readme", "http://erlang.org/download.html", "http://erlang.org/workshop/2008/index.html", "http://www.erlang-exchange.com", "http://erlang.org/doc/highlights.html", "http://www.erlang.se/euc/07/", "http://www.erlang.se/workshop/2007/", "http://erlang.org/eep.html", "http://erlang.org/download/otp_src_R11B-5.readme", "http://pragmaticprogrammer.com/titles/jaerlang/index.html", "http://erlang.org/project/test_server", "http://erlang.org/download-stats/", "http://erlang.org/user.html#smtp_client-1.0", "http://erlang.org/user.html#xmlrpc-1.13", "http://erlang.org/EPLICENSE", "http://erlang.org/project/megaco/", "http://www.erlang-consulting.com/training_fs.html", "http://erlang.org/old_news.html"] ok存储:是搜索引擎最重要的概念之一.将搜索引擎数据存储在MySQL,Oracle,MS SQL等RDBMS中是一个很大的错误.这些系统非常复杂,与它们接口的应用程序采用启发式算法.这将我们带到了 Key-Value Stores,其中我最好的两个是
Couch Base Server
和Riak
.这些都是伟大的云文件系统.另一个重要参数是缓存.使用say来实现缓存Memcached
,其中上面提到的其他两个存储系统支持它.搜索引擎的存储系统应该是schemaless DBMS
,重点是Availability rather than Consistency
.有关搜索引擎的更多信息,请访问: http: //en.wikipedia.org/wiki/Web_search_engine