
Is it possible to develop a powerful web search engine using Erlang, Mnesia and Yaws?

Author: Life一切安好 | 2023-09-04 10:32


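I am thinking of developing a web search engine using Erlang, Mnesia and Yaws. Is it possible to build a powerful, very fast web search engine with this software? What would it take, and how do I get started?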

1> Muzaaya Josh..：

Erlang is a strong choice for building a powerful web crawler today. Let me walk you through my simple crawler.

Step 1. I create a simple parallelism module, which I call mapreduce

-module(mapreduce).
-export([compute/2]).
%%=====================================================================
%% usage example
%% Module = string
%% Function = tokens
%% List_of_arg_lists = [["file\r\nfile","\r\n"],["muzaaya_joshua","_"]]
%% Ans = [["file","file"],["muzaaya","joshua"]]
%% Job being done by two processes
%% i.e no. of processes spawned = length(List_of_arg_lists)

compute({Module,Function},List_of_arg_lists)->
    S = self(),
    Ref = erlang:make_ref(),
    PJob = fun(Arg_list) -> erlang:apply(Module,Function,Arg_list) end,
    Spawn_job = fun(Arg_list) -> 
                    spawn(fun() -> execute(S,Ref,PJob,Arg_list) end)
                end,
    lists:foreach(Spawn_job,List_of_arg_lists),
    gather(length(List_of_arg_lists),Ref,[]).
   
gather(0, _, L) -> L;
gather(N, Ref, L) ->
    receive
        {Ref,{'EXIT',_}} -> gather(N-1,Ref,L);
        {Ref, Result} -> gather(N-1, Ref, [Result|L])
    end.
    
execute(Parent,Ref,Fun,Arg)->
    Parent ! {Ref,(catch Fun(Arg))}.

Step 2. The HTTP client

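People usually use the inets httpc module that ships with Erlang, or ibrowse. For memory management and speed (keeping the memory footprint as low as possible), however, a good Erlang programmer might choose curl. Running the curl command line through os:cmd/1 brings its output straight into the calling Erlang function. Better still, have curl write its output to files and let another process in the application read and parse those files.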

Command: curl "http://www.erlang.org" -o "/downloaded_sites/erlang/file1.html"

In Erlang

os:cmd("curl \"http://www.erlang.org\" -o \"/downloaded_sites/erlang/file1.html\"").

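So you can spawn many processes. Remember to escape the URL and the output file path when you run that command. On the other side, a separate process watches the directory of downloaded pages. It reads and parses those pages, and after parsing it can delete them, save them somewhere else or, better still, archive them with the zip module.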
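The escaping mentioned above could look roughly like the sketch below. The module name fetch_cmd and its three functions are hypothetical, not part of the original crawler, and the sketch assumes a POSIX shell, where a string wrapped in single quotes is passed through literally and an embedded single quote is written as '\''. The -s and -o flags are standard curl options.

```erlang
-module(fetch_cmd).
-export([shell_quote/1, curl_command/2, fetch/2]).

%% Wrap a string in single quotes for a POSIX shell. A single quote
%% inside the string becomes '\'' (close quote, escaped quote, reopen).
shell_quote(Str) ->
    "'" ++ lists:flatmap(fun($') -> "'\\''";
                            (C)  -> [C]
                         end, Str) ++ "'".

%% Build the curl command line; -s silences the progress meter
%% and -o names the output file.
curl_command(Url, OutFile) ->
    "curl -s " ++ shell_quote(Url) ++ " -o " ++ shell_quote(OutFile).

%% Run the download in its own process so many fetches can overlap.
fetch(Url, OutFile) ->
    spawn(fun() -> os:cmd(curl_command(Url, OutFile)) end).
```

Quoting both arguments this way keeps characters such as & or ; in a crawled URL from being interpreted by the shell.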

folder_check()->
    spawn(fun() -> check_and_report() end),
    ok.

-define(CHECK_INTERVAL,5).

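check_and_report()->
    %% avoid file:list_dir/1 here: with very many
    %% files the returned list costs a lot of memory
    %% (ls runs in the current working directory)
    case string:trim(os:cmd("ls | wc -l")) of
        "0" -> ok;
        %% new_files_found/0 must be exported for this call
        _ -> ?MODULE:new_files_found()
    end,
    timer:sleep(timer:seconds(?CHECK_INTERVAL)),
    %% keep checking
    check_and_report().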

new_files_found()->
    %% inform our parser to pick files
    %% once it parses a file, it has to 
    %% delete it or save it some
    %% where else
    gen_server:cast(?MODULE,files_detected).

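Step 3. The HTML parser.
Better to use mochiweb's HTML parser together with XPath. These help you parse a page, pick out whichever HTML tags you like and extract their contents, and then you are good to go. In the examples below I focus only on the keywords, description and title in the markup.

Testing the module in the shell... awesome results!!!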

2> spider_bot:parse_url("http://erlang.org").
[[[],[],
  {"keywords",
   "erlang, functional, programming, fault-tolerant, distributed, multi-platform, portable, software, multi-core, smp, concurrency "},
  {"description","open-source erlang official website"}],
 {title,"erlang programming language, official website"}]

3> spider_bot:parse_url("http://facebook.com").
[[{"description",
   " facebook is a social utility that connects people with friends and others who work, study and live around them. people use facebook to keep up with friends, upload an unlimited number of photos, post links and videos, and learn more about the people they meet."},
  {"robots","noodp,noydir"},
    [],[],[],[]],
 {title,"incompatible browser | facebook"}]

4> spider_bot:parse_url("http://python.org").
[[{"description",
   "      home page for python, an interpreted, interactive, object-oriented, extensible\n      programming language. it provides an extraordinary combination of clarity and\n      versatility, and is free and comprehensively ported."},
  {"keywords",
   "python programming language object oriented web free source"},
  []],
 {title,"python programming language – official website"}]

5> spider_bot:parse_url("http://www.house.gov/").
[[[],[],[],
  {"description",
   "home page of the united states house of representatives"},
  {"description",
   "home page of the united states house of representatives"},
  [],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
  [],[],[]|...],
 {title,"united states house of representatives, 111th congress, 2nd session"}]

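You can now see that we can index pages against their keywords and pair that with a good schedule of page revisits. Another challenge is how to make a crawler (something that moves around the entire web, from domain to domain), but that one is easy. It can be done by parsing each HTML file for its href attributes. Make the HTML parser extract every href, and then you may need a few regular expressions to pick out the links that fall under a given domain.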
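As a rough illustration of that last step, the sketch below pulls href values out of raw HTML with the standard re module and keeps only the absolute links under one domain. It is not the spider_connect code used in this answer; the module name link_filter and both functions are hypothetical, and a regular expression over raw HTML is a simplification compared with walking the tree produced by a real HTML parser.

```erlang
-module(link_filter).
-export([extract_links/1, same_domain/2]).

%% Return every quoted href value found in the HTML, in document order.
extract_links(Html) ->
    Pattern = "href\\s*=\\s*[\"']([^\"']+)[\"']",
    case re:run(Html, Pattern, [global, caseless, {capture, [1], list}]) of
        {match, Matches} -> [Link || [Link] <- Matches];
        nomatch -> []
    end.

%% Keep only absolute http(s) links whose host is Domain or one of
%% its subdomains. Dots in Domain are escaped so they match literally.
same_domain(Links, Domain) ->
    Escaped = re:replace(Domain, "\\.", "\\\\.", [global, {return, list}]),
    Pattern = "^https?://([^/]*\\.)?" ++ Escaped ++ "([/:?#]|$)",
    [L || L <- Links, re:run(L, Pattern, [caseless]) =/= nomatch].
```

Relative links are dropped by same_domain/2 in this sketch; a fuller crawler would resolve them against the page URL first.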

Running the crawler

7> spider_connect:conn2("http://erlang.org").        

        Links: ["http://www.erlang.org/index.html",
                "http://www.erlang.org/rss.xml",
                "http://erlang.org/index.html","http://erlang.org/about.html",
                "http://erlang.org/download.html",
                "http://erlang.org/links.html","http://erlang.org/faq.html",
                "http://erlang.org/eep.html",
                "http://erlang.org/starting.html",
                "http://erlang.org/doc.html",
                "http://erlang.org/examples.html",
                "http://erlang.org/user.html",
                "http://erlang.org/mirrors.html",
                "http://www.pragprog.com/titles/jaerlang/programming-erlang",
                "http://oreilly.com/catalog/9780596518189",
                "http://erlang.org/download.html",
                "http://www.erlang-factory.com/conference/ErlangUserConference2010/speakers",
                "http://erlang.org/download/otp_src_R14B.readme",
                "http://erlang.org/download.html",
                "https://www.erlang-factory.com/conference/ErlangUserConference2010/register",
                "http://www.erlang-factory.com/conference/ErlangUserConference2010/submit_talk",
                "http://www.erlang.org/workshop/2010/",
                "http://erlangcamp.com","http://manning.com/logan",
                "http://erlangcamp.com","http://twitter.com/erlangcamp",
                "http://www.erlang-factory.com/conference/London2010/speakers/joearmstrong/",
                "http://www.erlang-factory.com/conference/London2010/speakers/RobertVirding/",
                "http://www.erlang-factory.com/conference/London2010/speakers/MartinOdersky/",
                "http://www.erlang-factory.com/",
                "http://erlang.org/download/otp_src_R14A.readme",
                "http://erlang.org/download.html",
                "http://www.erlang-factory.com/conference/London2010",
                "http://github.com/erlang/otp",
                "http://erlang.org/download.html",
                "http://erlang.org/doc/man/erl_nif.html",
                "http://github.com/erlang/otp",
                "http://erlang.org/download.html",
                "http://www.erlang-factory.com/conference/ErlangUserConference2009",
                "http://erlang.org/doc/efficiency_guide/drivers.html",
                "http://erlang.org/download.html",
                "http://erlang.org/workshop/2009/index.html",
                "http://groups.google.com/group/erlang-programming",
                "http://www.erlang.org/eeps/eep-0010.html",
                "http://erlang.org/download/otp_src_R13B.readme",
                "http://erlang.org/download.html",
                "http://oreilly.com/catalog/9780596518189",
                "http://www.erlang-factory.com",
                "http://www.manning.com/logan",
                "http://www.erlang.se/euc/08/index.html",
                "http://erlang.org/download/otp_src_R12B-5.readme",
                "http://erlang.org/download.html",
                "http://erlang.org/workshop/2008/index.html",
                "http://www.erlang-exchange.com",
                "http://erlang.org/doc/highlights.html",
                "http://www.erlang.se/euc/07/",
                "http://www.erlang.se/workshop/2007/",
                "http://erlang.org/eep.html",
                "http://erlang.org/download/otp_src_R11B-5.readme",
                "http://pragmaticprogrammer.com/titles/jaerlang/index.html",
                "http://erlang.org/project/test_server",
                "http://erlang.org/download-stats/",
                "http://erlang.org/user.html#smtp_client-1.0",
                "http://erlang.org/user.html#xmlrpc-1.13",
                "http://erlang.org/EPLICENSE",
                "http://erlang.org/project/megaco/",
                "http://www.erlang-consulting.com/training_fs.html",
                "http://erlang.org/old_news.html"]
ok

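Storage: this is one of the most important concepts for a search engine. In my view it is a big mistake to store search engine data in an RDBMS such as MySQL, Oracle or MS SQL. Such systems are very complex, and the applications that interface with them employ heuristic algorithms. This brings us to key-value stores, of which my two favourites are Couch Base Server and Riak. These are great cloud file systems. Another important parameter is caching. Caching can be done with, say, Memcached, which both storage systems mentioned above support. The storage system for a search engine should be a schemaless DBMS that focuses on availability rather than consistency. Read more about search engines here: http://en.wikipedia.org/wiki/Web_search_engine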
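To make the key-value idea concrete, here is a minimal sketch of a keyword index in which each keyword is a key and the value is the list of URLs that declared it. An in-memory Erlang map stands in for the store; nothing here is the Riak or Couch Base API. The module name keyword_index and its functions are hypothetical, and the sketch assumes the keywords arrive as one comma-separated string, as in the erlang.org result above.

```erlang
-module(keyword_index).
-export([new/0, add_page/3, lookup/2]).

%% An empty index: a map from keyword to a list of URLs.
new() -> #{}.

%% Split the comma-separated keywords, trim and lowercase each one,
%% and record Url under every non-empty keyword.
add_page(Url, Keywords, Index) ->
    Words = [string:lowercase(string:trim(W))
             || W <- string:split(Keywords, ",", all)],
    lists:foldl(fun("", Acc) -> Acc;
                   (W, Acc) ->
                        maps:update_with(W,
                                         fun(Urls) -> [Url | Urls] end,
                                         [Url],
                                         Acc)
                end, Index, Words).

%% Case-insensitive lookup; an unknown keyword yields an empty list.
lookup(Keyword, Index) ->
    maps:get(string:lowercase(Keyword), Index, []).
```

Each lookup touches a single key, which is the access pattern a key-value store handles well without needing a schema or joins.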
