
Large public datasets?


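I am looking for some large public datasets, in particular:

    Large sample web server logs that have been anonymized.

    Datasets used for database performance benchmarking.

Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets: http://aws.amazon.com/publicdatasets/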



1> MrGomez..:

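1. Large sample web server logs that have been anonymized.

These work as a start:

UCI Machine Learning Repository

Anonymous Microsoft Web Data

MSNBC.com Anonymous Web Data

Syskill and Webert Web Page Ratings

There are more datasets available than these (see the other answers for the full list), but these are the lowest-hanging fruit that fit your original criteria. As a bonus, they provide contact links in case you have specific needs they might know about.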

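2. Datasets used for database performance benchmarking.

This sounds like a bit of a misnomer, since you are asking for empirical datasets that describe well-defined algorithmic problems. Specifically, it sounds like you are trying to find datasets of well-defined, normalized relational data that can be used to test and benchmark various database systems in real time, serving as a set of test cases to determine the most efficient solution for your needs.

I disagree with this approach. Rather than finding a laundry list of database systems and their fixed implementations, it is better to make the algorithmic guarantees of those systems your first port of call. Once you have determined the algorithmic constraints that meet your needs, you can investigate a fixed set of solutions and benchmark their efficiency at operations such as indexing, sorting, searching, insertion, deletion, and retrieval.

Wikipedia has a concise article on database testing concepts that you can use to identify and write test cases for benchmarking performance. For example, you can use an agnostic data access interface (such as JDBC and JDBC Benchmark) to determine the relative time of each operation. From there, you can home in on the right solution.

In short, first research the guarantees that each database makes. Once you have a set of candidate solutions, you can choose among them by testing (or otherwise determining) the constant-time performance of each operation you need.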
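As a rough illustration of timing individual operations through a generic data access interface, here is a minimal sketch. It uses Python's built-in `sqlite3` module rather than JDBC, and the table name `hits`, its columns, and the synthetic rows are all assumptions made up for the example. It shows the shape of a per-operation benchmark, not a rigorous methodology.

```python
import sqlite3
import time


def time_operation(label, func, timings):
    """Run func once and record its wall-clock duration (seconds) under label."""
    start = time.perf_counter()
    value = func()
    timings[label] = time.perf_counter() - start
    return value


def run_benchmark(row_count=50_000):
    """Time insert, search, sort, and delete on an in-memory SQLite table.

    The schema and data are hypothetical stand-ins for a real dataset.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (id INTEGER PRIMARY KEY, url TEXT, status INTEGER)"
    )
    # Synthetic rows: 1000 distinct URLs, every tenth row has status 404.
    rows = [
        (i, f"/page/{i % 1000}", 404 if i % 10 == 0 else 200)
        for i in range(row_count)
    ]
    timings = {}

    def insert():
        conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)
        conn.commit()

    def search():
        query = "SELECT COUNT(*) FROM hits WHERE url = ?"
        return conn.execute(query, ("/page/42",)).fetchone()[0]

    def create_index():
        conn.execute("CREATE INDEX idx_hits_url ON hits (url)")

    def sort():
        return conn.execute("SELECT id FROM hits ORDER BY url").fetchall()

    def delete():
        conn.execute("DELETE FROM hits WHERE status = 404")
        conn.commit()

    time_operation("insert", insert, timings)
    # Same query before and after indexing, so the two timings are comparable.
    unindexed = time_operation("search_no_index", search, timings)
    time_operation("create_index", create_index, timings)
    indexed = time_operation("search_indexed", search, timings)
    time_operation("sort", sort, timings)
    time_operation("delete", delete, timings)

    remaining = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
    conn.close()
    return timings, unindexed, indexed, remaining


if __name__ == "__main__":
    timings, _, _, _ = run_benchmark()
    for label, seconds in timings.items():
        print(f"{label:>16}: {seconds * 1000:.2f} ms")
```

Single-run timings like these are noisy. For a real comparison you would repeat each operation several times, run the same script against each candidate database through its own driver, and use data that resembles your actual workload.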



2> caesar0301..:

Based on Quora answers and collections from my own studies, I created the awesome-public-datasets repository on GitHub, where it is actively updated.

Below is a snapshot of the list. For the latest version, please visit GitHub.

This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.

Climate

Australian Weather: http://www.bom.gov.au/climate/dwo/

Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/

Global climate data since 1929: http://www.tutiempo.net/en/Climate

NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/

NOAA Climate Datasets: http://ncdc.noaa.gov/data-access/quick-links

WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html

Economics

American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete

EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html

Internet Product Code Database: http://www.upcdatabase.com/

World Bank: http://data.worldbank.org/indicator

Finance

CBOE Futures Exchange: http://cfe.cboe.com/Data/

Google Finance: https://www.google.com/finance

Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0

NASDAQ: https://data.nasdaq.com/

OANDA: http://www.oanda.com/

OSU Financial data: http://fisher.osu.edu/fin/osudata.htm

Quandl: http://www.quandl.com/

St Louis Federal: http://research.stlouisfed.org/fred2/

Yahoo Finance: http://finance.yahoo.com/

Biology

CRCNS: http://crcns.org/data-sets

Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/

Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php

MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi

NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/

Protein structure: http://www.infobiotic.net/PSPbenchmarks/

Public Gene Data: http://www.pubgene.org/

Stanford Microarray Data: http://smd.stanford.edu/

UniGene: http://www.ncbi.nlm.nih.gov/unigene

Physics

NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html

Healthcare

EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm

Gapminder: http://www.gapminder.org/data/

Medicare Data File: http://go.cms.gov/19xxPN4

GeoSpace

EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse

Factual Global Location Data: http://www.factual.com/

Geo Spatial Data: http://geodacenter.asu.edu/datalist/

Transportation

Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html

Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations

Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems

Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229

Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/

NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013

OpenFlights (airport, airline and route data): http://openflights.org/data.html

RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120

RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp

Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds

U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm

Government

Archive-it: https://www.archive-it.org/explore?show=Collections

Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument

Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1

Chicago: https://data.cityofchicago.org/

FDA: https://open.fda.gov/index.html

Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi

Guardian world governments: http://www.guardian.co.uk/world-government-data

HUD: http://www.huduser.org/portal/datasets/pdrdatas.html

London Datastore, U.K: http://data.london.gov.uk/dataset

New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx

NYC betanyc: http://betanyc.us/

NYC Open Data: http://nycplatform.socrata.com/

OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html

RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp

San Francisco Data sets: http://datasf.org/

The World Bank: http://wdronline.worldbank.org/

U.K. Government Data: http://data.gov.uk/data

U.S. Census Bureau: http://www.census.gov/data.html

U.S. Federal Government Agencies: http://www.data.gov/metric

U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset

U.S. Open Government: http://www.data.gov/open-gov/

UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/

United Nations: http://data.un.org/

US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm

Data Challenges

Challenges in Machine Learning: http://www.chalearn.org/

ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/

Kaggle Competition Data: http://www.kaggle.com/

KDD Cup by Tencent 2012: https://www.kddcup2012.org/

Netflix Prize: http://www.netflixprize.com/leaderboard

Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge

Machine Learning

eBay Online Auctions: http://www.modelingonlineauctions.com/datasets

IMDb database: http://www.imdb.com/interfaces

Keel Repository: http://sci2s.ugr.es/keel/datasets.php

Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action

Machine Learning Data Set Repository: http://mldata.org/

Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset

More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets

MovieLens Data Sets: http://datahub.io/dataset/movielens

RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data

Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized

SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/

UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/

University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html

Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Natural Language

40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list

ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/

ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/

Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html

Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670

Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13

Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs

Hansards: http://www.isi.edu/natural-language/download/hansard/

Machine Translation: http://statmt.org/wmt11/translation-task.html#download

SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

WordNet: http://wordnet.princeton.edu/wordnet/download/

Image Processing

2GB of photos of cats: http://bit.do/UJZZ

Face Recognition Benchmark: http://www.face-rec.org/databases/

ImageNet: http://www.image-net.org/

Time Series

Time Series data Library: https://datamarket.com/data/list/?q=provider:tsdl

UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/

Social Sciences

China Hotel Checkin/out data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml

CMU Enron Email: http://www.cs.cmu.edu/~enron/

Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php

Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix

Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html

Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn

General Social Survey (GSS): http://www3.norc.org/GSS+Website/

GetGlue (users rating TV shows): http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz

GitHub Archive: http://www.githubarchive.org/

ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp

Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks

PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/

Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/

SourceForge Graph: http://www.nd.edu/~oss/Data/data.html

Titanic Survival Data Set: https://github.com/caesar0301/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip

Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html

UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/

UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm

UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php

Universities Worldwide: http://univ.cc/

UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html

Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g

Youtube Graph (2007,2008): http://netsg.cs.sfu.ca/youtubedata/

Complex Networks

CrossRef DOI URLs: https://archive.org/details/doi-urls

DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP

NBER Patent Citations: http://nber.org/patents/

NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html

Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm

PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/

3> Gene De Lisa..:


Here are a few. Have fun.

http://archive.ics.uci.edu/ml/

http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1

http://crawdad.org/

http://data.austintexas.gov

http://data.cityofchicago.org

http://data.govloop.com

http://data.gov.uk/

http://data.medicare.gov

http://data.seattle.gov

http://data.sfgov.org

http://data.sunlightlabs.com

https://datamarket.azure.com/

http://ftp.ncbi.nih.gov/

http://gettingpastgo.socrata.com

http://books.google.com/ngrams/

http://linkeddata.org/

http://medihal.archives-ouvertes.fr

http://public.resource.org/

http://rechercheisidore.fr

http://reddit.com/r/datasets

http://timetric.com/public-data/

http://www2.jpl.nasa.gov/srtm

http://www.bls.gov/

http://www.crunchbase.com/

http://www.dartmouthatlas.org/

http://www.data.gov/

http://www.datakc.org

http://www.factual.com/

http://www.freebase.com/

http://www.infochimps.com

http://www.kaggle.com/

http://build.kiva.org/

http://www.imdb.com/interfaces

http://dbpedia.org



4> Jason S..:

Just an idea:

USGS geographic names database

USDA plants checklist

Any of the many state GIS repositories, e.g. NH's GRANIT
