Using Python:
For example, suppose you want to scrape forex quotes in CSV form from a site like this one: fxquotes
Then...
from BeautifulSoup import BeautifulSoup
import urllib

date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)

# The quotes sit inside the page's first <pre> block; strip the
# list brackets that str() wraps around the result set.
data = str(soup.findAll('pre', limit=1))
data = data.replace('[', '')
data = data.replace(']', '')

file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
f = open(file_name, 'w')
f.write(data)
f.close()
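Hand-concatenating the query string like this works, but it is easy to drop an `&` or forget to escape a value. A minimal sketch of building the same kind of URL with the standard library's `urlencode` instead (the parameter names simply mirror the ones above; in Python 2 the function lives in `urllib` rather than `urllib.parse`):

```python
from urllib.parse import urlencode

# Parameter names mirror the hand-built query string above.
params = {
    "date_fmt": "us",
    "date": "11/10/08",
    "date1": "01/01/08",
    "exch": "USD",
    "expr": "AUD",
    "format": "CSV",
}
base = "http://www.oanda.com/convert/fxhistory"
# urlencode handles the '&' separators and percent-escapes the slashes.
url = base + "?" + urlencode(params)
```

The dates come out percent-escaped (`01%2F01%2F08`), which the hand-built string above skips; most servers accept both, but the escaped form is the safe one.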
Edit: to get values from a table, an example from: palewire
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table", border=1)
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']
    record = (rank, artist, album, cover_link)
    print "|".join(record)
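Note that mechanize and the old `BeautifulSoup` 3 module are both long unmaintained; the same row-extraction pattern works with the current `bs4` package. A small sketch of that pattern, using an inline HTML snippet (invented here) in place of the live page:

```python
from bs4 import BeautifulSoup

# Tiny inline table standing in for the page fetched above.
html = """
<table border="1">
  <tr><th>Rank</th><th>Artist</th></tr>
  <tr><td>1</td><td>Radiohead</td></tr>
  <tr><td>2</td><td>Arcade Fire</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", border="1")

records = []
for row in table.find_all("tr")[1:]:   # skip the header row
    cols = row.find_all("td")
    records.append((cols[0].string, cols[1].string))

for record in records:
    print("|".join(record))
```

The only API differences from the old code are the `bs4` import, the explicit parser argument, and `find_all` in place of `findAll`.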
Juan A. Nava.. 10
Here is my Python version using the (currently) latest version of BeautifulSoup, which can be installed with, e.g.,
$ sudo easy_install beautifulsoup4
The script reads HTML from standard input and outputs the text found in all tables in proper CSV format.
#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv

def cell_text(cell):
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read())
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        col = map(cell_text, row.find_all(re.compile('t[dh]')))
        output.writerow(col)
    output.writerow([])
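The same table-to-CSV logic is easy to wrap in a function so it can be called on an HTML string instead of stdin — a sketch (the function name `tables_to_csv` is my own):

```python
import csv
import io
import re
from bs4 import BeautifulSoup

def tables_to_csv(html):
    """Render every <table> in `html` as CSV text, one blank row between tables."""
    soup = BeautifulSoup(html, "html.parser")
    buf = io.StringIO()
    writer = csv.writer(buf)
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            # Match both <td> and <th> cells, joining nested text fragments.
            writer.writerow([" ".join(cell.stripped_strings)
                             for cell in row.find_all(re.compile("t[dh]"))])
        writer.writerow([])
    return buf.getvalue()

print(tables_to_csv("<table><tr><th>a</th><td>b c</td></tr></table>"))
```

`stripped_strings` is what makes this robust: a cell containing nested tags or stray whitespace still comes out as one clean field.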
dkretz.. 5
Even easier (because it saves the query for you for next time)...
In Excel:
Data / Import External Data / New Web Query
will take you to a URL prompt. Enter your URL, and it will delimit the available tables on the page to import. Voilà.
Select the HTML table in your tool's UI and copy it to the clipboard (if that's possible)
Paste it into Excel.
Save as a CSV file
However, this is a manual solution, not an automated one.