Nokogiri

Scraping CP932 (Windows-31J) Content

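CP932 (Windows-31J) is the Microsoft code page variant of Shift_JIS. When a page is served in this encoding, pass "CP932" explicitly as the third argument to Nokogiri::HTML.parse.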
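The difference between plain Shift_JIS and CP932 can be checked with Ruby's standard transcoding alone. The following is a minimal sketch, not part of the original script: the helper name encodable? and the choice of U+2460 (a circled digit one, one of the vendor-extension characters) are assumptions made for illustration.

```ruby
# U+2460 (CIRCLED DIGIT ONE) is used here as a sample character that
# CP932 defines but JIS X 0208-based Shift_JIS does not.
CIRCLED_ONE = "\u2460"

# Hypothetical helper: returns true if +text+ can be transcoded
# to the encoding named +encoding_name+, false otherwise.
def encodable?(text, encoding_name)
  text.encode(encoding_name)
  true
rescue Encoding::UndefinedConversionError
  false
end

puts encodable?(CIRCLED_ONE, "Windows-31J")
puts encodable?(CIRCLED_ONE, "Shift_JIS")

# In Ruby, "CP932" is an alias for the Windows-31J encoding object.
puts Encoding.find("CP932").name
```

If a page contains such characters, naming the encoding as plain Shift_JIS is likely to produce conversion errors or garbled text, which is why the script names CP932 specifically.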

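scrape.rb
#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

url = 'https://example.com/foo.html'

# Kernel#open no longer opens URLs as of Ruby 3.0; use URI.open from open-uri.
html = URI.open(url) do |f|
  f.read
end

# Tell Nokogiri that the document is encoded in CP932.
doc = Nokogiri::HTML.parse(html, nil, "CP932")

# Print the text and target of every link, separated by a tab.
doc.xpath('//a').each do |node|
  href = node["href"]
  title = node.content
  puts "#{title}\t#{href}"
end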

To read the URL from a command-line argument instead:

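scrape.rb
#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

# The first command-line argument is the URL to scrape.
url = ARGV[0]

# Kernel#open no longer opens URLs as of Ruby 3.0; use URI.open from open-uri.
html = URI.open(url) do |f|
  f.read
end

# Tell Nokogiri that the document is encoded in CP932.
doc = Nokogiri::HTML.parse(html, nil, "CP932")

# Print the text and target of every link, separated by a tab.
doc.xpath('//a').each do |node|
  href = node["href"]
  title = node.content
  puts "#{title}\t#{href}"
end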
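Because the URL now comes from the user, it may be worth checking it before fetching. The following is a minimal sketch using only the standard uri library; the helper name parse_url_argument is hypothetical and not part of the original script, and the rule of accepting only http and https URLs is an assumption.

```ruby
require 'uri'

# Hypothetical helper: returns a URI object when the first argument
# is an http or https URL, and nil when it is missing or invalid.
def parse_url_argument(argv)
  raw = argv[0]
  return nil if raw.nil? || raw.empty?

  uri = URI.parse(raw)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  uri.is_a?(URI::HTTP) ? uri : nil
rescue URI::InvalidURIError
  nil
end
```

In the script, this could be used as `uri = parse_url_argument(ARGV) or abort "Usage: scrape.rb URL"` before the fetch.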

Run it as follows:

bundle exec ruby scrape.rb https://example.com/foo.html
tech/se/nokogiri/nokogiri.txt · Last modified: 2018/03/18 17:14 by wnoguchi