Oct-30-2019, 08:54 AM
I'm trying to work with a Scrapy project I found, called RISJbot (GitHub), to extract the contents of news articles for my research. However, I've run into a problem whose source I can't find, let alone fix: the spider can't find body texts (in my case the most important part: the articles themselves). I've tried two of the spiders so far: the CNN one gave mixed results, and the Washington Post one couldn't find a single body text.
It gives me this error message:
Error:
ERROR:RISJbot.pipelines.checkcontent:No bodytext: https://www.washingtonpost.com/world/europe/russia-and-cuba-rebuild-ties-that-frayed-after-cold-war/2019/10/29/d046cc0a-fa09-11e9-9e02-1d45cb3dfa8f_story.html

It also returns this error message; I'm not sure whether it is related to my problem:

Error:
ERROR:scrapy.utils.signal:Error caught on signal handler: >
Traceback (most recent call last):
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\twisted\internet\defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "C:\Users\sigalizer\Anaconda3\envs\scrapyenv\lib\site-packages\scrapy\extensions\feedexport.py", line 243, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'

When it doesn't find the body text, it falls back to generating a gzipped, Base64-encoded version of the whole page. I managed to turn this function off to check whether the page shows any sign of the part I'm looking for, and it does contain the body text (albeit a very distorted version, with all the HTML markup, but I found a couple of words). So the text is loaded, and it doesn't depend on JavaScript.

What would you recommend I do? So far I haven't found a way to fix it.
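For anyone who wants to inspect that fallback output without turning it off, here is a minimal sketch of decoding such a blob. It assumes the page was gzipped first and then Base64-encoded, as described above, and that the page is UTF-8; the helper names `decode_page` and `encode_page` are hypothetical and not part of RISJbot.

```python
import base64
import gzip


def decode_page(encoded: str) -> str:
    """Reverse a gzip + Base64 encoding and return the page as text.

    Assumption: the blob is Base64 text wrapping gzip-compressed UTF-8 HTML.
    """
    compressed = base64.b64decode(encoded)
    # errors="replace" keeps going if the page is not clean UTF-8.
    return gzip.decompress(compressed).decode("utf-8", errors="replace")


def encode_page(html: str) -> str:
    """Hypothetical counterpart, used here only to build a test blob."""
    return base64.b64encode(gzip.compress(html.encode("utf-8"))).decode("ascii")


# Round trip on a made-up page to show the decoding direction works.
sample = "<html><body><p>Example body text</p></body></html>"
blob = encode_page(sample)
print(decode_page(blob))
```

Decoding the blob this way gives the raw HTML, which can then be searched for a sentence from the article to confirm the body text really is in the response.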
Here's the spider itself (although as you can see, it imports a lot of strings from other files, that's why I shared the GitHub page with you):
# -*- coding: utf-8 -*-
from RISJbot.spiders.newssitemapspider import NewsSitemapSpider
from RISJbot.loaders import NewsLoader
# Note: mutate_selector_del_xpath is somewhat naughty. Read its docstring.
from RISJbot.utils import mutate_selector_del_xpath
from scrapy.loader.processors import Identity, TakeFirst
from scrapy.loader.processors import Join, Compose, MapCompose
import re


class WashingtonPostSpider(NewsSitemapSpider):
    name = 'washingtonpost'
    # allowed_domains = ['washingtonpost.com']
    # A list of XML sitemap files, or suitable robots.txt files with pointers.
    sitemap_urls = ['https://www.washingtonpost.com/news-sitemaps/index.xml']

    def parse_page(self, response):
        """@url http://www.washingtonpost.com/business/2019/10/25/us-deficit-hit-billion-marking-nearly-percent-increase-during-trump-era/?hpid=hp_hp-top-table-main_deficit-210pm%3Ahomepage%2Fstory-ans
        @returns items 1
        @scrapes bodytext bylines fetchtime firstpubtime headline source url
        @noscrapes modtime
        """
        s = response.selector
        # Remove any content from the tree before passing it to the loader.
        # There aren't native scrapy loader/selector methods for this.
        #mutate_selector_del_xpath(s, '//*[@style="display:none"]')

        l = NewsLoader(selector=s)

        # WaPo's ISO date/time strings are invalid: <datetime>-500 instead of
        # <datetime>-05:00. Note that the various standardised l.add_* methods
        # will generate 'Failed to parse data' log items. We've got it properly
        # here, so they aren't important.
        l.add_xpath('firstpubtime',
                    '//*[@itemprop="datePublished" or '
                    '@property="datePublished"]/@content',
                    MapCompose(self.fix_iso_date))  # CreativeWork

        # These are duplicated in the markup, so uniquise them.
        l.add_xpath('bylines',
                    '//div[@itemprop="author-names"]/span/text()',
                    set)
        l.add_xpath('section',
                    '//*[contains(@class, "headline-kicker")]//text()')

        # Add a number of items of data that should be standardised across
        # providers. Can override these (for TakeFirst() fields) by making
        # l.add_* calls above this line, or supplement gaps by making them
        # below.
        l.add_fromresponse(response)
        l.add_htmlmeta()
        l.add_schemaorg(response)
        l.add_opengraph()
        l.add_scrapymeta(response)

        return l.load_item()

    def fix_iso_date(self, s):
        return re.sub(r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
                      '([0-9])([0-9]{2})$',
                      r'\g<1>0\g<2>:\g<3>',
                      s)
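
To show what the date fix in the spider does, here is a standalone sketch of the same substitution as a plain function, so it can be run without Scrapy. The regular expression is copied from fix_iso_date above; the sample timestamps are made up for illustration and are not taken from the Washington Post markup.

```python
import re

# Same pattern as in fix_iso_date: date, 'T', HH:MM, a sign, then a
# three-digit offset (one hour digit followed by two minute digits).
_SHORT_OFFSET = re.compile(
    r'^([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}[+-])'
    r'([0-9])([0-9]{2})$'
)


def fix_iso_date(s):
    """Pad a three-digit UTC offset such as -500 to the form -05:00."""
    return _SHORT_OFFSET.sub(r'\g<1>0\g<2>:\g<3>', s)


# Illustrative inputs only (hypothetical timestamps).
print(fix_iso_date('2019-10-29T08:54-500'))    # 2019-10-29T08:54-05:00
print(fix_iso_date('2019-10-29T08:54-05:00'))  # already valid, left unchanged
```

Strings that don't match the pattern, including ones that carry seconds, are returned as they are, so this function only affects the date field and has no bearing on the missing body text.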
